Machine Learning Metadata (MLMD): A Library to Track the Full Lineage of Machine Learning Workflows

Version control keeps track of the changes made to software code. Similarly, when building machine learning (ML) systems, it is essential to track things such as the datasets used to train the model, the hyperparameters and pipeline used, the version of TensorFlow used to create the model, and much more.

The history and lineage of ML artifacts are far more complicated than a simple, linear log. Git can track the code to some extent, but something more is needed to track models, datasets, and other artifacts. The complexity of ML code and artifacts calls for an approach similar to version control.

To address this, researchers have introduced Machine Learning Metadata (MLMD), a standalone library that tracks the full lineage of an entire ML workflow, from data ingestion and preprocessing through validation, training, evaluation, and deployment. MLMD also comes integrated with TensorFlow Extended (TFX).

Beyond versioning a model, ML Metadata captures the full lineage of the training process, including the dataset, hyperparameters, and software dependencies. An ML engineer can use MLMD to trace a bad model back to its dataset, or to trace from a bad dataset to every model trained on it. Those working on ML infrastructure can use MLMD to record the current state of a pipeline and enable event-based orchestration. It also allows optimizations such as memoizing pipeline steps, that is, skipping a step when its inputs and code are unchanged. MLMD can be integrated into the training system so that logs are created automatically for later querying. This auto-logging of the full training lineage is the best way to use MLMD, as it captures the complete history without extra effort.

MLMD is a crucial foundation for multiple internal MLOps solutions at Google. Furthermore, Google Cloud integrates tools like MLMD into its core MLOps platform:

The foundation of all these services is the ML Metadata Management service in the AI Platform, which allows AI teams to track all the important artifacts and experiments they run, providing a curated ledger of actions and detailed model lineage. This helps users determine the provenance of any trained AI model for debugging, audit, or collaboration. AI Platform Pipelines will track artifacts and lineage automatically, and AI teams can also use the ML Metadata service directly for custom workloads and for artifact and metadata tracking.
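
Determining model provenance in this way amounts to walking a graph of artifacts linked by the pipeline steps that produced them. The following is a minimal sketch in plain Python, not the MLMD or AI Platform API: the `LineageGraph` class and the artifact names are hypothetical and chosen only to show tracing in both directions, from a model back to its data and from a dataset forward to the models trained on it.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: artifacts linked through recorded executions."""

    def __init__(self):
        self.executions = []              # ledger of (step, inputs, outputs)
        self.parents = defaultdict(set)   # artifact -> artifacts it came from
        self.children = defaultdict(set)  # artifact -> artifacts derived from it

    def record_execution(self, step, inputs, outputs):
        """Record that one pipeline step read `inputs` and wrote `outputs`."""
        self.executions.append((step, tuple(inputs), tuple(outputs)))
        for out in outputs:
            for inp in inputs:
                self.parents[out].add(inp)
                self.children[inp].add(out)

    @staticmethod
    def _walk(start, edges):
        # Depth-first traversal collecting everything reachable from `start`.
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def upstream(self, artifact):
        """Everything the artifact was (transitively) derived from."""
        return self._walk(artifact, self.parents)

    def downstream(self, artifact):
        """Everything (transitively) derived from the artifact."""
        return self._walk(artifact, self.children)


graph = LineageGraph()
graph.record_execution("preprocess", ["raw_data_v1"], ["train_set_v1"])
graph.record_execution("train", ["train_set_v1", "hparams_a"], ["model_a"])
graph.record_execution("train", ["train_set_v1", "hparams_b"], ["model_b"])
graph.record_execution("evaluate", ["model_a"], ["eval_report_a"])

# Trace a model back to everything that went into it.
print(sorted(graph.upstream("model_a")))
# ['hparams_a', 'raw_data_v1', 'train_set_v1']

# Trace a suspect dataset forward to every model trained on it.
affected = sorted(a for a in graph.downstream("raw_data_v1") if a.startswith("model_"))
print(affected)
# ['model_a', 'model_b']
```

Because every step is logged as it runs, both questions are answered by a graph traversal rather than by reconstructing history from memory or file names.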




Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields, and she is passionate about exploring new advancements in technology and their real-life applications.
