At Verta, we ran our first ModelDB 2.0 webinar last week, and it was a lot of fun. This blog post recaps the hands-on tutorial portion of the webinar. For the full content, check out the webinar recording on the Verta YouTube channel and the slides on SlideShare.
ModelDB is an open-source model versioning, metadata, and experiment management system originally developed at MIT’s Computer Science Department and now maintained by Verta.ai.
ModelDB 2.0, the latest release of ModelDB, brings git-like versioning to ML models (and all types of analytical products). By versioning the ingredients of an ML model, namely its code, data, config, and environment, ModelDB captures all the components necessary to reproduce that model. Best of all, ModelDB requires no change to your data science/ML workflow. It can be integrated into a Jupyter notebook or used from the command line.
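To make the idea of versioning a model's ingredients concrete, here is a minimal sketch that uses only the Python standard library. It is not ModelDB's API or storage format; the `fingerprint` and `model_version` helpers and all sample values are hypothetical, chosen only to show how hashing code, data, config, and environment separately lets you tell which ingredient changed between two versions.

```python
import hashlib
import json


def fingerprint(blob: bytes) -> str:
    """Return a short content hash, in the spirit of a git object ID."""
    return hashlib.sha256(blob).hexdigest()[:12]


def model_version(code: str, data: bytes, config: dict, environment: dict) -> dict:
    """Hash each ingredient, then hash the combination into one version ID.

    Hypothetical helper for illustration; not part of ModelDB.
    """
    parts = {
        "code": fingerprint(code.encode("utf-8")),
        "data": fingerprint(data),
        # sort_keys makes the hash independent of dict insertion order
        "config": fingerprint(json.dumps(config, sort_keys=True).encode("utf-8")),
        "environment": fingerprint(
            json.dumps(environment, sort_keys=True).encode("utf-8")
        ),
    }
    combined = json.dumps(parts, sort_keys=True).encode("utf-8")
    return {"id": fingerprint(combined), "parts": parts}


# Two versions that differ only in one hyperparameter (made-up values).
shared = dict(
    code="model = LogisticRegression(C=config['C'])",
    data=b"age,education,income\n39,Bachelors,0\n",
    environment={"python": "3.8", "scikit-learn": "0.22"},
)
v1 = model_version(config={"C": 1.0, "max_iter": 100}, **shared)
v2 = model_version(config={"C": 0.1, "max_iter": 100}, **shared)

# Compare per-ingredient hashes to see what changed between versions.
changed = [name for name in v1["parts"] if v1["parts"][name] != v2["parts"][name]]
print(changed)  # prints ['config']
```

Because each ingredient gets its own hash, the comparison pinpoints the config as the only difference, while the combined ID still changes whenever any ingredient does.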
Read more about the architecture of ModelDB and the motivation behind its design here.
For our first ModelDB walkthrough, we used a simple tabular data example: the census income dataset, lightly preprocessed to make modeling easy. The task is to build a model that predicts a person’s income level (<$50K or >$50K) from demographic information such as age, sex, education, and zip code.
We will build a few simple models and focus on making all of our models reproducible by versioning them with ModelDB.
For this tutorial, we put together a simple notebook that builds different models on the census income dataset. Let’s go over the key parts of this notebook.
We then pick the best model. That’s it — simple and straightforward.
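As a rough illustration of the "build several models, then pick the best" loop, here is a self-contained sketch. It is not the notebook from the webinar: the data is synthetic, the "model" is a one-feature threshold rule rather than a real classifier, and every name in it (`make_rows`, `accuracy`, `runs`) is hypothetical. What it shows is the shape of the workflow: each candidate is recorded together with its hyperparameters and validation accuracy, and the best run is chosen from that record.

```python
import random

random.seed(0)  # fixed seed so the sweep is repeatable


def make_rows(n):
    """Synthetic stand-in for census rows: (years_of_education, high_income)."""
    rows = []
    for _ in range(n):
        years = random.randint(6, 20)
        # Noisy rule: more education makes the high-income label more likely.
        label = 1 if years + random.gauss(0, 2) > 13 else 0
        rows.append((years, label))
    return rows


def accuracy(threshold, rows):
    """Score a one-feature rule: predict high income when years > threshold."""
    correct = sum(1 for years, label in rows if (years > threshold) == bool(label))
    return correct / len(rows)


valid = make_rows(200)

# One record per candidate, so the selection below is traceable afterwards.
runs = []
for threshold in [8, 10, 12, 14, 16]:
    runs.append({
        "hyperparameters": {"threshold": threshold},
        "val_acc": accuracy(threshold, valid),
    })

# Pick the candidate with the highest validation accuracy.
best = max(runs, key=lambda run: run["val_acc"])
print(best["hyperparameters"], round(best["val_acc"], 3))
```

Keeping every run in a list, not just the winner, is the part that matters for reproducibility: it is the in-memory version of the experiment record that a model versioning system would persist.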
Now, if you’re like most data scientists (including myself!), you aren’t versioning your models beyond giving your notebooks awkward names or, at most, keeping a spreadsheet of experiment details.
But this solution is inadequate when building hundreds of models, answering regulatory questions, or sharing work. Here’s where a purpose-built model versioning system like ModelDB really shines.
ModelDB is available under the Apache 2.0 open-source license and can be set up in a variety of ways, as described here and on the ModelDB Git repo.
The easiest way to set up ModelDB is to run:
With those few lines of code, your models are now reproducible. You can go back to any version of your model, from any point in time, and re-create it. You can see the changes you made, collaborate with friends, and share your models.
View the different iterations that your model went through.
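To give a feel for what viewing iterations can look like, here is a small sketch that diffs the config between consecutive versions of a model. It is plain Python, not ModelDB's UI or API; the `diff_config` helper, the `history` list, and all values in it are made up for illustration.

```python
def diff_config(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key that differs.

    A key missing on one side shows up with None for that side.
    """
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical history of three iterations of the same model.
history = [
    {"message": "baseline", "config": {"C": 1.0, "max_iter": 100}},
    {"message": "stronger regularization", "config": {"C": 0.1, "max_iter": 100}},
    {"message": "more iterations",
     "config": {"C": 0.1, "max_iter": 500, "solver": "lbfgs"}},
]

# Diff each iteration against the one before it.
changes = [
    diff_config(prev["config"], curr["config"])
    for prev, curr in zip(history, history[1:])
]
for entry, change in zip(history[1:], changes):
    print(entry["message"], change)
# prints:
# stronger regularization {'C': (1.0, 0.1)}
# more iterations {'max_iter': (100, 500), 'solver': (None, 'lbfgs')}
```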
Finally, share your work with colleagues and have them be able to fully reproduce the model!
This was a quick peek into the functionality provided by ModelDB 2.0 and how it can be used to make models (or analyses) reproducible. Want to find out more?
Manasi Vartak is the founder and CEO of Verta, an MIT spinoff building MLOps software for the entire model lifecycle. Verta grew out of Manasi’s Ph.D. work at MIT on ModelDB, the first open-source model management system, which is widely used in research labs and Fortune 500 companies. Previously, Manasi worked on deep learning for content recommendation at Twitter and dynamic ad-targeting at Google. Manasi has spoken at several top research and industry conferences, such as the O’Reilly AI Conference, SIGMOD, VLDB, Spark + AI Summit, and AnacondaCON, and has authored a course on model management.
Verta builds open-core MLOps software for the full model lifecycle. This spans model versioning, model deployment and release, and model monitoring, all tied together with enterprise governance features. We are a spin-out of MIT CSAIL, where we built ModelDB, one of the first open-source model management systems.