The Verta team recently attended the Data+AI Summit 2022, hosted by Databricks at the Moscone Center in San Francisco. It was great to meet again with peers in the AI/ML community and to see the energy and excitement around machine learning.
My previous post on the conference covered three key takeaways that emerged from the buzz our team heard at the event: operationalization, model catalogs and an “EHR for models.” In this post, I’d like to highlight the topic of our talk on orchestrating ML deployments with Jenkins, co-presented by Conrado Miranda, Verta’s CTO, and Liam Newman, senior software engineer at Verta.
The underlying concept here is that models are essentially software. The way you build models differs from how you build other software, of course, but at the end of the day, a model is just another software component. That being the case, you have to apply practices to deploying models that are similar to those you would use for other applications.
But there are nuances. ML is like software, but applying CI/CD to ML brings its own challenges. The data scientists who build the models most likely never learned how to deploy software or how to use version control, simply because it was never part of their curriculum. Data scientists generally just don’t know what happens in that part of the stack. In addition, ML tooling is still evolving; my own belief is that ML is where software was maybe 20 years ago in this respect.
Put Controls at the Transition from Research to Production
Liam, who has a rich background in CI/CD, provided an overview of CI/CD practices. We then discussed several examples of applying CI/CD pipelines to ML models and showed how to convert a software delivery pipeline into a machine learning delivery pipeline.
The presentation prompted a number of follow-up questions, including several about best practices for how and where to put controls into the process. The issue here is that, because data science is typically very ad hoc, you want to give your data scientists the leeway to experiment in the way that best fits their workflow. But at some point, you need to bring the model to production. So the question arises: at what point do you put in the controls? When they’re developing the models, when they’re taking them to production, or somewhere else in the process?
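As a rough illustration of what a control at the research-to-production transition might look like, here is a minimal Python sketch of a promotion gate that a CI/CD job could run when a model is submitted for deployment. It is not taken from the talk: the `ModelCandidate` record, the individual checks and the 0.9 accuracy threshold are all hypothetical, chosen only to show that experimentation stays unconstrained while promotion has to pass explicit checks.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCandidate:
    """Hypothetical record for a model a data scientist wants to promote."""
    name: str
    version: str
    owner: str = ""                      # data scientist who owns the model
    has_tests: bool = False              # automated tests attached?
    metrics: dict = field(default_factory=dict)


def promotion_checks(candidate, min_accuracy=0.9):
    """Return the list of failed checks; an empty list means the gate passes.

    The checks and the threshold are illustrative assumptions only.
    """
    failures = []
    if not candidate.owner:
        failures.append("no owning data scientist recorded")
    if not candidate.has_tests:
        failures.append("no automated tests attached")
    accuracy = candidate.metrics.get("accuracy")
    if accuracy is None or accuracy < min_accuracy:
        failures.append("accuracy missing or below threshold")
    return failures


def promote(candidate):
    """Run the gate once, at promotion time, and report the outcome."""
    failures = promotion_checks(candidate)
    return (not failures), failures


if __name__ == "__main__":
    experiment = ModelCandidate(name="churn", version="0.3-notebook")
    release = ModelCandidate(
        name="churn",
        version="1.0.0",
        owner="data-science-team",
        has_tests=True,
        metrics={"accuracy": 0.93},
    )
    for candidate in (experiment, release):
        ok, failures = promote(candidate)
        print(candidate.version, "promoted" if ok else f"blocked: {failures}")
```

In a setup like this, the pipeline would call `promote` as one stage and fail the build when it returns `False`, so the control applies only when a model leaves the research environment.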
Another question that came up concerned the division of responsibilities between data scientists and ML engineers. We at Verta strongly believe that data scientists need to own the model end-to-end. They don’t need to build all the CI/CD tooling, but they do need to own the model, because if it produces incorrect results or fails once it’s deployed, a modeling expert has to figure out why.
Overall, it was wonderful to be back at the Data+AI Summit. If we didn’t connect in the hallways or at the Intel booth, we look forward to seeing you soon!
Use the link below to download a copy of Verta’s presentations from the Data+AI Summit 2022.