Across industries and verticals, organizations are operationalizing AI/ML models in a growing number of intelligent applications. However, when these models begin to act up, the organizations themselves are sometimes the last to know; they don't hear about performance degradation until users complain. And by then, it's usually too late. Organizations have to go into full damage control mode: finding the root cause and fixing broken models while simultaneously placating frustrated users.
Or, worse yet, model degradation can sometimes be almost imperceptible to a human user. At enterprise scale, even small degradations can be costly, both directly and indirectly, when they are left to run unchecked for any significant length of time.
Compared to traditional software, ML models suffer from a surprising lack of visibility. DevOps engineers don't push a new feature live without a way to monitor its performance, but monitoring across the AI/ML lifecycle is much trickier. The rest of this blog unpacks the reasons why and offers a few best practices.
As anyone with experience in MLOps will tell you, AI/ML models don't always work as expected. Before we look at model monitoring, let's consider a few of the reasons why models fail or suffer performance degradation:
While all of the above can (and do) cause AI/ML-enabled products to fail or degrade, in our experience, problems related to data are the primary culprit in most cases.
Model monitoring ensures consistently high-quality results from a model, enabling an organization to:
A robust model monitoring system will provide visibility in the following areas:
Model quality metrics like accuracy, precision, recall, F1 score, and MSE are some of the most common ways to measure model performance. However, organizations often use different metrics for different model types and/or business needs. For example:
Once the most relevant quality metrics are identified (and, ideally, ground truth is established), a monitoring system can compute and track those metrics and detect performance deviations. When a deviation is detected, a sound monitoring system will give you the ability to drill down and perform root cause analysis.
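As a minimal sketch, assuming a binary classification model and scikit-learn, the snippet below computes a handful of quality metrics over a window of scored traffic and flags any metric that has slipped more than a tolerance below a stored baseline; the baseline and tolerance values are purely illustrative.

```python
# Minimal sketch: score a window of predictions against (possibly delayed)
# ground truth and flag metrics that have dropped below a stored baseline.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def quality_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

def detect_deviations(current, baseline, tolerance=0.05):
    # Report metrics that sit more than `tolerance` below their baseline value.
    return {name: baseline[name] - value
            for name, value in current.items()
            if baseline[name] - value > tolerance}
```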
Computing model performance using ground truth can pose challenges. Generating ground truth usually involves hand labeling, which makes it difficult to obtain in a timely manner. For best results, look for a monitoring system that can ingest delayed ground truth labels. In the absence of ground truth, we recommend using other proxies (e.g., click-through rates for a recommendation engine) for real-time quality feedback. Automate as much of this process as possible and don't rely on ad-hoc steps.
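As a rough sketch of that pattern, assuming predictions are logged with a request_id so delayed labels can be joined later, and using click-through rate as the proxy signal; the column names are hypothetical.

```python
# Minimal sketch: join delayed ground truth to logged predictions, and fall
# back to a proxy metric (click-through rate) while labels are outstanding.
import pandas as pd

def join_delayed_labels(predictions: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    # `predictions` holds request_id and y_pred; `labels` holds request_id and
    # y_true, which may arrive hours or days later. Only matched rows can be scored.
    return predictions.merge(labels, on="request_id", how="inner")

def click_through_rate(events: pd.DataFrame) -> float:
    # Proxy quality signal for a recommendation engine: clicks over impressions.
    return events["clicked"].sum() / events["impression"].sum()
```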
Data drift is one of the key reasons model performance degrades over time, and it can serve as a leading indicator of model failures. Drift monitoring allows you to track the distributions of input features, output predictions, and intermediate results, and to detect changes in those distributions.
There are two distinct types of drift tracking:
Several algorithms are used for drift detection, such as cosine distance, KL divergence, and the Population Stability Index (PSI). Your data types will determine which drift detection algorithm to use, and your monitoring system should offer the flexibility to choose among them.
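For illustration, here is a minimal PSI computation for a single numeric feature, assuming you have a reference (e.g., training) sample and a current production sample; the bin count and epsilon guard are arbitrary choices.

```python
# Minimal PSI sketch: compare the current distribution of one numeric feature
# against a reference distribution using bins derived from the reference.
import numpy as np

def population_stability_index(reference, current, bins=10, eps=1e-6):
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)  # values outside the reference range are ignored
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    cur_pct = cur_counts / max(cur_counts.sum(), 1) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```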
Your monitoring solution should allow you to configure drift thresholds both manually and automatically, and alert you when data drift is detected. For immediate remediation, consider building an automated retraining workflow that is triggered when certain features cross a drift threshold.
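A simple sketch of that wiring might look like the following, where the thresholds and the alert and retraining hooks are hypothetical placeholders for whatever alerting and orchestration tooling you use.

```python
# Hypothetical sketch: gate alerts and an automated retraining job on
# per-feature PSI values. Thresholds are illustrative, not recommendations.
ALERT_THRESHOLD = 0.1     # investigate
RETRAIN_THRESHOLD = 0.25  # kick off retraining

def evaluate_drift(psi_by_feature, alert, trigger_retraining):
    # `alert` and `trigger_retraining` are placeholder callables wired to your
    # notification and pipeline-orchestration systems.
    drifted = {name: psi for name, psi in psi_by_feature.items() if psi >= ALERT_THRESHOLD}
    if drifted:
        alert(f"Data drift detected: {drifted}")
    if any(psi >= RETRAIN_THRESHOLD for psi in drifted.values()):
        trigger_retraining(features=sorted(drifted))
```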
Outlier detection is a great way to track anomalous input and output data. Most often, outliers point to issues with data pipelines. For more sensitive models, use outliers to identify edge cases that require manual processing or further review.
Univariate outlier analysis is a good start when tracking outliers for a prediction or a single input feature over time. However, for most production use cases, you may need to analyze impact across multiple variables; hence, multivariate outlier detection is preferred.
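To make the contrast concrete, here is a small sketch using z-scores for the univariate case and scikit-learn's IsolationForest for the multivariate case; the z-score threshold and contamination rate are illustrative.

```python
# Sketch: univariate z-score outliers vs. multivariate IsolationForest outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

def univariate_outliers(values: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    # Flags values more than `z_threshold` standard deviations from the mean.
    z_scores = (values - values.mean()) / values.std()
    return np.abs(z_scores) > z_threshold

def multivariate_outliers(features: np.ndarray) -> np.ndarray:
    # In practice, fit on a reference window and score production batches;
    # fitting and scoring the same batch here keeps the sketch short.
    model = IsolationForest(contamination=0.01, random_state=0)
    return model.fit_predict(features) == -1  # -1 marks an outlier
```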
Basic data quality monitoring (such as missing data, null values, standard deviation, mean, median) can be extremely helpful in production.
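A basic version of these checks takes only a few lines of pandas; which statistics you track per column will depend on your data.

```python
# Sketch: per-column data quality statistics for a batch of inference inputs.
import pandas as pd

def data_quality_report(batch: pd.DataFrame) -> pd.DataFrame:
    numeric = batch.select_dtypes(include="number")
    return pd.DataFrame({
        "missing_fraction": batch.isna().mean(),
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
    })
```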
Along with model performance and data, organizations should monitor the overall service health of models through operational metrics including:
Inference service latency has a massive impact on user experience for real-time and near-real-time systems and needs stringent SLAs. However, a model monitoring system built for AI/ML is typically not the platform of choice for service metrics; in most cases, you're better served sending them to an APM or IT infrastructure monitoring system.
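As an illustrative sketch, the snippet below times individual predictions and summarizes latency percentiles and SLA violations, which you could then forward to your APM; the generic model.predict call and the SLA value are assumptions.

```python
# Sketch: record per-request inference latency and roll it up into percentiles
# and SLA-violation counts for export to an APM / infrastructure monitor.
import time
import numpy as np

latencies_ms = []

def timed_predict(model, features):
    start = time.perf_counter()
    prediction = model.predict(features)  # assumes a scikit-learn-style interface
    latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return prediction

def latency_summary(sla_ms: float = 100.0) -> dict:
    observed = np.array(latencies_ms)
    return {
        "p50_ms": float(np.percentile(observed, 50)),
        "p95_ms": float(np.percentile(observed, 95)),
        "p99_ms": float(np.percentile(observed, 99)),
        "sla_violations": int((observed > sla_ms).sum()),
    }
```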
We've created four best practices to keep in mind for those just starting to consider a model monitoring system.
Allow yourself to take immediate action when necessary. If you are dumping data into a database for post-processing or following ad-hoc monitoring practices, you're introducing significant risk to your organization.
Establishing seamless connections here can help with faster root cause analysis and issue resolution.
Don't put yourself in a situation where you're relying on other data scientists or ML engineers to configure features, set up alerts, etc. Model monitoring should be integrated into your model management systems and associated workflows.
Metrics and techniques used for model monitoring vary widely depending on data and model types. Monitoring systems should be customizable and extensible, allowing you to monitor various data types. You should be able to compute and visualize data in various ways for different audiences.
Looking for more? Download our Intro Guide to Model Monitoring and check out Verta's Model Monitoring capabilities here!