Across industries and verticals, organizations are operationalizing AI/ML models in a growing number of intelligent applications. However, when these models begin to act up, the organizations themselves are sometimes the last to know; they don't hear about performance degradation until users complain. And by then, it's usually too late. Organizations have to go into full damage control mode: finding the root cause and fixing broken models while simultaneously placating frustrated users.
Or, worse yet, model degradation can sometimes be almost imperceptible to a human user. At enterprise scale, even small degradations can be costly, both directly and indirectly, when they are left to run unchecked for any significant length of time.
Compared to traditional software, ML models suffer from a surprising lack of visibility. DevOps engineers don't push a new feature live without a way to monitor its performance, but monitoring across the AI/ML lifecycle is much trickier. The rest of this blog unpacks the reasons why and offers a few best practices.
As anyone with experience in MLOps will tell you, AI/ML models don't always work as expected. Before we look at model monitoring, let's consider a few of the reasons why models fail or suffer performance degradation:
While all of the above can (and do) cause AI/ML-enabled products to fail or degrade, in our experience, problems related to data are the primary culprit in most cases.
Model monitoring ensures consistently high-quality results from a model, enabling an organization to:
A robust model monitoring system will provide visibility in the following areas:
Model quality metrics like accuracy, precision, recall, F1 score, and MSE are some of the most common ways to measure model performance. However, organizations often use different metrics for different model types and/or business needs. For example:
Once the most relevant quality metrics are identified (and, ideally, ground truth is established), a monitoring system can compute and track those metrics and detect performance deviations. When a deviation is detected, a sound monitoring system will give you the ability to drill down and perform root cause analysis.
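As a minimal sketch, assuming a binary classification model and scikit-learn, the snippet below computes a handful of quality metrics over a window of scored traffic and flags any metric that has slipped more than a tolerance below a stored baseline; the baseline and tolerance values are purely illustrative.

```python
# Minimal sketch: score a window of predictions against (possibly delayed)
# ground truth and flag metrics that have dropped below a stored baseline.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def quality_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

def detect_deviations(current, baseline, tolerance=0.05):
    # Report metrics that sit more than `tolerance` below their baseline value.
    return {name: baseline[name] - value
            for name, value in current.items()
            if baseline[name] - value > tolerance}
```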
Computing model performance using ground truth can pose challenges. Generating ground truth usually involves hand labeling, which makes it difficult to obtain in a timely manner. For best results, look for a monitoring system that can ingest delayed ground truth labels. In the absence of ground truth, we recommend using other proxies (e.g., click-through rates for a recommendation engine) for real-time quality feedback. Automate as much of this process as possible and don't rely on ad-hoc steps.
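As a rough sketch of that pattern, assuming predictions are logged with a request_id so delayed labels can be joined later, and using click-through rate as the proxy signal; the column names are hypothetical.

```python
# Minimal sketch: join delayed ground truth to logged predictions, and fall
# back to a proxy metric (click-through rate) while labels are outstanding.
import pandas as pd

def join_delayed_labels(predictions: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    # `predictions` holds request_id and y_pred; `labels` holds request_id and
    # y_true, which may arrive hours or days later. Only matched rows can be scored.
    return predictions.merge(labels, on="request_id", how="inner")

def click_through_rate(events: pd.DataFrame) -> float:
    # Proxy quality signal for a recommendation engine: clicks over impressions.
    return events["clicked"].sum() / events["impression"].sum()
```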
Data drift is one of the key reasons model performance degrades over time, and it can serve as a leading indicator of model failures. Drift monitoring allows you to track the distributions of input features, output predictions, and intermediate results, and to detect changes in those distributions.
There are two distinct types of drift tracking:
Several algorithms are used for drift detection, such as cosine distance, KL divergence, and the Population Stability Index (PSI). Your data types will determine which drift detection algorithm to use, and your monitoring system should offer the flexibility to choose among them.
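For illustration, here is a minimal PSI computation for a single numeric feature, assuming you have a reference (e.g., training) sample and a current production sample; the bin count and epsilon guard are arbitrary choices.

```python
# Minimal PSI sketch: compare the current distribution of one numeric feature
# against a reference distribution using bins derived from the reference.
import numpy as np

def population_stability_index(reference, current, bins=10, eps=1e-6):
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)  # values outside the reference range are ignored
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    cur_pct = cur_counts / max(cur_counts.sum(), 1) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```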
Your monitoring solution should allow you to configure drift thresholds both manually and automatically, and alert you when data drift is detected. For immediate remediation, consider building an automated retraining workflow that is triggered when certain features cross a drift threshold.
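A simple sketch of that wiring might look like the following, where the thresholds and the alert and retraining hooks are hypothetical placeholders for whatever alerting and orchestration tooling you use.

```python
# Hypothetical sketch: gate alerts and an automated retraining job on
# per-feature PSI values. Thresholds are illustrative, not recommendations.
ALERT_THRESHOLD = 0.1     # investigate
RETRAIN_THRESHOLD = 0.25  # kick off retraining

def evaluate_drift(psi_by_feature, alert, trigger_retraining):
    # `alert` and `trigger_retraining` are placeholder callables wired to your
    # notification and pipeline-orchestration systems.
    drifted = {name: psi for name, psi in psi_by_feature.items() if psi >= ALERT_THRESHOLD}
    if drifted:
        alert(f"Data drift detected: {drifted}")
    if any(psi >= RETRAIN_THRESHOLD for psi in drifted.values()):
        trigger_retraining(features=sorted(drifted))
```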
Outlier detection is a great way to track anomalous input and output data. Most often, outliers point to issues with data pipelines. For more sensitive models, use outliers to identify edge cases that require manual processing or further review.
Univariate outlier analysis is a good start when tracking outliers for a prediction or a single input feature over time. However, for most production use cases, you may need to analyze impact across multiple variables; hence, multivariate outlier detection is preferred.
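To make the contrast concrete, here is a small sketch using z-scores for the univariate case and scikit-learn's IsolationForest for the multivariate case; the z-score threshold and contamination rate are illustrative.

```python
# Sketch: univariate z-score outliers vs. multivariate IsolationForest outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

def univariate_outliers(values: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    # Flags values more than `z_threshold` standard deviations from the mean.
    z_scores = (values - values.mean()) / values.std()
    return np.abs(z_scores) > z_threshold

def multivariate_outliers(features: np.ndarray) -> np.ndarray:
    # In practice, fit on a reference window and score production batches;
    # fitting and scoring the same batch here keeps the sketch short.
    model = IsolationForest(contamination=0.01, random_state=0)
    return model.fit_predict(features) == -1  # -1 marks an outlier
```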
Basic data quality monitoring (such as missing data, null values, standard deviation, mean, median) can be extremely helpful in production.
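A basic version of these checks takes only a few lines of pandas; which statistics you track per column will depend on your data.

```python
# Sketch: per-column data quality statistics for a batch of inference inputs.
import pandas as pd

def data_quality_report(batch: pd.DataFrame) -> pd.DataFrame:
    numeric = batch.select_dtypes(include="number")
    return pd.DataFrame({
        "missing_fraction": batch.isna().mean(),
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
    })
```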
Along with model performance and data, organizations should monitor the overall service health of models through operational metrics including:
Inference service latency has a massive impact on user experience for real-time and near-real-time systems and needs stringent SLAs. However, a model monitoring system built for AI/ML is typically not the platform of choice for service metrics; in most cases, you're better served sending them to an APM or IT infrastructure monitoring system.
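As an illustrative sketch, the snippet below times individual predictions and summarizes latency percentiles and SLA violations, which you could then forward to your APM; the generic model.predict call and the SLA value are assumptions.

```python
# Sketch: record per-request inference latency and roll it up into percentiles
# and SLA-violation counts for export to an APM / infrastructure monitor.
import time
import numpy as np

latencies_ms = []

def timed_predict(model, features):
    start = time.perf_counter()
    prediction = model.predict(features)  # assumes a scikit-learn-style interface
    latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return prediction

def latency_summary(sla_ms: float = 100.0) -> dict:
    observed = np.array(latencies_ms)
    return {
        "p50_ms": float(np.percentile(observed, 50)),
        "p95_ms": float(np.percentile(observed, 95)),
        "p99_ms": float(np.percentile(observed, 99)),
        "sla_violations": int((observed > sla_ms).sum()),
    }
```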
We've created four best practices to keep in mind for those just starting to consider a model monitoring system.
Allow yourself to take immediate action when necessary. If you are dumping data into a database for post-processing or following ad-hoc monitoring practices, you're introducing significant risk to your organization.
Establishing seamless connections here can help with faster root cause analysis and issue resolution.
Don't put yourself in a situation where you're relying on other data scientists or ML engineers to configure features, set up alerts, etc. Model monitoring should be integrated into your model management systems and associated workflows.
Metrics and techniques used for model monitoring vary widely depending on data and model types. Monitoring systems should be customizable and extensible, allowing you to monitor various data types. You should be able to compute and visualize data in various ways for different audiences.
Looking for more? Download our Intro Guide to Model Monitoring and check out Verta's Model Monitoring capabilities here!