How do I know if my model is performing well in the real world?


You’ve gathered and cleaned your training data, fine-tuned your feature engineering, and tested the accuracy of your model. Now you’re ready to push it to production, but you don’t know whether it will perform as expected in the real world. Training and validation data often differ from production data: features may drift, the relationships a model learned may stop holding, and the data pipeline or schema may change. A model that once had reasonable accuracy can quickly become misleading and inaccurate. Knowing when a model has gone stale, or when your real-world data has changed, tells you when to retrain the model or update your training data and feature engineering. This is why monitoring your models is such an important part of the model development lifecycle.
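As a concrete illustration of detecting that kind of change, here is a minimal sketch of a feature drift check that compares training and production values with a two-sample Kolmogorov–Smirnov test. The column values, the 0.05 significance cutoff, and the simulated data are illustrative assumptions, not a prescribed rule; a monitoring tool would typically run a check like this per feature on a schedule.

```python
# A minimal feature drift check, assuming you can pull the training-time and
# recent production values for a single feature. The alpha cutoff and the
# simulated data below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, prod_values, alpha=0.05) -> bool:
    """Flag drift when the two samples are unlikely to share a distribution."""
    statistic, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha

# Example: production values have shifted upward relative to training.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod = rng.normal(loc=0.5, scale=1.0, size=5_000)
print(feature_drifted(train, prod))  # True: the shift is detected
```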

Model monitoring often focuses only on identifying what is causing performance degradation or critical errors, so you can assess more quickly and accurately how to improve your models. There’s more to monitoring than finding errors, though. Comprehensively monitoring models in production lets you improve your model development process, improve your model deployments, and facilitate collaboration across your team. We’ll look at how each of these three key areas can be improved with flexible and robust model monitoring.

Improving Model Performance

It’s essential to get insights into the behavior of a specific model, but it’s also important to compare two or more versions of a model to track different campaigns or training datasets. Comparing models across specific time periods and for segments of interest can show you where features might be drifting or which populations a given model predicts incorrectly. Imagine that you have three fraud detection models in production and you want to track their accuracy for your high-value customers. By monitoring how your models perform on specific segments, you can catch performance degradation before it impacts your business and see which features are causing the drop. You should be able to inspect performance manually, but you also want to create alerts that notify your team as soon as performance drops below a defined threshold. You can specify which metrics and segments matter most to you, decide who should be alerted, and track how models or segments have performed over time. Those metrics can be the drift of a specific feature or a change in feature importance for a segment of inputs that is of high value to you or your team.

[Figure 1]
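To make the threshold-alert idea concrete, here is a minimal sketch of a segment-level accuracy check. The `window` DataFrame of scored production data, the `segment` column, and the `notify` callback are hypothetical stand-ins for whatever your pipeline or monitoring tool provides; it is not a specific product’s API.

```python
# A minimal sketch of segment-level performance monitoring with a threshold
# alert. The window of labeled production data, the segment column, and the
# notify() callback are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # alert when any segment drops below this

def check_segment_accuracy(window: pd.DataFrame, notify) -> None:
    """Compute accuracy per segment and alert on any that fall below threshold."""
    for segment, rows in window.groupby("segment"):
        acc = accuracy_score(rows["label"], rows["prediction"])
        if acc < ACCURACY_THRESHOLD:
            notify(f"Model accuracy for segment '{segment}' fell to {acc:.2%}")

# Example usage with a small, fabricated window of scored production data.
window = pd.DataFrame({
    "segment":    ["high_value", "high_value", "standard", "standard"],
    "label":      [1, 0, 1, 0],
    "prediction": [1, 1, 1, 0],
})
check_segment_accuracy(window, notify=print)
```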

Improving Deployments

There’s more to deploying a model in production than pushing it live and keeping it there until it’s replaced by another. One common option is a shadow deployment, where a second model receives input and generates predictions, but those predictions aren’t returned to end users. This is usually done by duplicating requests so that all production traffic goes to both the deployed model and the shadow model, logging every request and prediction, and reviewing the logs later for surprising predictions. Another common deployment strategy is a canary release, which uses two models: an older model that generates the majority of the inferences and a newer model that handles a small percentage of them. Over time, we slowly increase the amount of traffic handled by the new model and observe its performance until we’re confident it will perform well enough to fully replace the older model. Either of these approaches requires a monitoring solution that lets you accurately compare the performance of multiple models with appropriate time granularity and aggregation strategies.
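The sketches below show the core of both patterns under stated assumptions: the model objects exposing a `predict()` method, the logger setup, and the 5% canary fraction are hypothetical placeholders rather than any particular serving framework’s API.

```python
# Minimal sketches of shadow and canary serving. The model objects, logger,
# and traffic fraction are hypothetical placeholders.
import logging
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("deployments")

def shadow_handle(features, live_model, shadow_model):
    """Return the live model's prediction; score and log the shadow silently."""
    live_pred = live_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        # Log both predictions so they can be compared offline later.
        logger.info("live=%s shadow=%s", live_pred, shadow_pred)
    except Exception:
        # A shadow failure must never affect the live response.
        logger.exception("shadow model failed")
    return live_pred  # only the live model's result reaches end users

def canary_handle(features, old_model, new_model, canary_fraction=0.05):
    """Route a small, adjustable share of traffic to the new model."""
    use_new = random.random() < canary_fraction
    model = new_model if use_new else old_model
    pred = model.predict(features)
    logger.info("model=%s prediction=%s", "new" if use_new else "old", pred)
    return pred
```

Logging which model served each request is what makes the later comparison possible: the monitoring system can aggregate the logged predictions by model version and time window to decide when the new model is ready to take more traffic.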

Improving Operations

Once a model is in production, its performance and your ML pipeline matter to multiple stakeholders, including data science managers, business executives, and the infrastructure teams responsible for the pipeline. Monitoring can be as simple as alerts sent to a data scientist when a model’s performance drops, but there are richer ways to bring your whole team together: reporting, charts, graphs, and observability dashboards let non-programmers see how models and the data pipeline are performing. For business-critical models, giving different users visibility into performance and system health is a vital way to establish a common view of your systems.

ML monitoring starts with performance monitoring but can be a key driver of a robust MLOps practice. If you’d like to learn more about how Truera’s Monitoring can help you build your ML Observability and MLOps toolkit, watch our Monitoring webinar series.
