Embedding model quality into ML Ops at scale

In our Trustworthy AI podcast series, Andy McMahon and Greig Cowan share insights about their journey

For many large enterprises, value from Machine Learning (ML) initiatives remains stubbornly elusive. Models can take too much time and effort to get into production, or never get there at all. When they do, real-life outcomes can disappoint.

ML Ops – defined as a core function of ML engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them – is often viewed as the solution. And done right, it can be a catalyst for the broader transformation needed to scale ML adoption.

However, too often, ML Ops becomes a narrow technology/tooling initiative, with inadequate attention to the necessary changes in data, process and people (skills/culture). The focus is often on speeding up the throughput of models through the model lifecycle, with model quality left as an afterthought.

One firm that has recognised the need to address both aspects is the UK’s NatWest Group (NWG). We spoke to two leading practitioners at NatWest, Andy McMahon and Greig Cowan, to learn about their journey in our Trustworthy AI podcast series. Andy heads ML Ops there, while Greig heads Data Science and Engineering for Data Innovation.

You can listen to the full podcast, or read on for some of the highlights.

On the pace of ML adoption in financial services vs. tech-first industries

[AM] In a regulated industry, the risk appetite is not the same as in, say, gaming, advertising or e-commerce. For example, a bank cannot just run an A/B test on a model that might reject people’s mortgage applications and say, “Oh well, at least we learnt something from that.”

Nevertheless, the opportunities from using ML effectively in banking are massive. We have some of the largest datasets in any industry. Finance is such an important part of everyone’s lives, so the direct customer impact we can have is also huge. 

A big part of our role internally is trying to sell that vision, convincing people that if you do this in the right way, with appropriate guardrails in place, you can get a lot of the benefits without introducing huge new elements of risk. 

On NWG’s four-pillar ML adoption journey so far

[GC] We have been on a big transformation journey over the last few years – from an organisation that was quite reactive in terms of producing reports and analytics about what has happened in the past, to one that is proactive in trying to predict how the business should evolve and adapt to different conditions. 

Five years ago, our data scientists were not even able to use Python easily; setting up an environment for a user took huge amounts of time and effort. In this, NWG was not an exception – it was typical of banks at that time. 

We had data scientists and engineers working in silos, on their own machines. They developed code with access to small amounts of data. Often, they had really good ideas and were genuinely innovating, but they could not escape the confines of that single machine. There was no route to go-live that could turn their model or pipeline into a service that could then be consumed by customers or colleagues around the organization.

This really limited the value analytics/ML could bring. Everything was very much done as a proof of concept in a sandbox environment. But how could we turn this into something real?

How do we have the right people in the organization with the right skills? How do we have the right process and operating model, so that they know how to use the new technology we were introducing? And how do we have the data in the right place, so that it can be consumed by the pipelines and models we were building without breaking things? Those four categories (people, process, data and technology) helped us focus and move up the curve.

Even today, as we move to the cloud, for example, we are really trying to be clear about how people use that new technology, how it affects their day-to-day workflow, and at which points they should plug into the different parts of the cloud stack we are developing.

Trying to focus on those things, as a collective, has really helped us build this environment that lets us go from a little proof of concept in a single data science notebook through to something that can be rapidly industrialized.

On the importance of getting models into production

[AM] Through no fault of their own, a lot of organisations end up with amazing data science talent who will never see a production environment. It could be that the production environment does not even exist, or that it is very hard to get into. As a result, many data scientists were unfamiliar with the concept of separate development, test and production environments. 

After years of hard work, things have changed. Instead of a nebulous “road to go-live”, data scientists can now see the steps needed much more clearly. Getting models into production no longer requires tons of heavy lifting from them: we have put a system in place that does that heavy lifting for them.

Getting into production can also be intimidating, as data scientists now need to worry about things breaking in production and the real-life consequences of that. But we point out to them that production is where the value gets generated, and where they can impact customers and really do things at massive scale.

That’s what industrialization really means: getting to production at scale. And ML Ops, to me, is how you do that, again and again, in a monitored and controlled way. You free up your talent to do the things they’re meant to do. Data scientists are off building the coolest models, data analysts are busy understanding and organising the data, and engineers are off building the most resilient pipelines. The biggest achievement of the past couple of years has really been that work of clearing the road for people. Now you’re seeing relatively small, nimble teams, without tons of engineers, putting machine learning products into production that affect the bank. That’s incredible to see!

Last modified on June 21st, 2023