AI Quality issues are a key factor holding organizations back from success with AI. But what is AI Quality, exactly? Here’s a framework.
Artificial intelligence (AI) holds tremendous potential to create value for organizations and broader society. A McKinsey study estimates that AI applications could generate $3.5 trillion to $5.8 trillion in value per year across 19 industries alone. This promise, coupled with the broad availability of solutions for data preparation, model training, and model deployment, has spurred serious adoption of machine learning (ML) and AI in enterprises.
These enterprises are now shifting their focus from getting these basic building blocks in place to tackling the next big challenge: how do you drive real, sustainable organizational value with AI? This shift has been prompted by a series of challenges in effectively operationalizing and extracting value from AI. It has been estimated that only 13% of models in development ever make it to production, an extremely large gap that has been called “the AI Chasm.”
Challenges that ML models face
Model development has traditionally focused on optimizing the accuracy of AI models. However, accuracy is just the tip of the iceberg. A number of other attributes of data and AI models need careful attention to effectively drive organizational value.
Challenge #1 – Model performance, in development and real-world use
Ensuring that model performance (and the associated organizational value that it drives) is maintained as a model moves from development to production is a significant challenge. This problem was observed in spades during the Covid-19 pandemic, when consumer behavior suddenly shifted and numerous deployed models (e.g., for retail shopping behavior, supply chain logistics, sentiment analysis, and fraud and credit risk) degraded significantly in performance. For example, a food supply company saw its automated inventory management system fail when sudden bulk orders broke its predictive algorithms. Another company, which uses AI to assess the sentiment of news articles and provide investment recommendations, noticed that its advice skewed more negative than it should have because the news was gloomier than usual.
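One common way to catch this kind of degradation early is to monitor input drift between training and production data. Below is a minimal sketch of the Population Stability Index (PSI), a widely used drift statistic; the bin count, thresholds, and the synthetic "order size" data are illustrative assumptions, not part of any particular product.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """Population Stability Index (PSI) between a training-time ("expected")
    and production ("actual") sample of one feature. A common rule of thumb:
    PSI < 0.1 is stable, 0.1-0.25 is a moderate shift, > 0.25 is a major shift."""
    # Bin edges come from the training distribution; quantiles handle skew.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    eps = 1e-6  # avoid log(0) when a bin is empty
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Illustration: a sudden shift toward bulk orders, as in the inventory example.
rng = np.random.default_rng(0)
train_orders = rng.normal(10, 2, 10_000)  # typical order sizes at training time
prod_orders = rng.normal(18, 5, 10_000)   # bulk-buying behavior in production
print(population_stability_index(train_orders, prod_orders))  # far above 0.25
```

Tracking a statistic like this per feature, on a schedule, is what turns "the world changed" from a post-mortem finding into an alert.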
Challenge #2 – The societal impact of models, for issues such as fairness
When machine learning models are used to make predictions and decisions that impact the lives of people, they cannot drive organizational value without guardrails to ensure that their societal impact is not negative. This requires careful attention to fairness, transparency, accountability and related considerations. For example, Amazon was forced to abandon an AI-driven recruiting tool after it showed bias against women. Apple and Goldman Sachs found themselves under investigation when their credit card offering appeared to give women smaller credit lines than their husbands. Regulatory activity across the world provides a further impetus for organizations to recognize the importance of this topic as they embrace AI to drive value.
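To make "guardrails" concrete, here is a minimal sketch of one widely used fairness check, the disparate impact ratio; the screening decisions and group labels below are hypothetical.

```python
import numpy as np

def disparate_impact_ratio(decisions, group):
    """Ratio of favorable-outcome rates between the least- and most-favored
    groups. The "four-fifths rule" used in US employment contexts treats
    ratios below 0.8 as a signal of potential adverse impact."""
    rates = {g: float(decisions[group == g].mean()) for g in np.unique(group)}
    return min(rates.values()) / max(rates.values()), rates

# Hypothetical screening decisions (1 = advance the candidate).
decisions = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0])
group = np.array(["m", "m", "m", "m", "m", "f", "f", "f", "f", "f"])
ratio, rates = disparate_impact_ratio(decisions, group)
print(rates)  # {'f': 0.4, 'm': 0.6}
print(ratio)  # about 0.67, below the 0.8 rule of thumb, so worth investigating
```

A single ratio is not a verdict on fairness, but checks like this make societal impact observable and reviewable rather than anecdotal.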
Challenge #3 – Operationalizing AI applications in effective use cases
ML models are often used as part of decision systems involving human operators, in settings such as credit decisioning, clinical workflows, predictive policing, recidivism prediction, and sales and customer service operations. ML models often fail to get adopted in these settings because their decisions are not explained to the humans involved, and so they fail to inspire trust. We have seen examples of doctors ignoring recommendations from an AI-driven oncology application because the model’s predictions and rationale were not explained and did not inspire confidence.
Challenge #4 – Data quality
A related challenge is ensuring data quality. The quality of the data used to build, test, and feed a model has a significant impact on how well the model delivers value: issues such as missing values, labeling errors, and unrepresentative samples propagate directly into model behavior.
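As a concrete illustration, here is a minimal sketch of a few basic data quality checks using pandas; the function, columns, and reference population shares are illustrative assumptions.

```python
import pandas as pd

def basic_data_quality_report(df, reference_shares=None):
    """A few simple checks tied to the attributes above: per-column
    missingness, duplicate rows, and (optionally) how a categorical
    column's shares compare to a reference population."""
    report = {
        "missing_fraction": df.isna().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    if reference_shares is not None:
        col, expected = reference_shares
        observed = df[col].value_counts(normalize=True).to_dict()
        # Positive gap = over-represented relative to the reference population.
        report["representativeness_gap"] = {
            k: observed.get(k, 0.0) - v for k, v in expected.items()
        }
    return report

df = pd.DataFrame({
    "income": [52_000, None, 61_000, 45_000],
    "region": ["north", "north", "north", "south"],
})
# Hypothetical population shares we expect the training data to mirror.
print(basic_data_quality_report(df, ("region", {"north": 0.5, "south": 0.5})))
```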
The silo illusion: all of these are actually AI Quality problems
While all of these challenges may at first appear unrelated, or addressable individually, the reality is that they are interrelated. Given these tight connections, I believe that we need to place these challenges in a single category and address them holistically. At TruEra, we use the term AI Quality to refer to this category. Solving the challenge of AI Quality represents a huge opportunity for models to gain trust, achieve consistent performance, and operate in fair, socially accepted ways. What’s at stake? Trillions.
Defining AI Quality
“Quality” may seem like an overly abstract, or worse, arbitrarily defined concept. If beauty is in the eye of the beholder, is the quality of an AI solution also in the eye of the beholder? And what happens when multiple different stakeholders need to evaluate and approve of a model’s quality before it can be put to practical, real-life use? How can you get all of these beholders to agree?
For these reasons, a workable definition of quality must rest on metrics and quantification: attributes that stakeholders can agree upon and on which standards can be based.
AI Quality is the set of observable attributes of an AI system that allow you to assess, over time, the system’s real-world success. Here, real-world success includes the value and risk that the AI system creates for both the organization and broader society.
The categories of AI Quality
AI Quality is evaluated across four key categories:
- Model performance: the observable attributes of business value and risk, such as model accuracy, stability, conceptual soundness, and robustness.
- Societal impact: the observable attributes of societal value and risk, such as fairness, transparency, privacy, and security.
- Operational compatibility: those attributes that enable humans to work more effectively with the AI system, and the AI system to work with other systems in a larger process to achieve a business goal. This includes things like explanations of the model function, documentation, and collaborative capabilities.
- Data quality: the attributes of the datasets used to build and test models that impact model fitness, including missing data and data representativeness, as well as the quality of production data.
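To make the framework concrete, here is a minimal sketch of how metrics from these four categories might be gathered into one report and screened against agreed thresholds; the class, metric names, and floor values are hypothetical illustrations, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class AIQualityReport:
    """Toy container grouping metrics under the four categories above.
    The metric names and thresholds are illustrative, not a standard."""
    model_performance: dict = field(default_factory=dict)
    societal_impact: dict = field(default_factory=dict)
    operational_compatibility: dict = field(default_factory=dict)
    data_quality: dict = field(default_factory=dict)

    def flag(self, floors):
        """Return (category, metric, value) triples that fall below a floor."""
        failures = []
        for category, metrics in vars(self).items():
            for name, value in metrics.items():
                floor = floors.get((category, name))
                if floor is not None and value < floor:
                    failures.append((category, name, value))
        return failures

report = AIQualityReport(
    model_performance={"auc": 0.81},
    societal_impact={"disparate_impact_ratio": 0.71},
    data_quality={"completeness": 0.98},
)
print(report.flag({("societal_impact", "disparate_impact_ratio"): 0.8}))
# [('societal_impact', 'disparate_impact_ratio', 0.71)]
```

The point of a structure like this is the one made above: quality stops being in the eye of the beholder once every stakeholder reviews the same quantified attributes.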
Core to AI Quality management: explainability
AI Quality is measured across these four categories. But assessing and improving it requires a key capability: explainability.
Why is explainability important?
Explainability is the ability to accurately characterize a model’s function. It is key for operational compatibility in decision systems involving humans, and it is a building block for root cause analysis of performance and fairness (societal impact) problems, which can often be traced back to data quality issues. You can read more about the role of explainability in advancing AI Quality in the blog “Machine learning explainability is just the beginning.”
Not all explainability is alike
While explainability is key, it is not a one-size-fits-all technology. There is a broad set of explainability methods, and users have to carefully pick the right method for their use case and for the AI Quality attributes they are evaluating. Explainability methods differ along several dimensions:
- Explanation scope: what is the scope of the explanation, and what output are we trying to explain?
- Inputs: what inputs does the explanation method use?
- Access: what model and data access does the explanation method have?
- Stage: to what stage of the model do we apply the explanations?
There has also been significant progress in understanding the theoretical underpinnings of various explanation methods, including those that build on Shapley values.
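As an illustration of the Shapley value idea mentioned above, here is a from-scratch sketch of a sampling-based estimator that explains a single prediction (a local explanation). It is a teaching sketch under simplifying assumptions; production libraries use faster and more careful estimators, and the toy linear model is purely illustrative.

```python
import numpy as np

def sampling_shapley(predict, x, background, n_samples=200, seed=0):
    """Monte Carlo estimate of Shapley values for one prediction: each
    feature's average marginal contribution over random feature orderings,
    with "absent" features filled in from a background sample."""
    rng = np.random.default_rng(seed)
    n_features = len(x)
    phi = np.zeros(n_features)
    for _ in range(n_samples):
        order = rng.permutation(n_features)
        z = background[rng.integers(len(background))].copy()  # all "absent"
        prev = predict(z[None, :])[0]
        for j in order:
            z[j] = x[j]                    # reveal feature j
            curr = predict(z[None, :])[0]
            phi[j] += curr - prev          # marginal contribution of j
            prev = curr
    return phi / n_samples

# Sanity check on a toy linear model: with a zero background, the Shapley
# values should recover w * x exactly.
w = np.array([2.0, -1.0, 0.5])
predict = lambda X: X @ w
background = np.zeros((50, 3))
print(sampling_shapley(predict, np.array([1.0, 1.0, 1.0]), background))
# -> [2.0, -1.0, 0.5]
```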
Under one banner: the interrelationships of AI Quality elements
You might be wondering, how do all of these categories fit under one banner? How can accuracy possibly be related to bias? How does model explainability connect to its robustness?
A key insight is that these four areas are deeply connected. Model performance and societal impact attributes like fairness, transparency, and privacy have known technical relationships and tradeoffs. For example, achieving fairness sometimes means accepting a model with lower predictive accuracy, and robust models in computer vision tend to produce better explanations but may lose some test accuracy. Many of the challenges that negatively impact model performance, fairness, and operational efficacy trace back to issues with data quality. For example, challenges with the fairness of facial recognition models and the accuracy of medical diagnosis models are tightly connected to data quality issues, such as historical bias and labeling errors in training data.
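A small simulation can make the fairness-accuracy tradeoff concrete. The sketch below compares a single decision threshold with group-specific thresholds chosen to equalize positive rates, one simple post-processing route to demographic parity; the synthetic data and target rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
group = rng.integers(0, 2, n)             # hypothetical protected attribute
score = rng.beta(2 + group, 2, n)         # model scores, shifted by group
y = (rng.random(n) < score).astype(int)   # outcomes correlated with score

def evaluate(thresholds):
    """Accuracy and positive-rate parity for per-group decision thresholds."""
    pred = (score >= thresholds[group]).astype(int)
    accuracy = (pred == y).mean()
    rates = [pred[group == g].mean() for g in (0, 1)]
    return accuracy, min(rates) / max(rates)

# One shared threshold: accuracy-oriented, but positive rates differ by group.
print(evaluate(np.array([0.5, 0.5])))

# Group-specific thresholds that equalize positive rates (demographic parity):
# the parity ratio moves toward 1.0, typically at some cost in accuracy.
q = 0.5  # target positive rate for both groups
parity_thresholds = np.array(
    [np.quantile(score[group == g], 1 - q) for g in (0, 1)]
)
print(evaluate(parity_thresholds))
```

Running both configurations side by side is exactly the kind of quantified tradeoff discussion the AI Quality framing is meant to enable.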
In summary, many of the challenges in bringing effective, trustworthy, and responsible AI-based applications to fruition today are interrelated: they all fit under the category of “AI Quality” problems. This blog post has shown how these problems are connected, explained why explainability is key to solving them, and provided a framework for thinking about AI Quality.
In the next blog post, we will talk about processes and tools to manage AI Quality, and how you can use this framework to address the key challenges that ML models face in getting into, and staying in, production.
For more information on the fundamentals of AI Quality, check out the AI Quality Education page.