Fair Machine Learning Models: From Concept to Reality


Machine learning is used ubiquitously in applications such as facial recognition and online advertising. However, many of these ML models show clear evidence of unintentional and harmful racial and gender biases. As a result, fairness in machine learning systems is a hot topic. Businesses that reap the efficiency benefits of AI must also take special care, and apply real expertise, to ensure that their models are accurate, trustworthy, and fair.

In this blog series, we’ll first talk about how to transform fairness from an abstract goal into a reality for machine learning models. Next, we’ll outline how models can become biased. Finally, we’ll detail a fairness workflow that data scientists and validators can use to understand, measure, and debug biases within their models. We’ll show how design and machine learning can jointly enable this type of workflow, and discuss how human-centric design is critical for our product. 

What does it mean to be fair? Measuring and understanding fairness

Hey, that one’s bigger. Is that fair? (Photo by Kari Shea on Unsplash)

It can be hard to precisely define what “fair” means. At a fundamental level, fairness reflects a desire for an equitable balance. From this, natural questions arise: between whom are we trying to achieve fairness? Does fairness mean equal outcomes between groups in aggregate? Or between individuals? How do we mathematically define “fair” as a quantity?

We identified three key principles for implementing fairness in real-world models:

  • It is important to consider both individual and group notions of fairness. 
  • Fairness is far more than a metric. Fairness is a workflow of:
    • identifying bias (disparate outcomes between two or more groups),
    • performing root cause analysis to determine whether disparities are justified, and
    • employing a targeted mitigation strategy.
  • Fairness metrics are not one size fits all. Using a taxonomy of fairness metrics, we can pick those metrics that are most appropriate for the particular scenario at hand.

Let’s dive a bit deeper into each of these to better understand the workflow. We’ll get started by defining and describing notions of fairness, root cause analysis and mitigation strategies, and how to pick a fairness metric.

Individual and group notions of fairness

Usually, fairness is thought of as a concept that is enforced between two groups. A widely cited example is ProPublica’s analysis of the criminal risk assessments provided by the company Northpointe. ProPublica’s review of Northpointe’s predictions showed that the algorithm trained to predict whether incarcerated people would re-offend was biased against African Americans as compared to Caucasians. The algorithm wrongly labeled Black defendants as future re-offenders at almost twice the rate that it wrongly identified White defendants. White defendants were more often wrongly identified as at low risk for future offenses.
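To make this kind of error-rate comparison concrete, here is a minimal sketch in plain NumPy. It uses randomly generated stand-ins for the labels, predictions, and group membership rather than any real risk-assessment data, but it shows the shape of the computation: the false positive rate captures people wrongly labeled as likely re-offenders, and the false negative rate captures people wrongly labeled as low risk.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """False positive rate (wrongly flagged as high risk) and
    false negative rate (wrongly flagged as low risk)."""
    fpr = y_pred[y_true == 0].mean()          # actual non-re-offenders predicted high risk
    fnr = (1 - y_pred[y_true == 1]).mean()    # actual re-offenders predicted low risk
    return fpr, fnr

# Synthetic stand-ins: y_true = re-offended, y_pred = predicted high risk, group = 0/1.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
group = rng.integers(0, 2, size=1000)

for g in (0, 1):
    fpr, fnr = error_rates(y_true[group == g], y_pred[group == g])
    print(f"group {g}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

A large gap between the two groups’ false positive rates is exactly the kind of disparity the ProPublica analysis surfaced.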

This notion of fairness as a concept between groups can be elaborated further to analyze subgroups. For example, researchers at MIT showed that facial recognition software from major technology providers can have significantly higher gender identification errors when analyzing images of dark-skinned women in particular, demonstrating a discrepancy in model effectiveness for a subgroup at the intersection of both gender and skin tone. 

Group fairness is critical, but it is only part of the puzzle. Recall the Apple card investigation: shortly after the card was issued, women claimed that they were getting lower lines of credit, or were rejected outright, while their husbands, with whom they had commingled finances, were approved, often with larger credit lines. Here, the algorithm’s individual fairness was being called into question. We would want to measure the approval algorithm’s fairness by comparing a variety of men and women with similar profiles, not to see whether the model discriminates at the group level (women vs. men), but whether it specifically harms individuals with similar financial profiles (in this case, spouses). This style of analysis, coined individual fairness by Dwork et al., seeks to ensure that similar individuals are treated by the model in similar ways.
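As a rough illustration of individual fairness, one could check whether a model scores similar applicants similarly by comparing each applicant’s score with the scores of their nearest neighbors in feature space (with the protected attribute excluded from the similarity computation). The sketch below uses made-up features and scores and a simple nearest-neighbor notion of similarity; it is one plausible way to operationalize the idea, not a prescribed test.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def individual_consistency(X, scores, n_neighbors=5):
    """Average absolute score gap between each applicant and their most
    similar peers, where similarity ignores the protected attribute.
    Values near zero suggest similar individuals get similar treatment."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_scores = scores[idx[:, 1:]]   # column 0 is each point's self-match
    return np.abs(scores[:, None] - neighbor_scores).mean()

# Hypothetical inputs: X holds financial features (gender removed),
# scores holds the model's approval or credit-line scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
scores = rng.uniform(size=200)
print(f"average score gap among similar applicants: {individual_consistency(X, scores):.3f}")
```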

The Apple card situation is also an instance where fairness and another big challenge in AI, model explainability, are inextricably linked. Not only did individuals with similar financial situations appear to receive widely different outcomes, but Apple and Goldman Sachs, the issuers of the credit cards, also could not quickly and easily explain how the algorithm worked, why it had delivered disparate outcomes, or whether those outcomes were justified. Perceived unfairness and an inability to explain a model can have serious consequences. Within just a few months of starting to issue credit cards, Apple and Goldman Sachs found themselves under investigation by a financial regulator.

Root cause analysis and informed mitigation

Let’s say we built a credit decisioning model that determines whether an individual should receive a loan. The model is trained on data that do not include any specific demographic features, such as gender or ethnicity. However, when we evaluate outcomes after the fact, we see a slight correlation between approval rates and race. Is this an unfair outcome, because the model is constructing proxies of race through other variables, such as zip code or surname? Or is this result justified and fair, because the model is using a reasonable set of features, and there just happens to be a correlation?

It is important that any possible lack of fairness is investigated thoroughly via root cause analysis. Utilizing an AI Quality solution like TruEra, which includes fairness analytics, model builders are no longer blind to the underlying issues of the model or data. Drilling into feature-level drivers of this disparity can be crucial in identifying whether the model’s behavior is justified. Is geographic data driving the higher rate of acceptances for a particular racial demographic? Or is the model using a feature like income in a reasonable way, but it just so happens that income is correlated with race? Performing root cause analysis to connect an oftentimes abstract fairness metric back to the model and its input data can allow data scientists or business executives to determine whether the model’s decisions are justified, and if not, where to start the mitigation process. 
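One simple way to probe feature-level drivers of a disparity, sketched below with synthetic data, is to shuffle a single feature and see how much the approval-rate gap between groups changes. This is an illustrative permutation-based approach, not TruEra’s actual analytics; the logistic regression model and the feature names here are stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def approval_gap(model, X, group):
    """Difference in approval rates between group 1 and group 0."""
    approved = model.predict(X)
    return approved[group == 1].mean() - approved[group == 0].mean()

def disparity_contribution(model, X, group, feature_idx, rng):
    """Change in the approval-rate gap when one feature is shuffled,
    which breaks its association with group membership. A large drop
    suggests that feature is a major driver of the disparity."""
    X_shuffled = X.copy()
    X_shuffled[:, feature_idx] = rng.permutation(X_shuffled[:, feature_idx])
    return approval_gap(model, X, group) - approval_gap(model, X_shuffled, group)

# Synthetic data: one feature acts as a proxy for group membership, one does not.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)
X = np.column_stack([
    rng.normal(loc=group.astype(float)),   # proxy-like feature (think: a zip-code signal)
    rng.normal(size=1000),                 # feature independent of group (stand-in for income)
])
y = (X[:, 0] + X[:, 1] + rng.normal(size=1000) > 0).astype(int)
model = LogisticRegression().fit(X, y)

for idx, name in enumerate(["proxy-like feature", "independent feature"]):
    print(f"{name}: disparity contribution = {disparity_contribution(model, X, group, idx, rng):.3f}")
```

In a sketch like this, the proxy-like feature should account for most of the gap, which is the kind of signal that tells you where to focus mitigation.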

Picking a group fairness metric

Fairness is a sensitive concept. The decision of which group fairness metric to benchmark a model against defines a version of the world to which we aspire. Are we looking to make sure that men and women are given an opportunity at exactly equal rates? Or do we instead want to make sure that the proportion of unqualified people who are accepted by the model is roughly equal across gender? These small nuances have vast implications for the fairness of a model, making the choice of metric an essential and intentional act.
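As a toy illustration of how those two questions map to different metrics, the sketch below computes both a demographic parity gap (difference in overall acceptance rates) and a false positive rate gap (difference in how often unqualified applicants are accepted) on the same predictions. The data is random and purely illustrative; the point is that the two numbers answer different questions and need not agree.

```python
import numpy as np

def metric_gaps(y_true, y_pred, group):
    """Two different notions of 'fair' for the same predictions."""
    a, b = (group == 0), (group == 1)
    # Demographic parity gap: difference in overall acceptance rates.
    dp_gap = y_pred[a].mean() - y_pred[b].mean()
    # False positive rate gap: difference in acceptance rates among the unqualified.
    fpr_gap = y_pred[a & (y_true == 0)].mean() - y_pred[b & (y_true == 0)].mean()
    return dp_gap, fpr_gap

# Synthetic stand-ins for labels (qualified or not), predictions, and group membership.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
group = rng.integers(0, 2, size=1000)

dp_gap, fpr_gap = metric_gaps(y_true, y_pred, group)
print(f"demographic parity gap: {dp_gap:.2f}, false positive rate gap: {fpr_gap:.2f}")
```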

While there are dozens of definitions of fairness metrics that might all be valid in certain scenarios, for the purposes of this post, we’ll zoom out a bit and briefly talk about how these fairness metrics describe different philosophies.

One pair of inherently opposed worldviews is What You See Is What You Get (WYSIWYG) versus We’re All Equal (WAE). Let’s consider the case of SAT scores. The WYSIWYG worldview assumes that observations (your SAT score) reflect ability to complete the task (do well in school) and would consider these scores extremely predictive of success in college. The WAE worldview, by contrast, might assert that SAT scores carry structural biases for societal reasons, and that on average, all groups have equal ability to do well in college even if we can’t observe this through SAT scores. If you believe WYSIWYG, then a perfectly accurate model is necessarily fair, because it matches the labels and observations you have on hand. A WAE believer, on the other hand, would attempt to ensure equal distributions of outcomes across groups, regardless of the labels they observe, since, in their view, accuracy does not guarantee fairness.

There is a sliding scale between WAE and WYSIWYG, and many metrics lie between these opposing notions of what it means to be fair. The two camps have a natural tension between them, because both cannot be satisfied simultaneously. In reality, fairness metrics are often incompatible. Instead of blindly benchmarking a model against a variety of fairness metrics, it’s prudent to take a step back. What objective would you consider “fair” for your problem? And which fairness metric captures this objective? This is an intentional and often difficult choice, and it’s extremely important when considering an end-to-end fairness workflow.

Interested in more about fairness in machine learning? Check out the next blog, “How did my model become unfair?,” in which we discuss how biases can creep into AI/ML models.

Authors:

Divya Gopinath

Russell Holz

Last modified on August 30th, 2023