How did my ML model become unfair?


You built your model with the best of intentions. So why is it exhibiting unfairness?

Photo by Tim Mossholder on Unsplash

In the first blog post, “Fair Machine Learning Models, from Concept to Reality,” we discussed three key steps in a comprehensive workflow for ensuring that machine learning model outcomes are fair. They are:

  • identifying bias (the disparate outcomes of two or more groups)
  • performing root cause analysis to determine whether disparities are justified, and
  • employing a targeted mitigation strategy.

But ML practitioners presumably build their models with the best of intentions. So how can a model become unfair or biased in the first place? 

Broadly, an ML model can be unfair in one of two ways, which we have adapted from Barocas & Selbst 2016:

  1. The model could exhibit bias because of observed differences in model performance. This is when the model is exhibiting more error on a particular group of people due to either the model’s training procedure or insufficiencies in the training data set (e.g. lack of data). 
  2. The model could be biased because of unobserved differences in model performance. In this case, the model is not obviously doing “worse” on one group compared to another and may even match the ground truth labels in a test dataset, but it is still considered unfair because it replicates or amplifies existing biases in the dataset.

Let’s dive a bit further into each of these. 

Unfairness due to observed differences in model performance

A common aphorism heard in AI/ML contexts is “garbage in, garbage out”: if the input training data is biased or incorrect in any way, the model will reflect these biases and inaccuracies. Yet sometimes, even if the ground truth data is reliable, the model may be more error-prone on one group of the population than on another.


As an example, when researchers analyzed the accuracy of commercial facial recognition systems, they found that models exhibited poorer performance on women and darker-skinned individuals as compared to lighter-skinned men, and that this stemmed from a lack of dark-skinned women in the model’s training data. In this case, the data itself was not “incorrect” but exhibited sample or representation bias: it was not reflective of the population on which the model was meant to operate. The model was inappropriately generalizing data collected on lighter-skinned men to other people, and thus had lower accuracy on subgroups of the population.  
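A gap like this only becomes visible when performance is broken out by group rather than averaged over everyone. Here is a minimal sketch of that check. The data is made up, and the function name `error_rate_by_group` is our own, not part of any particular library:

```python
from collections import defaultdict

def error_rate_by_group(y_true, y_pred, groups):
    """Return the fraction of misclassified examples for each group."""
    errors = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        counts[group] += 1
        errors[group] += int(truth != pred)
    return {g: errors[g] / counts[g] for g in counts}

# Toy, made-up data: ground truth labels, model predictions, group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = error_rate_by_group(y_true, y_pred, groups)
overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(rates, overall)
```

In this toy data the overall error rate hides the fact that every mistake falls on group "b", which is exactly the pattern an aggregate accuracy number can mask.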

The assumption that enough “big data” will necessarily produce fair and trustworthy models has been proven incorrect time and again. As Kate Crawford argues, social media is a popular source for large-scale data analysis, but only 16% of online adults even use Twitter, so conclusions drawn exclusively from Twitter data, such as the claim that people are saddest on Thursday nights, are not necessarily correct. Such conclusions assume that a trend seen in one population will carry over to another.

Another reason why a model might perform worse on one group of the population, and thus be considered unfair, is the data availability or sample size disparity problem. Input features may be less informative or collected less reliably for minority groups. The city of Boston, for example, attempted to collect smartphone data from drivers going over potholes in an effort to fix road issues faster, but because this data depends on smartphone ownership, the city realized that neighborhoods with older or less affluent populations wouldn’t be captured as well. These “dark zones” or “shadows” can under-represent or even overlook critical populations. Predictive models can often favor groups that are better represented in the training data because there is less uncertainty associated with those predictions.
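A related effect shows up when we evaluate a model: the smaller the group, the less certain we can be about how well the model does on it. The sketch below uses the standard normal-approximation confidence interval for a proportion. The counts are invented for illustration, and `accuracy_interval` is a hypothetical helper name:

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Normal-approximation confidence interval (95% for z=1.96) for an accuracy estimate."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Made-up counts: both groups show 90% measured accuracy, but group sizes differ.
large_group = accuracy_interval(9000, 10000)
small_group = accuracy_interval(45, 50)

large_width = large_group[1] - large_group[0]
small_width = small_group[1] - small_group[0]
print(large_group, small_group)
```

Both groups have the same measured accuracy, yet the interval for the small group is many times wider, so a real performance gap on that group could easily go unnoticed.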

Unfairness due to unobserved differences in model performance

It’s not always the case that an unfair model performs worse (e.g. has lower accuracy) on a subgroup of the population. In fact, if the dataset itself contains biased or incorrect labels, the model might be perceived as perfectly accurate, but still unfair. 

Let’s take a concrete example to illustrate this further. Natural language models are often trained on large corpora of human-written text, such as news articles. Yet word embeddings that were trained on vast amounts of Google News data were found to be biased because they perpetuated gender stereotypes from how journalists wrote about men versus women. The researchers behind this study showed that the embeddings closely associated gender with specific occupations: “homemaker” and “nurse” were strongly female occupations according to the embeddings, while “maestro” and “boss” were more masculine. This kind of historical bias, also known as negative legacy, occurs when a model is trained on data that is in and of itself biased due to unfair systems or structures. 
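One common way to probe embeddings for this kind of association is to measure how closely a word's vector aligns with a “gender direction,” such as the difference between the vectors for “she” and “he.” The sketch below uses tiny three-dimensional vectors that we invented purely for illustration. They are not real Google News embeddings, and this is not necessarily the exact procedure the researchers used:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "nurse": [-0.6, 0.7, 0.1],
    "boss": [0.5, 0.8, 0.1],
}

# Direction pointing from "he" towards "she".
gender_direction = [s - h for s, h in zip(toy_embeddings["she"], toy_embeddings["he"])]

# Positive scores lean towards "she", negative scores towards "he".
scores = {word: cosine(toy_embeddings[word], gender_direction) for word in ("nurse", "boss")}
print(scores)
```

With real embeddings, an occupation word that carried no gender association would be expected to score close to zero along this direction.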

Models do not just replicate these historical biases. In some cases, they can exacerbate them; this is known as amplification bias. A real-world example of this is the GRADE algorithm that was used for university admissions. The model was trained on prior admissions data to determine which applications constituted a “good fit.” However, once the model was put into production, it simply overfit to prior admissions decisions rather than actually assessing candidate quality. If these decisions had been used in practice, they would only have amplified existing biases from admissions officers. 
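One rough way to check for amplification is to compare the gap in positive outcomes between two groups in the historical labels with the same gap in the model's predictions. The sketch below uses made-up data and hypothetical helper names, and it is not a reconstruction of how GRADE was analyzed:

```python
def positive_rate(outcomes, groups, group):
    """Fraction of positive outcomes (1s) among members of one group."""
    selected = [o for o, g in zip(outcomes, groups) if g == group]
    return sum(selected) / len(selected)

def rate_gap(outcomes, groups, group_a, group_b):
    """Difference in positive rate between group_a and group_b."""
    return positive_rate(outcomes, groups, group_a) - positive_rate(outcomes, groups, group_b)

# Made-up data: five applicants in each of two groups.
groups = ["a"] * 5 + ["b"] * 5
historical_labels = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]  # past decisions
model_predictions = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # model output

label_gap = rate_gap(historical_labels, groups, "a", "b")
prediction_gap = rate_gap(model_predictions, groups, "a", "b")

# The model amplifies the disparity if its gap exceeds the gap in the labels.
amplified = prediction_gap > label_gap
print(label_gap, prediction_gap, amplified)
```

Here the model favors group "a" by a wider margin than the historical decisions did, which is the signature of amplification rather than mere replication.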

Historical biases in the data can get amplified in the AI model. Photo by @chairulfajar_ on Unsplash

It is not just the ground truth labels of a dataset that can be biased; faulty data collection processes early in the model development lifecycle can also corrupt or bias data. This problem is known as measurement bias, and it can be common when a machine learning model is trained on data generated by complex data pipelines. Take Yelp’s restaurant review system as an example. Yelp allows restaurants to pay to promote their establishment on the Yelp platform, but this naturally affects how many people see advertisements for a given restaurant and thus who chooses to eat there. In this way, Yelp’s reviews may be unfairly biased towards larger restaurants in higher-income neighborhoods because its restaurant review and recommendation pipelines are conflated. 

Upon hearing the myriad ways that a model can be biased, a natural suggestion is to restrict its access: why not train models on data that don’t contain sensitive attributes like gender or race? Unfortunately, removing these attributes is often not enough. For example, early predictive policing algorithms did not have access to racial data when making predictions, but the models relied heavily on geographic data (e.g. zip code), which is correlated with race. In this way, models that are “blind” to demographic data like gender and race can still encode this information through other features that are statistically correlated with protected attributes. This phenomenon is known as proxy bias. It can be hard to disentangle proxy bias because a model’s input features are usually correlated with each other, but practitioners who carefully consider data provenance and seek out less biased alternative data to train their model can mitigate its effect. 
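A simple first check for proxy bias is to ask how well a supposedly neutral feature predicts the protected attribute on its own. The sketch below compares a baseline guess (always pick the most common group) against a guess informed by the feature. The zip code labels and group values are invented, `proxy_strength` is a hypothetical helper, and because the accuracy is measured on the same data it is computed from, it gives an optimistic estimate:

```python
from collections import Counter, defaultdict

def proxy_strength(feature_values, protected_values):
    """Return (baseline_accuracy, proxy_accuracy) for guessing the protected attribute.

    The baseline always guesses the most common protected value overall.
    The proxy guess uses the most common protected value within each feature value.
    """
    n = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n
    by_feature = defaultdict(Counter)
    for feature, protected in zip(feature_values, protected_values):
        by_feature[feature][protected] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_feature.values())
    return baseline, correct / n

# Invented data: a geographic feature and a protected attribute.
zip_codes = ["zip_1", "zip_1", "zip_1", "zip_1", "zip_2", "zip_2", "zip_2", "zip_2"]
protected = ["x", "x", "x", "y", "y", "y", "y", "x"]

baseline_accuracy, proxy_accuracy = proxy_strength(zip_codes, protected)
print(baseline_accuracy, proxy_accuracy)
```

The larger the jump from the baseline to the proxy accuracy, the more information about the protected attribute the feature carries, even though the attribute itself was never given to the model.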

How do I fix an unfair model?

There are plenty of ways that a well-intentioned data scientist can inadvertently train an unfair model. But with the right tools, it’s possible to both understand and mitigate these biases such that you can trust your model before deploying it. 

Check out the next blog, “Designing a fairness workflow”, in which we detail a fairness workflow that data scientists and validators can use to understand, measure, and debug biases within their models.


Divya Gopinath

Russell Holz

Last modified on August 30th, 2023