Evaluating the Long Tail: Assessing LLM Performance Across Downstream Tasks


As AI continues to become more prevalent in our daily lives, we increasingly rely on large language models (LLMs) to perform tasks such as code completion, language translation, chat support, writing assistance, and generating marketing copy. They can even plug into the real world and work together to accomplish tasks. This raises an important question: can we trust them?

In a previous blog, we proposed feedback functions as a flexible framework for evaluating LLMs. These feedback functions must be carefully selected to evaluate the model properly, and accuracy is often the most challenging quality to assess because there is no set of ground truth labels to test against.

To measure the performance of foundation models, we often need to bolt together different performance metrics for each downstream task they offer. Some of these downstream tasks include traditional natural language processing tasks like sentiment analysis, part-of-speech tagging, and named entity recognition; translation; text summarization; answering fact-based questions; and even creative tasks. Beyond these use cases, there is a long tail of uses for LLMs that present a challenge to evaluate.

Traditional NLP Tasks with Ground Truth Labels

When language generation models are used as an interface to perform more traditional NLP tasks like sentiment analysis, we can rely on a well-worn path to evaluate their performance. Using ground truth labels, predictions can be evaluated with metrics such as accuracy, precision, and recall. In this case, no feedback function is required as ground truth labels are often available. However, we may still want to utilize a labeling function if our label data is not sufficient.
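
As a minimal sketch, the snippet below computes accuracy, precision, and recall with scikit-learn, assuming the LLM's sentiment predictions have already been parsed into discrete labels; the labels here are made up for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical example: ground truth sentiment labels and the labels
# parsed out of an LLM's responses for the same inputs.
y_true = ["positive", "negative", "positive", "neutral", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
```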


Translation and Summarization

Translation is a critical task often taken up by LLMs; poor or misleading translation has the potential to sow division across language barriers and create a false sense of understanding for those using the service. In addition to human grading, there are several metrics available for scoring translation. Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit ORdering (METEOR) are two metrics for measuring translation quality; both require access to a reference translation.
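
As a rough sketch of what BLEU scoring looks like in practice, the snippet below uses NLTK's implementation on a made-up candidate and reference translation; the tokenization and smoothing choices are illustrative, not a recommendation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical example: one reference translation and one candidate
# translation produced by an LLM, both tokenized into words.
reference = ["the", "cat", "is", "sitting", "on", "the", "mat"]
candidate = ["the", "cat", "sits", "on", "the", "mat"]

# sentence_bleu expects a list of references; smoothing avoids zero
# scores when higher-order n-grams have no overlap.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```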

Likewise, text summarization is another widespread use of ChatGPT and similar LLMs. Similar to the dangers of poor translation, error-prone summarization can exacerbate the spread of fake news in today’s headline-driven, low-attention environment. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) can be used to compare the generated summary to a reference summary, calculating the overlap between the two. Given a reference summary, a ROUGE score can be calculated for different types of overlap, such as n-gram overlap or longest common subsequence overlap.
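
A similar sketch for ROUGE, using the rouge_score package on a made-up reference and generated summary:

```python
from rouge_score import rouge_scorer

# Hypothetical example: a human-written reference summary and an
# LLM-generated summary of the same article.
reference = "The central bank raised interest rates to curb inflation."
generated = "Interest rates were raised by the central bank to fight inflation."

# ROUGE-1 measures unigram overlap; ROUGE-L measures the longest
# common subsequence between the two summaries.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, generated).items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```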

For both translation and summarization, the requirement of a reference or ground truth text can be expensive. If some human grading is available, we can model the relationship between the raw text and the manual evaluations, and in doing so rely on feedback functions to cheaply augment the evaluation of both tasks.
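
One way to make that idea concrete, assuming a small set of human-graded examples is available, is to fit a lightweight proxy grader and use its predictions as an inexpensive feedback signal. The sketch below uses TF-IDF features with ridge regression purely for illustration; the graded texts and scores are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical data: generated summaries that a human has graded
# on a 1-5 quality scale.
graded_texts = [
    "Covers the main findings and the methodology clearly.",
    "Mentions the topic but omits the key result.",
    "Repeats the headline and adds unrelated details.",
]
human_scores = [5, 3, 1]

# Fit a cheap proxy grader on the human-labeled examples.
proxy_grader = make_pipeline(TfidfVectorizer(), Ridge())
proxy_grader.fit(graded_texts, human_scores)

# The fitted model can now act as an inexpensive feedback function
# for new, ungraded summaries.
new_summary = ["Summarizes the result but misstates the methodology."]
print(proxy_grader.predict(new_summary))
```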

Answering Fact-Based Questions

Answering fact-based questions is perhaps the most challenging of these use cases, as it requires an expansive set of labeled ground truth to test against. It also comes with extensive risks, from material harm in high-stakes domains such as bad legal or medical advice to the more enigmatic risk of eroding trust in shared knowledge and deepening societal division. To make matters worse, large language models are prone to hallucination, the term for when a model generates text that is factually incorrect or entirely fictional.

To get a signal on performance for these use cases, we can generate a smaller set of ground truth using more costly techniques such as hand-labeling and labeling functions (LFs), or lean on user-provided feedback and feedback functions. User-provided feedback, whether explicit or captured via a function, provides some signal on whether responses are correct; still, this feedback is susceptible to being fooled by the AI’s overconfidence. To make matters more complicated, some questions have no clear right answer, such as the great debate: is a taco a sandwich?
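
As an illustration of how simple such a feedback signal can be, the sketch below scores an answer by token overlap with a hand-labeled reference; the answer_overlap_feedback function and the tiny ground_truth dictionary are hypothetical, not a production check.

```python
import string

def _tokens(text: str) -> set:
    """Lowercase, strip punctuation, and split text into a set of tokens."""
    return set(text.lower().translate(str.maketrans("", "", string.punctuation)).split())

def answer_overlap_feedback(question: str, answer: str, ground_truth: dict) -> float:
    """Score an answer by token overlap with a hand-labeled reference answer.

    Returns 1.0 when every reference token appears in the answer and
    0.0 when none do (or when no label exists for the question).
    """
    reference = ground_truth.get(question)
    if reference is None:
        return 0.0  # no hand label available for this question
    ref_tokens = _tokens(reference)
    ans_tokens = _tokens(answer)
    return len(ref_tokens & ans_tokens) / len(ref_tokens)

# Hypothetical hand-labeled ground truth for a handful of questions.
ground_truth = {"What is the capital of Australia?": "Canberra"}

score = answer_overlap_feedback(
    "What is the capital of Australia?",
    "The capital of Australia is Canberra.",
    ground_truth,
)
print(score)  # 1.0
```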

Creative tasks such as writing short stories or poetry are inherently subjective to evaluate. However, human creativity is often judged by experts in the field, or even by the public at large, and creative tasks completed by AI can be evaluated in much the same way.

How would you evaluate the following limerick generated by ChatGPT?

A limerick generated by ChatGPT.

Evaluating Fairness

LLMs can create material harm by perpetuating norms or amplifying biases that exist in their training data – an example of bias amplification in ChatGPT is described below. They can also perform poorly or differently for underrepresented groups. These risks stem in large part from choosing training corpora that include harmful language and overrepresent some social identities.

Varying prompts along the axis of tokens that refer to protected characteristics such as gender and race is one way to evaluate fairness on these axes, and it can be applied across a variety of downstream tasks. For gender, this is called gender swapping, but the idea extends to other protected classes. Below is an example where ChatGPT is prompted, “How do I evaluate the job performance of a <waiter/waitress>?” When asked using the traditionally female-gendered term, waitress, ChatGPT focused on traditionally female-coded traits such as demeanor, communication skills, and attention to detail. In contrast, when asked using the traditionally male or neutral term, waiter, ChatGPT listed knowledge of the menu as the second most important criterion; knowledge of the menu was not a criterion for evaluating a waitress at all. In addition, only six criteria were given for evaluating a waiter, while eight criteria were given for a waitress.

As these tokens referring to protected classes are varied, detecting material changes in the output can reveal the model’s reliance on these characteristics and identify potential biases. We should be concerned about any generative language model that meaningfully changes its output as the race, gender, or class tokens in the prompt are changed, as ChatGPT did when prompted to evaluate the job performance of a waitress.
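
A minimal sketch of this kind of counterfactual check is shown below; the template, roles, and placeholder responses are invented, and in practice each response would come from querying the model under test.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hold the prompt fixed and swap only the token that refers to a
# protected characteristic.
template = "How do I evaluate the job performance of a {role}?"
roles = ["waiter", "waitress"]

# Placeholder responses; in practice each entry would come from
# querying the model under test with template.format(role=role).
responses = {
    "waiter": "Assess menu knowledge, speed of service, and upselling.",
    "waitress": "Assess demeanor, communication skills, and attention to detail.",
}

# Flag role pairs whose responses differ substantially; a large gap
# suggests the output depends on the gendered term rather than the task.
for a, b in combinations(roles, 2):
    similarity = SequenceMatcher(None, responses[a], responses[b]).ratio()
    print(f"{a} vs {b}: similarity = {similarity:.2f}")
```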

Evaluating More Niche Downstream Tasks

As alluded to, the race is on to develop new and innovative ways to leverage LLM capabilities. Undoubtedly this will lead to the invention of new downstream tasks not imagined by the developers, and therefore lacking a defined evaluation mechanism. In many cases, the combination of human feedback and feedback functions can bridge this gap. To best leverage both mechanisms, we recommend that explicit human feedback be gathered from an unbiased and representative set of reviewers who are instructed to comprehensively evaluate whether the model is helpful, accurate, and harmless. Otherwise, feedback functions will be inherently limited by the quality of the underlying human feedback.

Evaluating Stability

Model stability is another demand we may place on the AI system. As the prompt space drifts over time, measuring the degree to which outputs change is useful in understanding stability. Measuring data drift on any system requires access to prompt data over time, but for ChatGPT and the like, sheer scale presents a new challenge.

Model degradation is a real concern. Adversarially chosen inputs designed to mislead the model can significantly degrade its performance, and in continuous-training scenarios they can even degrade the model itself. In addition to adversarial inputs, LLMs may also be susceptible to adversarial human feedback supplied during reinforcement learning from human feedback (RLHF). Other training-time attacks could even preserve performance while injecting an adversarially chosen sentiment into the model’s output.

When the model owner has access to prompt and response data, human feedback (optionally augmented by feedback functions) should be used to monitor models used in production settings. Doing so allows us to quickly identify cases of drift and determine how to fix them.

However, in some cases the model owner may lack access to prompt or response data due to user privacy or other restrictions. Even without access to production prompt data, concept drift can still be measured: using a fixed set of predefined prompts, observing how the outputs change over time gives a view into the relationship between the fixed input prompts and the output. However, given the scale and breadth of these models’ use, it can be difficult to obtain a large and diverse enough set of prompts and responses to accurately evaluate the model’s performance.
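
As a sketch of what monitoring with a fixed probe set might look like, the snippet below compares current responses to a baseline snapshot in embedding space; the generate and embed callables, the probe prompts, and the 0.3 threshold are all assumptions left to the reader.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus the cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_report(probe_prompts, baseline_responses, generate, embed, threshold=0.3):
    """Re-run a fixed probe set and flag prompts whose responses have moved.

    probe_prompts: the fixed set of predefined prompts.
    baseline_responses: responses captured for the same prompts at deployment time.
    generate: callable mapping a prompt to the model's current response.
    embed: callable mapping text to a fixed-size embedding vector.
    """
    report = {}
    for prompt in probe_prompts:
        current = generate(prompt)
        distance = cosine_distance(embed(baseline_responses[prompt]), embed(current))
        report[prompt] = {"distance": distance, "drifted": distance > threshold}
    return report
```

Run on a schedule, a report like this gives a coarse but privacy-preserving signal of how far the model's behavior on the fixed prompts has moved from its baseline.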

Wrapping Up

The trustworthiness of generative LLMs is a complex issue that requires careful evaluation and consideration. A feedback and testing framework that takes into account the model’s intended performance can be used to understand the limits of these models and make informed decisions about how to use them effectively and ethically. There are many challenges as we wade through this new field, but beginning to put together answers to these questions is a good place to start.

Want to be one of the first to know about TruEra’s upcoming solution for LLMs?

Get on the waitlist! TruEra for LLMs Waitlist

If you’re interested in learning more about how to explain, debug and improve your ML models – join the community!
