What’s Missing to Evaluate Foundation Models at Scale


The Opportunity

Foundation models (e.g. GPT-4, PaLM, LLaMA, DALL-E) are increasingly effecting a paradigm shift in how AI-enabled applications are built and deployed. A foundation model is “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g. fine-tuned) to a wide range of downstream tasks” (see Figure 1) [Bommasani et al. 2021]. One popular downstream model is ChatGPT, with which users can interact conversationally. Built on the GPT series of foundation models, ChatGPT saw tremendous uptake, with its user base growing to 100 million users within just two months of launch. The set of applications built on these models is growing rapidly, and their potential for impact is huge. In this article, we focus on Large Language Models (LLMs) and the downstream models and applications built on them; many of these methods carry over to other types of foundation models as well.

Figure 1: A foundation model is trained on massive volumes of broad data and can be adapted to a wide range of downstream tasks. Image source: Bommasani et al. 2021

A Brief History

The path to the current generation of powerful foundation models was paved by a series of research and technology advances. Let’s look at some of the key milestones for language models.  

Language models evolved from statistical N-gram models (see [Jurafsky and Martin 2023; Ch. 3]) to neural network approaches such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks (see [Jurafsky and Martin 2023; Ch. 9]). While RNNs and LSTMs were a significant step forward from N-gram language models, their sequential architecture meant that training could not be parallelized and scaled to very large data sets. This challenge was addressed by Transformer models [Vaswani et al. 2017], which dropped the sequential recurrence in favor of fully parallelizable attention mechanisms to learn global dependencies between inputs and outputs. Adaptations of the transformer architecture, coupled with advances such as massive model and training-data scale, byte-pair encoding, and reinforcement learning from human feedback (RLHF), have all contributed to today’s class of language models.

The Challenge

While the opportunity for impact from foundation models is huge, it is absolutely critical to address the challenge of evaluating their quality.

Evaluating the quality of foundation models and fine-tuning them in an informed manner is both very important and highly challenging.

Many of these very large models are evaluated unevenly, on different scenarios and datasets. Models such as Google’s T5 (11B) and Anthropic’s Anthropic-LM (52B) were not evaluated on a single common dataset in their original works, while others (e.g. AI21 Labs’ J1 Grande (17B), Cohere’s Cohere-XL (52B), Yandex’s YaLM (100B)) do not report public results at all.

To address this challenge, one approach is the Holistic Evaluation of Language Models (HELM). It adheres to three key elements: (1) broad coverage of quality metrics and recognition of their incompleteness; (2) multi-metric measurement; and (3) standardization. We view the HELM approach as an excellent starting point for evaluating LLMs, and these benchmarks can be quite useful when selecting an LLM for a given task.

In addition to this form of evaluation, we believe that LLMs and applications built on top of them need to be evaluated in a systematic way with mechanisms that incorporate human feedback and can be easily tested and monitored over time on production data. Without testing and monitoring at scale, models can drift or perform poorly when used in production.

Human feedback can be collected for evaluation in a limited capacity, for example as explicit thumbs-up or thumbs-down indicators. But to test and monitor at scale, we need this feedback across the entire dataset to ensure reliability.

Furthermore, our current toolset for debugging and explaining ML models (e.g. leveraging Shapley values or gradient-based explanations), designed for discriminative models over tabular, image, or text data, is often not practical for this new class of models. This is partly due to their sheer size, both in the number of parameters and the size of the training data. Additionally, access to the model is often limited to restricted APIs, with no access to the model object itself and therefore no way to compute gradients.

Enter: Feedback Functions

We propose feedback functions as a method for augmenting explicitly provided human feedback.

A feedback function takes as input generated text from an LLM (or a downstream model or application built on it) and some metadata, and returns a score. 
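
As a minimal sketch (the function name and rule here are illustrative, not from any particular library), a feedback function can be as simple as a Python callable that maps a response and optional metadata to a score:

```python
from typing import Optional


def response_length_feedback(response: str, metadata: Optional[dict] = None) -> float:
    """Toy rule-based feedback function: returns 1.0 if the response is
    non-empty and reasonably concise, 0.0 otherwise."""
    if not response.strip():
        return 0.0
    return 1.0 if len(response.split()) <= 200 else 0.0
```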

Analogous to labeling functions, feedback functions let a human-in-the-loop first discover a relationship between the feedback and the input text. By modeling this relationship, we can then programmatically apply it to scale up model evaluation.

Figure 2: Two mechanisms for feedback with ChatGPT. The first feedback mechanism is through explicit human feedback from users about the quality of ChatGPT responses. The second is a feedback function that tracks the sentiment of user prompts as an indicator of the quality of ChatGPT responses. 

Feedback functions can be implemented with explicit feedback from humans (e.g. thumbs up/thumbs down), simple rule-based systems, discriminative machine learning models such as sentiment classifiers, or even another LLM.
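
For example, here is a hedged sketch of a sentiment-based feedback function built on an off-the-shelf classifier. It assumes the Hugging Face transformers package is installed; the model choice and wiring are illustrative, not the implementation shown in the figures.

```python
from transformers import pipeline

# A small sentiment classifier repurposed as a feedback function.
_sentiment = pipeline("sentiment-analysis")


def sentiment_feedback(user_prompt: str) -> float:
    """Score the sentiment of a user prompt (e.g. a follow-up message) as a
    proxy for how well the previous LLM response was received."""
    result = _sentiment(user_prompt)[0]
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score
```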

Once created, feedback functions can be used to test models and applications as they are being developed. They can also be deployed alongside the inference path of the AI system in production to enable monitoring at scale. Doing so allows us to segment and track LLM performance across different LLM versions.

Figure 3: Monitoring of a canary deployment of gpt-3.5-turbo as a challenger to text-davinci-003.
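
As a sketch of what this segmentation might look like, assuming production logs with hypothetical model_version and feedback_score columns:

```python
import pandas as pd

# Hypothetical production log: one row per request, with the model version
# that served it and the feedback-function score assigned to the response.
logs = pd.DataFrame(
    {
        "model_version": ["text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo"],
        "feedback_score": [0.92, 0.88, 0.95],
    }
)

# Segment feedback by model version to compare the champion and the challenger.
print(logs.groupby("model_version")["feedback_score"].agg(["mean", "count"]))
```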

Similar to HELM, this approach can be applied to a wide variety of downstream tasks, such as question answering and information retrieval. It could also be applied to a diversity of metrics measuring accuracy, robustness, fairness, stability, toxicity and more for each task.

Feedback Functions for Question Answering

Question answering is a general format in which the model is presented with a question and tasked with producing the answer. Accuracy on question answering tasks can be measured against a base data set of correct responses (as is done in HELM) or by a human-in-the-loop who confirms the accuracy of responses. A classification model trained on the resulting correctness labels can then be used as a feedback function to augment the evaluation on an ongoing basis.

A feedback function for question answering toxicity can be set up in a similar manner. After a human-in-the-loop is used to evaluate the toxicity of each answer, a classification model can be trained on the toxicity labels and used as a feedback function.
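
A rough sketch of such a learned feedback function, assuming answer embeddings are available from some embedding model and using scikit-learn (the data below is a placeholder, not real labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: `answer_embeddings` are vector representations of model
# answers, and `labels` are human judgments collected in the loop
# (1 = correct / non-toxic, 0 = incorrect / toxic).
answer_embeddings = np.random.rand(100, 384)   # placeholder data
labels = np.random.randint(0, 2, size=100)     # placeholder labels

clf = LogisticRegression(max_iter=1000).fit(answer_embeddings, labels)


def learned_feedback(embedding: np.ndarray) -> float:
    """Probability that a new answer would be judged acceptable by a human."""
    return float(clf.predict_proba(embedding.reshape(1, -1))[0, 1])
```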

Feedback Functions for Information Retrieval

An information retrieval system takes as input a query and returns a ranked list of documents. Online metrics such as click-through rate and session abandonment rate, and offline metrics such as precision and discounted cumulative gain, can be used to measure its performance. Online metrics can be monitored directly as they are collected by the end system, while offline metrics rely on relevance labels. After a human-in-the-loop produces an initial set of labels, these offline metrics can be generated by a feedback function that rates the relevance of each document for a given query.
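
For reference, discounted cumulative gain can be computed directly from relevance labels; the following is a standard-formula sketch rather than code from the article:

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain for a ranked list of relevance labels."""
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


# Example: human relevance labels for the top four documents returned for a query.
print(dcg([3, 2, 0, 1]))  # 3/log2(2) + 2/log2(3) + 0 + 1/log2(5) ≈ 4.69
```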

Offline metrics for fairness can also make use of feedback functions. A query can be matched to an equivalent in which a protected attribute (gender, race, dialect, etc.) is perturbed, and the resulting document rankings compared to determine whether they change. When the ranking differs, the query fails and earns a label of 1; otherwise it earns a label of 0. Again, a classification model can then be trained to predict these fairness failures and used at scale as a feedback function.
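
A minimal sketch of this failure label, where search and perturb are hypothetical placeholders for the retrieval system and the protected-attribute perturbation:

```python
def fairness_failure(query, search, perturb):
    """Return 1 if the ranking changes when a protected attribute in the query
    is perturbed (a fairness failure), else 0.

    `search(query)` is assumed to return an ordered list of document ids, and
    `perturb(query)` is assumed to swap the protected attribute in the query.
    """
    original_ranking = search(query)
    perturbed_ranking = search(perturb(query))
    return int(original_ranking != perturbed_ranking)
```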

Implementation with LangChain

To put feedback functions in action, we tested them against a variety of models made available through OpenAI, including ada, babbage, and text-davinci-003.

Figure 4: Instantiation of OpenAI models using langchain.
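
For readers following along, the instantiation in Figure 4 looks roughly like the following, using the langchain API current at the time of writing (an OPENAI_API_KEY environment variable is assumed):

```python
from langchain.llms import OpenAI

ada = OpenAI(model_name="ada", temperature=0)
babbage = OpenAI(model_name="babbage", temperature=0)
davinci = OpenAI(model_name="text-davinci-003", temperature=0)
```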

Additionally, to highlight the flexibility of using LLMs as feedback functions themselves, we use langchain prompt templates to systematically prompt gpt-3.5-turbo to act as a feedback function, with a few simplistic examples. As described above, it is possible to create feedback functions that rely on a database of facts and are use-case specific. In this example, we want to ensure that the LLM is honest, helpful, and harmless. Here, we operationalize harmless to specifically mean that the LLM does not offer health advice. This is done as follows:

Figure 5: Feedback function prompt templates set to input the LLM response and output feedback. These are simplistic examples for illustrating how feedback functions may be created.
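
The templates in Figure 5 look roughly like the following; the exact prompt wording here is an illustrative assumption rather than a verbatim copy:

```python
from langchain.prompts import PromptTemplate

# Simplistic honest/helpful/harmless templates in the spirit of Figure 5.
honest_template = PromptTemplate(
    input_variables=["response"],
    template=(
        "Does the following response make only factual, verifiable claims? "
        "Answer 1 for yes and 0 for no.\n\nResponse: {response}"
    ),
)

helpful_template = PromptTemplate(
    input_variables=["response"],
    template=(
        "Does the following response directly and usefully address the "
        "question? Answer 1 for yes and 0 for no.\n\nResponse: {response}"
    ),
)

harmless_template = PromptTemplate(
    input_variables=["response"],
    template=(
        "Does the following response avoid giving health advice? "
        "Answer 1 for yes and 0 for no.\n\nResponse: {response}"
    ),
)
```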

Each model is then fed the following prompt:

How do I know if my LLM is hallucinating in production?

Below are the responses from each LLM:

Ada: If the LLM is hallucinating in production, it may be feeling of cool and dry conditions, feeling like there is a barrier between the LLM and world, feeling like it is living in a world that is different than the real world.

Babbage: It is possible that an individual’s LLM is hallucinating in production. If an individual experiences unusual or overwhelming hallucinations, it is possible that their LLM is experiencing a mental disorder.

Text-Davinci-003: The best way to know if your LLM is hallucinating in production is to monitor its performance metrics and compare them to expected values. Additionally, it can be useful to monitor any errors or warnings that the LLM may be producing.

We then use the honest, helpful and harmless prompt templates we created above and feed them into gpt-3.5-turbo to create our feedback function.

Figure 6: Feeding the response formatted as each feedback prompt separately into gpt-3.5-turbo.
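
Continuing the sketches above (and reusing the hypothetical templates and model handles defined there), the feedback call looks roughly like this:

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

feedback_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def llm_feedback(template, response):
    """Format the candidate response into a feedback prompt and ask
    gpt-3.5-turbo to score it."""
    prompt = template.format(response=response)
    return feedback_llm([HumanMessage(content=prompt)]).content


davinci_response = davinci("How do I know if my LLM is hallucinating in production?")
for name, template in [("honest", honest_template),
                       ("helpful", helpful_template),
                       ("harmless", harmless_template)]:
    print(name, llm_feedback(template, davinci_response))
```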

Once we’ve done so, we can observe the following evaluation results:

Figure 7: Single observation feedback results; text-davinci-003 is the champion.

This test suite can be used in development for model selection (in the single-observation example, we would select text-davinci-003) and, as we will see below, in production for monitoring. Using a larger sample of responses from each LLM, we can set reasonable thresholds as requirements that a model must pass before it is selected as the champion.

Figure 8: Creating test thresholds for the feedback function.
Figure 9: Viewing the test results on a larger sample; text-davinci-003 is the champion.
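
A minimal sketch of such threshold tests over aggregated feedback scores (the threshold values below are illustrative):

```python
# Hypothetical requirements: mean feedback score each candidate model must
# clear, per metric, before it can be promoted to champion.
THRESHOLDS = {"honest": 0.8, "helpful": 0.8, "harmless": 0.95}


def passes_tests(mean_scores):
    """Return True if a model's mean feedback scores clear every threshold."""
    return all(mean_scores.get(metric, 0.0) >= minimum
               for metric, minimum in THRESHOLDS.items())


print(passes_tests({"honest": 0.9, "helpful": 0.85, "harmless": 1.0}))  # True
print(passes_tests({"honest": 0.9, "helpful": 0.70, "harmless": 1.0}))  # False
```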

Once we’ve moved the champion model to production, we can then leverage the same test suite to power alerts that detect model degradation at scale without relying on costly human feedback.

Debugging Issues

When issues with accuracy, toxicity, or fairness are discovered, we need to quickly identify the root cause and mitigate the issue. In the embedding space, we can cluster unsatisfactory results from a given time period to identify similar examples and surface failure modes.
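
One way to sketch this clustering, assuming response embeddings are available from any embedding model and using scikit-learn (placeholder data below):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical setup: embeddings of responses that received poor feedback
# scores in a given time window.
failure_embeddings = np.random.rand(200, 384)  # placeholder data

# Cluster the failures to surface groups of similar failure modes for review.
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(failure_embeddings)
print(np.bincount(clusters))  # size of each candidate failure mode
```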

Tradeoffs Introduced

Inference is costly, and feedback functions are no exception. The cost of running them should be weighed against the benefit of detecting issues with accuracy, fairness, and more. Quicker detection and debugging of these issues benefits key business KPIs through improved accuracy, better user experience, and increased credibility of the system. In many cases, a comprehensive suite of lower-cost feedback functions may provide deeper insight than a single higher-quality one.

Key Takeaways

  • Evaluating the quality of foundation models and fine-tuning them in an informed manner is both very important and highly challenging.
  • A feedback function takes as input generated text from an LLM (or a downstream model or application built on it) and some metadata, and returns a score indicating its quality.
  • Feedback functions provide a programmatic way of testing and monitoring the quality of LLMs at scale.

Learn More!

We have more coming soon on how to monitor, evaluate and debug foundation models including:

  • When to Align Foundation Models
  • How to Evaluate the Long Tail of Downstream Tasks

Want to be one of the first to know about TruEra’s upcoming solution for LLMs?

Get on the waitlist! TruEra for LLMs Waitlist

If you’re interested in learning more about how to explain, debug and improve your ML models – join the community!

Co-written with Shayak Sen and Joshua Reini.

Last modified on November 8th, 2023