Foundation Models Explained

Foundation models (e.g. GPT-4, PaLM 2, RoBERTa) are driving a paradigm shift in how AI-enabled applications are built and deployed. Trained on billions of examples, generally using self-supervision at scale, they can be adapted to a wide range of downstream tasks and are rapidly replacing task-specific models. These models serve as the building blocks of the next generation of AI applications. In this article, we explain how foundation models are built and how to overcome some of the challenges that arise along the way.

Steps to Build Foundation Models

  • Step 1 – Data gathering and preprocessing: The process begins with collecting a large, diverse dataset from many sources on the internet, typically billions of text documents such as books, articles, and web content. The data is then cleaned to remove irrelevant information and tokenized, that is, broken down into smaller units, either words or subwords.
  • Step 2 – Choosing the transformer architecture: Select an appropriate variant of the transformer architecture, such as the Generative Pre-trained Transformer (GPT) architecture. This architecture stacks many layers of self-attention and feed-forward neural networks, which let the model capture contextual relationships among words.
  • Step 3 – Initializing the model: Start by initializing the model with random weights, which gives the learning process a blank canvas. These weights are adjusted throughout training, using gradients computed by backpropagation.
  • Step 4 – Defining the pre-training objective: Establish a pre-training objective, typically a language modeling task. The model is trained to predict the next word in a sentence from the preceding words, which teaches it syntactic structure, grammatical rules, and contextual nuance.
  • Step 5 – Training: Feed the preprocessed data into the model and train it with a gradient-based optimization technique, such as stochastic gradient descent. The model’s parameters are updated iteratively to minimize the difference between the predicted and actual next words.
  • Step 6 – Model size and scalability: Model size is a crucial consideration. Larger models have more parameters, which allows them to capture more intricate patterns, but they also require more data, longer training cycles, and more computational resources. Striking a balance between model size and the compute available is imperative.
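
To make the steps above concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how a production foundation model is built: the corpus is two short sentences, the tokenizer splits on whitespace, and a single table of bigram logits stands in for the transformer. All names (corpus, logits, train, predict_next) are illustrative. The objective (predict the next token) and the update rule (stochastic gradient descent on cross-entropy loss) are the same ideas described in Steps 4 and 5.

```python
import math
import random

# Step 1: a toy corpus, split into word-level tokens
# (real systems use subword tokenizers and billions of documents).
corpus = "the cat sat on the mat . the dog sat on the rug ."
tokens = corpus.split()
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}
ids = [token_to_id[tok] for tok in tokens]
V = len(vocab)

# Steps 2-3: a stand-in for the architecture, randomly initialized.
# logits[prev][nxt] scores how likely token `nxt` is to follow token `prev`.
random.seed(0)
logits = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]


def softmax(row):
    """Turn a row of logits into a probability distribution."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Step 4: average cross-entropy of the true next token."""
    pairs = list(zip(ids[:-1], ids[1:]))
    return -sum(math.log(softmax(logits[p])[n]) for p, n in pairs) / len(pairs)


def train(epochs=200, lr=0.5):
    """Step 5: stochastic gradient descent, one (previous, next) pair at a time."""
    pairs = list(zip(ids[:-1], ids[1:]))
    for _ in range(epochs):
        random.shuffle(pairs)
        for prev, nxt in pairs:
            probs = softmax(logits[prev])
            for j in range(V):
                # Gradient of cross-entropy w.r.t. a logit: prob - one_hot.
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                logits[prev][j] -= lr * grad


def predict_next(word):
    """Return the most likely next token after `word`."""
    row = logits[token_to_id[word]]
    return vocab[max(range(V), key=lambda j: row[j])]


loss_before = average_loss()
train()
loss_after = average_loss()
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print("most likely token after 'sat':", predict_next("sat"))
```

In this corpus "sat" is always followed by "on", so the trained table learns that continuation, and the loss falls as training proceeds. A real model replaces the lookup table with stacked self-attention layers, which is what lets it use more than one word of context.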

Popular Models

These are some of the most popular foundation models:

  • GPT-4 (Generative pre-trained transformer 4): GPT-4, developed by OpenAI, is known for its exceptional natural language understanding and generation capabilities. It can perform a wide range of language tasks, from translation to text generation and question-answering. It is reported to have about 1.76 trillion parameters, a figure OpenAI has not confirmed, which would make it one of the largest models available. GPT-4’s main drawback is the cost of computation, which can be prohibitive for many organizations.
  • BERT (Bidirectional encoder representations from transformers): BERT, developed by Google, revolutionized NLP by introducing bidirectional contextual embeddings. It excels in various NLP tasks, including sentiment analysis, named entity recognition, and question-answering. BERT has been widely adopted and serves as a strong foundation for custom models. Although BERT is powerful, fine-tuning it effectively requires a substantial amount of training data and computational resources. It is also relatively large, which can be a disadvantage for deployment in resource-constrained environments.
  • RoBERTa (Robustly optimized BERT pre-training approach): RoBERTa is an optimized version of BERT, developed by Facebook AI. It was trained on more data and for longer, leading to improved performance in various NLP tasks. It’s known for its robustness and adaptability. As with BERT, RoBERTa’s primary disadvantages are the significant computational resources needed for fine-tuning and the deployment challenges that come with its size.
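
To put the remarks about model size into numbers, the sketch below estimates a transformer's parameter count from its shape. It relies on a common rule of thumb, not an exact formula: each layer holds roughly 12 × d² weights (about 4 × d² for the attention projections and 8 × d² for a feed-forward block with 4× expansion), plus a token-embedding table. Biases, layer norms, and position embeddings are ignored. The function name is illustrative; the layer counts, hidden sizes, and vocabulary size are the published BERT-base and BERT-large configurations.

```python
def estimate_transformer_params(num_layers, hidden_size, vocab_size):
    """Rough parameter count for a standard transformer stack.

    Per layer: ~4*d^2 for the attention projections (Q, K, V, output)
    plus ~8*d^2 for a feed-forward block with 4x expansion.
    Token embeddings add vocab_size * d. Biases, layer norms, and
    position embeddings are ignored, so the result is an underestimate.
    """
    per_layer = 12 * hidden_size ** 2
    embeddings = vocab_size * hidden_size
    return num_layers * per_layer + embeddings


bert_base = estimate_transformer_params(num_layers=12, hidden_size=768, vocab_size=30522)
bert_large = estimate_transformer_params(num_layers=24, hidden_size=1024, vocab_size=30522)
print(f"BERT-base:  ~{bert_base / 1e6:.0f}M parameters")
print(f"BERT-large: ~{bert_large / 1e6:.0f}M parameters")
```

The estimates come out slightly below the published figures of roughly 110 million and 340 million parameters, which is expected given the terms the approximation leaves out. The quadratic dependence on hidden size is the reason memory and compute requirements climb so quickly as models are scaled up.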


While foundation models offer immense potential, their development is not without challenges: 

  • Scale and computational resources: Training foundation models requires huge computational resources, including high-performance GPUs and TPUs, to process and optimize over the vast amount of training data. Smaller organizations or researchers with limited resources are unlikely to have enough funding to cover these costs. This challenge can, however, be overcome through collaboration: several research organizations have shared resources and expertise to build foundation models.
  • Bias: Foundation models learn from the data they are trained on and can amplify biases present in that data. To address this challenge, data scientists employ a range of bias mitigation techniques, including data augmentation, fairness constraints, alignment with human values, and model refusal. Alignment with human values involves continuous feedback and reinforcement learning, enabling the model to improve its responses in real-world contexts and reduce biased outputs. Model refusal serves as a protective measure: the model can decline user requests that are considered inappropriate or harmful. In addition, AI developers are encouraged to conduct regular audits and ongoing monitoring of their foundation models.
  • Data privacy: Foundation models are trained on massive datasets that may contain personal data, which makes anonymity hard to guarantee and creates a risk of exposing sensitive information. This risk can be mitigated with robust privacy-preserving mechanisms, including data anonymization, encryption, and stringent access controls. However, striking a balance between harnessing the power of foundation models and safeguarding individuals’ privacy rights remains an ongoing challenge.
  • Explainability: As foundation models grow larger and more complex, understanding their decision-making processes becomes increasingly difficult. Recently, a research team at Anthropic scaled influence functions (IF) to LLMs with up to 52 billion parameters while preserving the accuracy of their estimates, offering hope that we may be able to explain LLMs’ behaviors in the near future. Even so, explainability remains an important open area of research.
  • Generalization and adaptation: While foundation models are designed to generalize across various tasks, they usually perform poorly on specialized tasks without fine-tuning. Transfer learning techniques can partially address this problem. These techniques allow models to leverage their knowledge across tasks, enabling them to adapt more quickly and efficiently to new tasks with limited data.
  • Risk of generating harmful or false content: Deploying foundation models responsibly requires mechanisms to detect and prevent harmful or malicious outputs. A prevalent approach is to filter and preprocess the training data, removing offensive or inappropriate content. Adversarial training techniques can also improve the model’s capacity to recognize and reject toxic or malevolent inputs.
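
Two of the mitigations above, data anonymization and training-data filtering, can be illustrated with a small preprocessing pass. This is a deliberately simple sketch under stated assumptions: the regular expressions only catch obvious email addresses and phone numbers, the blocklist contains placeholder terms, and all names (anonymize, is_acceptable, preprocess) are hypothetical. Production pipelines typically rely on trained classifiers and dedicated PII detectors.

```python
import re

# Simple patterns for obvious emails and phone numbers (assumption: not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical placeholder terms; a real blocklist would be curated.
BLOCKLIST = {"badword", "slur"}


def anonymize(text):
    """Replace emails and phone numbers with neutral placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def is_acceptable(text):
    """Reject a document if it contains any blocklisted word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not (words & BLOCKLIST)


def preprocess(documents):
    """Drop unacceptable documents, then anonymize the rest."""
    return [anonymize(doc) for doc in documents if is_acceptable(doc)]


docs = [
    "Contact Jane at jane.doe@example.com or +1 (555) 010-9999.",
    "This sentence contains a badword and should be dropped.",
    "Foundation models are trained on large corpora.",
]
cleaned = preprocess(docs)
for doc in cleaned:
    print(doc)
```

The name "Jane" survives this pass, which shows the limit of pattern-based redaction: names and other free-form identifiers need entity recognition or similar techniques.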

Are you leveraging foundation models to build LLM-powered apps? 

If so, you can use TruLens to track and evaluate their performance. TruLens provides a range of feedback functions for the evaluation and monitoring of LLM-powered applications. A feedback function takes as input generated text from an LLM-powered app and some metadata and returns a score.
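
To illustrate the idea of a feedback function, here is a hypothetical sketch in plain Python. It is not the TruLens API: the function name, the metadata dictionary, and its "prompt" key are assumptions made for this example. It only shows the shape described above, with generated text and metadata going in and a score coming out.

```python
import re


def keyword_overlap_feedback(generated_text, metadata):
    """Hypothetical feedback function (not part of any library).

    Scores a response by the fraction of distinct prompt words that
    also appear in the generated text. Returns a float in [0, 1].
    """
    prompt_words = set(re.findall(r"[a-z0-9]+", metadata.get("prompt", "").lower()))
    response_words = set(re.findall(r"[a-z0-9]+", generated_text.lower()))
    if not prompt_words:
        return 0.0
    return len(prompt_words & response_words) / len(prompt_words)


score = keyword_overlap_feedback(
    "Foundation models are large pre-trained models.",
    {"prompt": "What are foundation models?"},
)
print(f"relevance score: {score:.2f}")
```

Word overlap is a crude proxy for relevance. It is used here only because it is easy to follow; practical feedback functions often use a model to judge qualities such as relevance or groundedness.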

Last modified on September 21st, 2023