Building and evaluating RAGs with query planning

Query planning engines, such as the Retriever Router Query Engine and Agent Query Planning, enable Large Language Models (LLMs) to provide more accurate and context-aware responses through structured query decomposition and resource management. In this piece, we explain how to build and evaluate Retrieval-Augmented Generation systems (RAGs) with query planning engines.

Building Query Planning Engines

The Llama-Index approach

Developers begin by importing libraries such as Llama-Index and TruLens, then configure access keys (e.g., OpenAI keys) for the model API and evaluation metrics (e.g., agreement with GPT-4) for performance assessment. To streamline evaluations, a monitoring dashboard is started ahead of time. Next, the relevant data is loaded and the configuration space is defined, for example by varying embeddings and chunk sizes, so that each combination can be evaluated in turn. This iteration is what fine-tunes the system. Test prompts are also set up to assess model responses. The strength of this approach lies in the systematic evaluation of responses against ChatGPT using TruLens, which checks that the model’s performance aligns with the intended goals.
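
The configuration sweep described above can be sketched in plain Python. This is a minimal illustration, not the Llama-Index or TruLens API: the embedding names, chunk sizes, test prompt, and the evaluate callback are all hypothetical placeholders, and chunking is done by word count for simplicity.

```python
from itertools import product

# Hypothetical configuration space; names and sizes are illustrative only.
EMBEDDINGS = ["embedding-model-a", "embedding-model-b"]
CHUNK_SIZES = [128, 256, 512]
TEST_PROMPTS = ["What does the report say about revenue?"]


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def build_configs() -> list[dict]:
    """Enumerate every (embedding, chunk size) pair to evaluate."""
    return [
        {"embedding": emb, "chunk_size": size}
        for emb, size in product(EMBEDDINGS, CHUNK_SIZES)
    ]


def run_sweep(document: str, evaluate) -> list[dict]:
    """Score each configuration on every test prompt.

    evaluate(config, chunks, prompt) is supplied by the caller and returns a
    float; in a real setup it would wrap the RAG pipeline and a feedback metric.
    """
    results = []
    for config in build_configs():
        chunks = chunk_text(document, config["chunk_size"])
        scores = [evaluate(config, chunks, prompt) for prompt in TEST_PROMPTS]
        results.append(
            {**config, "n_chunks": len(chunks), "mean_score": sum(scores) / len(scores)}
        )
    return results
```

Keeping the grid explicit makes it easy to compare runs side by side in a dashboard, since each result row carries the configuration that produced it.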

Retriever router query engine approach

The Retriever Router Query Engine approach adds another dimension to query planning, with a focus on efficiently handling extensive choice sets. The process starts with loading the data and transforming it into nodes, which are then inserted into a DocumentStore. To optimize retrieval efficiency, summary indices and vector indices are defined over those nodes. For each index, a dedicated Query Engine is built and wrapped in a QueryEngineTool, providing a structured approach to resource management and response retrieval.
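
To make the pipeline concrete, here is a toy sketch of the same steps using only the standard library. The classes and functions below (Node, ToyQueryEngineTool, summary_engine, vector_engine) are hypothetical stand-ins, not the Llama-Index classes, and the "vector" engine uses word overlap in place of real embeddings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    node_id: str
    text: str


@dataclass
class ToyQueryEngineTool:
    """Hypothetical wrapper pairing a query function with a description."""
    name: str
    description: str
    query: Callable[[str], str]


def to_nodes(doc_id: str, text: str) -> list[Node]:
    """Create one node per non-empty line of the document."""
    parts = [p.strip() for p in text.split("\n") if p.strip()]
    return [Node(f"{doc_id}-{i}", p) for i, p in enumerate(parts)]


def summary_engine(nodes: list[Node]) -> Callable[[str], str]:
    """Summary-style engine: reads every node regardless of the question."""
    def query(_question: str) -> str:
        return " ".join(n.text for n in nodes)
    return query


def vector_engine(nodes: list[Node], top_k: int = 1) -> Callable[[str], str]:
    """Vector-style engine, approximated by word overlap instead of embeddings."""
    def query(question: str) -> str:
        q_words = set(question.lower().split())
        ranked = sorted(
            nodes,
            key=lambda n: len(q_words & set(n.text.lower().split())),
            reverse=True,
        )
        return " ".join(n.text for n in ranked[:top_k])
    return query


# Load data, transform it into nodes, and insert the nodes into a shared store.
nodes = to_nodes("doc", "Revenue grew ten percent.\nHeadcount stayed flat.")
docstore = {n.node_id: n for n in nodes}

# One engine per index type, each wrapped in a tool with a description.
stored = list(docstore.values())
tools = [
    ToyQueryEngineTool("summary", "Useful for summarizing the whole document.", summary_engine(stored)),
    ToyQueryEngineTool("vector", "Useful for looking up specific facts.", vector_engine(stored)),
]
```

The description attached to each tool matters: it is the text a router would later use to decide which engine should handle a given question.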

This approach is particularly well suited to managing large choice sets. The Retriever Router Query Engine dynamically retrieves the most relevant query engines at query time, using an object index that manages the query engine tools. Because only the retrieved tools are considered for each query, resource use stays bounded while the system can still handle a wide range of scenarios.
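
The query-time retrieval of tools can be illustrated with a small sketch. ToyToolIndex is a hypothetical class, not the Llama-Index object index, and it ranks tools by bag-of-words cosine similarity between the question and each tool description rather than by learned embeddings.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyToolIndex:
    """Indexes tools by description so only the top-k are considered per query."""

    def __init__(self, tool_descriptions: dict[str, str]):
        self._vectors = {name: bow(desc) for name, desc in tool_descriptions.items()}

    def retrieve(self, question: str, top_k: int = 1) -> list[str]:
        """Return the names of the top_k tools most similar to the question."""
        q = bow(question)
        ranked = sorted(
            self._vectors,
            key=lambda name: cosine(q, self._vectors[name]),
            reverse=True,
        )
        return ranked[:top_k]


# Hypothetical tool names and descriptions.
index = ToyToolIndex({
    "summary_tool": "summarize the whole document at a high level",
    "vector_tool": "look up a specific fact or figure in the document",
    "sql_tool": "run aggregate queries over tabular sales data",
})
```

With many tools registered, the selector only has to choose among the top_k candidates returned by retrieve, which is what keeps the choice set manageable.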

The OpenAI approach

The OpenAI Agent Query Planning approach takes query planning a step further by integrating it into the OpenAI Agent workflow. This integration is made possible by a QueryPlanTool, which is the cornerstone of query planning within this framework.

Implementing this approach requires adding a QueryPlanTool to an OpenAI Agent. This tool is designed to accept a set of other tools as input, enabling the creation of a query plan Directed Acyclic Graph (DAG) using QueryNode objects. When invoking the tool, the agent defines the structure of the graph through the function signature, laying out the path to the response. The tool then executes the DAG, orchestrating the various tools so that their outputs combine into a coherent, context-aware response. As an illustration, consider queries related to financial filings. The process begins with data loading, followed by the creation of a vector index and a query engine for each document. Finally, the QueryPlanTool creates a query plan DAG that retrieves the necessary information from those engines.
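
The idea of executing a query plan DAG can be sketched as follows. This is an assumption-laden toy, not the QueryPlanTool implementation: ToyQueryNode, execute_plan, the tool names, and the canned answers are all hypothetical, and the standard-library graphlib module is used to order nodes so that dependencies run first.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class ToyQueryNode:
    """Hypothetical plan node: a question, the tool to ask, and upstream nodes."""
    node_id: int
    query: str
    tool_name: str
    dependencies: list[int] = field(default_factory=list)


def execute_plan(
    nodes: list[ToyQueryNode],
    tools: dict[str, Callable[[str], str]],
) -> dict[int, str]:
    """Run nodes in dependency order, passing upstream answers as context."""
    by_id = {n.node_id: n for n in nodes}
    # TopologicalSorter takes a mapping of node -> predecessors and raises
    # graphlib.CycleError if the plan is not acyclic.
    graph = {n.node_id: set(n.dependencies) for n in nodes}
    answers: dict[int, str] = {}
    for node_id in TopologicalSorter(graph).static_order():
        node = by_id[node_id]
        context = " ".join(answers[d] for d in node.dependencies)
        prompt = f"{node.query} | context: {context}" if context else node.query
        answers[node_id] = tools[node.tool_name](prompt)
    return answers


# Hypothetical tools standing in for per-filing query engines.
tools = {
    "filing_q1": lambda q: "Q1 revenue: 10",
    "filing_q2": lambda q: "Q2 revenue: 12",
    "combine": lambda q: f"combined({q})",
}

plan = [
    ToyQueryNode(1, "What was Q1 revenue?", "filing_q1"),
    ToyQueryNode(2, "What was Q2 revenue?", "filing_q2"),
    ToyQueryNode(3, "Compare the two quarters.", "combine", dependencies=[1, 2]),
]
answers = execute_plan(plan, tools)
```

Nodes 1 and 2 have no dependencies and could run in either order; node 3 only runs once both answers are available, which is the property the DAG structure guarantees.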

Challenges and solutions

While query planning is a powerful technique for enhancing the capabilities of Large Language Models (LLMs), it is not without challenges:

  • Data quality and consistency: The quality and consistency of the data used for training and evaluation can significantly impact the performance of RAGs. Inaccurate or biased data can lead to suboptimal responses. To solve this, we can implement robust data preprocessing and cleaning techniques to ensure the quality and consistency of the training data. This may involve data validation, error correction, and bias detection and mitigation.
  • Resource scalability: As the size and complexity of RAGs grow, resource scalability becomes a concern. Large-scale RAGs may require substantial computational resources and memory. Here, distributed computing and cloud-based solutions that scale resources dynamically with the demands of the RAG may help. This enables more efficient handling of resource-intensive tasks and lets the system adapt to varying workloads.
  • Evaluation metrics: Assessing the performance of RAGs can be challenging, as traditional metrics may not fully capture the nuances of context-aware responses. Developing custom evaluation metrics that are tailored to the specific objectives of the RAG is essential. This may require human evaluation, where human judges assess the quality and relevance of RAG-generated responses in real-world scenarios.
  • Handling complex user queries: Complex user queries may involve multiple sub-questions or require nuanced responses. To handle this challenge, we can break down complex queries into sub-questions using natural language processing techniques. Another option is to create a structured query plan that addresses each sub-question individually, allowing the RAG to generate coherent responses by aggregating the results.
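
As one example of the custom evaluation metrics mentioned in the list above, the sketch below computes a token-level F1 score between a generated answer and a reference answer. It is only an illustration under simple assumptions (whitespace tokenization, lowercase matching); the function name token_f1 is our own, and such a lexical score would complement, not replace, human or model-based judgments.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A score like this is cheap to compute across a whole test set, so it can flag regressions early and leave the more expensive human review for borderline cases.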


Query planning is a powerful technique for enhancing LLMs such as ChatGPT with context-relevant responses. It can be implemented in several ways, including decomposing complex user queries into sub-questions and efficiently managing extensive choice sets. By using these techniques, we can get more out of LLM-driven systems, paving the way for more context-aware and effective responses across a wide range of applications.

Last modified on November 8th, 2023