Retrieval augmented generation (RAG) has become the de facto way to customize large language models (LLMs) for tailored insights. However, RAG has upfront technical costs and can be slow. Now, with advances in long-context LLMs, companies can bypass RAG by inserting all proprietary information into the prompt.
A new study by researchers at National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can create custom applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and effective replacement for RAG in enterprise settings where the body of knowledge fits within the model’s context window.
The limits of RAG
RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the request and adds them to the prompt as context, allowing the LLM to craft more accurate responses.
However, RAG introduces several limitations to LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking stage. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can harm the retrieval process.
And in general, RAG adds complexity to the LLM application, requiring the development, integration, and maintenance of additional components. The additional overhead slows down the development process.
Cache-augmented retrieval
The alternative to developing a RAG pipeline is to insert the entire corpus of documents into the prompt and let the model choose the bits that are relevant to the query. This approach removes the complexity of the RAG pipeline and the problems caused by retrieval errors.
However, front-loading all documents into the prompt presents three key challenges. First, long prompts will slow down the model and increase inference costs. Second, the length of the LLM’s context window sets limits on the number of documents that fit in the prompt. And finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its answers. So, simply stuffing all your documents into the prompt instead of choosing the most relevant ones can end up hurting the model’s performance.
The proposed CAG approach leverages three key trends to overcome these challenges.
First, advanced caching techniques are making it faster and cheaper to process prompt templates. The premise of CAG is that the knowledge documents will be included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so when requests arrive. This upfront computation reduces the time needed to process user requests.
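The following is a minimal sketch of this idea using the Hugging Face Transformers library, not the researchers’ actual implementation. The model name, the knowledge file, and the cache-handling details are assumptions, and the exact cache API can differ between library versions.

```python
# Sketch of CAG-style KV-cache precomputation (assumptions: model name,
# knowledge file, and DynamicCache usage; APIs vary across versions).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

knowledge = open("company_docs.txt").read()  # hypothetical knowledge corpus
doc_ids = tokenizer(knowledge, return_tensors="pt").input_ids.to(model.device)

# One-time cost: run the documents through the model once and keep the
# key-value (attention) cache for their tokens.
doc_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=doc_ids, past_key_values=doc_cache, use_cache=True)

def answer(question: str) -> str:
    # Per-query cost: only the question tokens are processed from scratch;
    # the document tokens reuse the precomputed cache. Copying the cache
    # keeps the original clean for the next question.
    cache = copy.deepcopy(doc_cache)
    q_ids = tokenizer(
        question, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    full_ids = torch.cat([doc_ids, q_ids], dim=-1)
    out = model.generate(full_ids, past_key_values=cache, max_new_tokens=200)
    return tokenizer.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)

print(answer("What is our refund policy for enterprise customers?"))
```

The pattern is the same one the researchers describe: pay the cost of encoding the knowledge documents once, then reuse that cache for every incoming question. Managed LLM APIs do the equivalent behind the scenes with their prompt-caching features.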
Major LLM providers such as OpenAI, Anthropic, and Google offer prompt caching features for the repetitive parts of your prompt, which can include the knowledge documents and instructions you insert at the beginning of the prompt. Anthropic says caching can cut costs by up to 90% and latency by up to 85% on the cached portions of your prompt. Equivalent caching features have been developed for open-source LLM hosting platforms.
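As a hedged illustration of provider-side caching, here is roughly what this looks like with Anthropic’s Python SDK: the static knowledge block is marked for caching so later requests can reuse it. The model name, file name, and question are placeholders, and caching parameters may change between SDK versions, so check the current documentation.

```python
# Sketch of prompt caching with Anthropic's Python SDK (placeholder model,
# document, and question; verify current caching parameters in the docs).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

KNOWLEDGE_DOCS = open("company_docs.txt").read()  # hypothetical static corpus

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Answer questions using only the documents below.\n\n" + KNOWLEDGE_DOCS,
            # Marks this block for caching so subsequent requests reuse it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize our Q3 churn numbers."}],
)
print(response.content[0].text)
```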
Second, long-context LLMs make it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, while GPT-4o supports 128,000 tokens and Gemini supports up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.
Finally, advanced training methods are enabling models to do better retrieval, reasoning, and question answering over very long sequences. Over the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench, and RULER. These benchmarks test LLMs on hard problems such as retrieving multiple facts and answering multi-hop questions. There is still room for improvement in this area, but AI labs continue to make progress.
As newer generations of models continue to expand their context windows, they will be able to process larger collections of knowledge. Additionally, we can expect models to keep improving their abilities to extract and use relevant information from long contexts.
“These two trends will significantly expand the usability of our approach, allowing it to handle more complex and diverse applications,” the researchers write. “Therefore, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.”
RAG vs. CAG
To compare RAG and CAG, the researchers ran experiments on two widely recognized question-answering benchmarks: SQuAD, which focuses on context-aware Q&A over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.
They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use to answer the question. Their experiments show that CAG outperformed both RAG systems in most situations.
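To make the contrast concrete, here is a rough sketch of how the two conditions differ in prompt construction, using the rank_bm25 package for the sparse retriever. The documents, the question, the value of k, and the final generation call are illustrative placeholders, not the researchers’ code.

```python
# Rough sketch of the two conditions' prompt construction (placeholder
# documents, question, k, and generation step; not the paper's code).
from rank_bm25 import BM25Okapi

documents = ["passage one ...", "passage two ...", "passage three ..."]  # hypothetical corpus
question = "Which team won the 2015 final?"

# RAG condition: a BM25 retriever picks the top-k passages for the prompt.
bm25 = BM25Okapi([doc.split() for doc in documents])
top_passages = bm25.get_top_n(question.split(), documents, n=2)
rag_prompt = "\n\n".join(top_passages) + f"\n\nQuestion: {question}\nAnswer:"

# CAG condition: every passage goes into the (cached) prompt, and the model
# decides on its own which parts are relevant.
cag_prompt = "\n\n".join(documents) + f"\n\nQuestion: {question}\nAnswer:"

# Both prompts would then be sent to the same LLM, e.g.:
# answer = generate(llm, rag_prompt)  # hypothetical generation helper
```

The only difference between the conditions is what ends up in the prompt; CAG additionally precomputes the attention cache for the document portion, as shown earlier.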
“By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning over all relevant information,” the researchers write. “This advantage is particularly evident in scenarios in which RAG systems might recover incomplete or irrelevant passages, leading to suboptimal response generation.”
CAG also significantly reduces the response generation time, especially as the length of the reference text increases.
That said, CAG is not a silver bullet and should be used with caution. It is well suited to settings where the knowledge base does not change often and is small enough to fit within the model’s context window. Companies should also watch out for cases where their documents contain conflicting facts depending on the context, which could confuse the model during inference.
The best way to determine if CAG is suitable for your use case is to run some experiments. Fortunately, implementing CAG is very simple and should always be considered as a first step before investing in more development-intensive RAG solutions.