Google’s new neural-net LLM architecture separates memory components to control exploding costs of capacity and compute

MT HANNACH

A new neural network architecture developed by Google researchers could solve one of the big challenges of large language models (LLMs): expanding their memory at inference time without exploding memory and computation costs. Called Titans, the architecture allows models to find and store small but important pieces of information in long sequences at inference time.

Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to efficiently handle short- and long-term memory tasks. According to the researchers, LLMs that use long-term neural memory can scale to millions of tokens and outperform both conventional LLMs and alternatives such as Mamba while having far fewer parameters.

Attention layers and linear models

The classic transformer architecture used in LLMs relies on the self-attention mechanism to compute relationships between tokens. This is an effective technique for learning complex and granular patterns in token sequences. However, as sequence length grows, the computational and memory costs of computing and storing attention increase quadratically.
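
To make that quadratic scaling concrete, here is a minimal, illustrative PyTorch sketch of scaled dot-product self-attention (a generic textbook formulation, not Google's implementation): the score matrix has one entry per pair of tokens, so doubling the sequence length roughly quadruples the memory and compute it requires.

```python
import torch
import torch.nn.functional as F

def naive_self_attention(x, w_q, w_k, w_v):
    """Illustrative scaled dot-product self-attention.

    x: (seq_len, d_model) token embeddings.
    The score matrix is (seq_len, seq_len), so memory and compute
    grow quadratically with sequence length.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # (seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ v

seq_len, d_model = 1024, 64
x = torch.randn(seq_len, d_model)
w = [torch.randn(d_model, d_model) for _ in range(3)]
out = naive_self_attention(x, *w)  # doubling seq_len ~quadruples the score matrix
```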

More recent proposals involve alternative architectures which have linear complexity and can scale without exploding memory and compute costs. However, Google researchers say that linear models do not exhibit competitive performance compared to traditional transformers because they compress their contextual data and tend to miss important details.

The ideal architecture, they suggest, should have different memory components that can be coordinated to use existing knowledge, remember new facts, and learn abstractions from context.

“We argue that in an effective learning paradigm, similar to [the] human brain, there are distinct yet interconnected modules, each of which is responsible for a component crucial to the learning process,” the researchers write.

Long-term neural memory

“Memory is a confederation of systems – for example short-term, working and long-term memory – each serving a different function with different neural structures and each capable of functioning independently,” the researchers write.

To address the shortcomings of current language models, the researchers propose a “long-term neural memory” module capable of learning new information at the time of inference without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can memorize new facts during inference and dynamically adapt the memorization process based on the data encountered. This solves the generalization problem that other neural network architectures suffer from.

To decide what information is worth storing, the neural memory module uses the concept of “surprise.” The more a sequence of tokens differs from the type of information stored in the model weights and in existing memory, the more surprising it is and therefore worth remembering. This allows the module to use its limited memory efficiently and store only data elements that add useful information to what the model already knows.

To handle very long sequences of data, the neural memory module has an adaptive forgetting mechanism that allows it to remove information that is no longer needed, helping to manage the limited capacity of the memory.
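
As a rough illustration of the two ideas above, the sketch below implements a toy associative memory that is updated at inference time: the gradient of a reconstruction loss serves as the “surprise” signal, a momentum term accumulates past surprise, and a decay gate plays the role of forgetting. The linear memory, fixed gate values, and dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch

class SimpleNeuralMemory:
    """Toy linear associative memory updated at inference time.

    In the spirit of Titans' long-term memory: the gradient of a
    reconstruction loss measures "surprise", a momentum term carries
    past surprise, and a forgetting gate decays old content.
    """

    def __init__(self, dim, lr=0.1, momentum=0.9, forget=0.05):
        self.M = torch.zeros(dim, dim)   # memory weights
        self.S = torch.zeros(dim, dim)   # accumulated past surprise
        self.lr, self.momentum, self.forget = lr, momentum, forget

    def update(self, key, value):
        # Surprise: gradient of ||M k - v||^2 w.r.t. M, i.e. 2 (M k - v) k^T
        err = self.M @ key - value
        grad = 2.0 * torch.outer(err, key)
        # Momentum over past surprise, then write with a forgetting gate.
        self.S = self.momentum * self.S - self.lr * grad
        self.M = (1.0 - self.forget) * self.M + self.S
        return err.norm()                # larger norm = more surprising

    def recall(self, key):
        return self.M @ key

mem = SimpleNeuralMemory(dim=16)
k = torch.randn(16)
k = k / k.norm()                         # unit-norm key keeps this toy update stable
v = torch.randn(16)
for _ in range(50):
    surprise = mem.update(k, v)          # surprise shrinks as the fact is stored
print(surprise.item(), torch.dist(mem.recall(k), v).item())
```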

The memory module is complementary to the attention mechanism of current transformer models, which the researchers describe as “short-term memory modules, attending to the current context window size. On the other hand, our neural memory, with the ability to continuously learn data and store it in its weights, can play the role of a long-term memory.”

Titans architecture

Example of Titans architecture (source: arXiv)

The researchers describe Titans as a family of models that integrates existing transformer blocks with neural memory modules. The model has three key components: a “core” module, which acts as short-term memory and uses the classic attention mechanism to attend to the current segment of input tokens the model is processing; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, a set of learnable parameters that remain fixed after training and store time-independent knowledge.

The researchers propose different ways to connect the three components. But generally speaking, the main advantage of this architecture is that it allows the attention and memory modules to complement each other. For example, the attention layers can use historical and current context to determine which parts of the current context window should be stored in long-term memory. At the same time, long-term memory provides historical knowledge that is not present in the current attention context.
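
A simplified sketch of how the three components might be wired together, loosely following the description above: learnable persistent-memory tokens and content retrieved from a long-term memory module are concatenated with the current segment, and ordinary attention over that combined context acts as the short-term memory. The single attention layer and the plain MLP standing in for neural memory are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TitansStyleBlock(nn.Module):
    """Simplified sketch of composing the three Titans components
    (roughly in the spirit of a "memory as context" setup)."""

    def __init__(self, d_model=64, n_heads=4, n_persistent=4):
        super().__init__()
        # Persistent memory: learnable, input-independent tokens fixed after training.
        self.persistent = nn.Parameter(torch.randn(n_persistent, d_model))
        # Long-term memory stand-in: a small MLP queried with the current segment.
        self.long_term = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                                       nn.Linear(d_model, d_model))
        # Short-term memory: ordinary attention over the combined context.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, segment):
        # segment: (batch, seg_len, d_model) — the current chunk of tokens.
        b = segment.shape[0]
        persistent = self.persistent.unsqueeze(0).expand(b, -1, -1)
        retrieved = self.long_term(segment)            # stand-in for historical context
        context = torch.cat([persistent, retrieved, segment], dim=1)
        out, _ = self.attn(segment, context, context)  # attention decides what to use
        return out

block = TitansStyleBlock()
x = torch.randn(2, 128, 64)   # two sequences, 128-token segment
y = block(x)                  # (2, 128, 64)
```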

The researchers performed small-scale tests on Titans models ranging from 170 million to 760 million parameters on a wide range of tasks, including language modeling and long-sequence language tasks. They compared the performance of Titans to various transformer-based models, linear models such as Mamba, and hybrid models like Samba.

Titans (red line) outperform other models, including GPT-4, on long-sequence tasks in both few-shot and fine-tuned settings (source: arXiv)

Titans demonstrated strong language modeling performance compared to other models and outperformed transformers and linear models of similar sizes.

The difference in performance is particularly pronounced on tasks with long sequences, such as “needle in a haystack,” where the model must retrieve bits of information from a very long sequence, and BABILong, where the model must reason over facts distributed across very long documents. On these tasks, Titans outperformed models with orders of magnitude more parameters, including GPT-4 and GPT-4o-mini, as well as a Llama-3 model enhanced with retrieval-augmented generation (RAG).

Additionally, the researchers were able to expand the Titans context window to 2 million tokens while keeping memory costs modest.

The models still need to be tested at larger sizes, but the paper’s results show that researchers still haven’t reached the ceiling of the Titans’ potential.

What does this mean for enterprise applications?

With Google at the forefront of long-context models, this technique can be expected to find its way into both proprietary and open models such as Gemini and Gemma.

With LLMs supporting longer context windows, there is growing potential to create applications where new knowledge is integrated directly into the prompt instead of retrieved with techniques such as RAG. The cycle of developing and iterating on prompt-based applications is much faster than building complex RAG pipelines. At the same time, architectures like Titans can help reduce inference costs for very long sequences, allowing businesses to deploy LLM applications for more use cases.

Google plans to release PyTorch and JAX code to train and evaluate Titans models.
