Google AI introduces an efficient machine learning method for scaling Transformer-based large language models (LLMs) to infinitely long inputs

Written By Adarsh Shankar Jha

Memory is important to intelligence, as it helps recall past experiences and apply them to current situations. However, due to the way their attention mechanism works, both conventional Transformer models and Transformer-based Large Language Models (LLMs) have context-dependent memory limitations: the memory consumption and computation time of the attention mechanism both grow quadratically with the length of the input.
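
For intuition, here is a minimal sketch of where the quadratic cost comes from: the score matrix of standard dot-product attention has one entry per pair of positions. The snippet uses NumPy with toy dimensions chosen purely for illustration.

```python
import numpy as np

def attention_scores(Q, K):
    """Standard dot-product attention scores: an (n, n) matrix for n tokens."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

n, d = 4096, 64                  # toy sequence length and head dimension
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

scores = attention_scores(Q, K)  # shape (n, n): memory and compute grow as n^2
print(scores.shape)              # (4096, 4096), i.e. ~16.8 million entries for 4K tokens
```

Doubling the input length quadruples the size of this matrix, which is exactly the scaling problem this work addresses.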

Compressive memory systems present a viable alternative, aiming to be more efficient and scalable for handling very long sequences. Unlike classical attention mechanisms, whose memory must expand with the length of the input sequence, compressive memory systems keep storage and computation costs in check by using a fixed number of parameters to store and retrieve information.
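
As a rough illustration, a fixed-size associative memory of the kind used in linear attention can be written as a matrix M and a normalization vector z that are updated additively from keys and values, and read out with a nonlinearity applied to the queries. The formulation below is an assumption chosen for illustration, not a statement of the specific update rule used in this work:

```latex
M_s = M_{s-1} + \sigma(K)^{\top} V, \qquad
z_s = z_{s-1} + \sum_{t} \sigma(K_t), \qquad
A_{\text{mem}} = \frac{\sigma(Q)\, M_{s-1}}{\sigma(Q)\, z_{s-1}}
```

However many segments have already been absorbed, M and z keep the same shape, so storage stays constant as the input grows.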

Such a system is trained to assimilate new information into its memory while keeping that information retrievable. However, existing LLMs have yet to adopt a compressive memory technique that is both effective and practical, balancing simplicity with quality.

To overcome these limitations, a team of researchers from Google proposed a novel solution that allows Transformer LLMs to handle arbitrarily long inputs with a bounded memory footprint and bounded computation. A key component of their approach is an attention mechanism known as Infini-attention, which combines long-term linear attention and masked local attention in a single Transformer block and incorporates compressive memory into the conventional attention process.

The main breakthrough of Infini-attention is its ability to manage memory efficiently while processing long sequences. Using compressive memory, the model can store and recall information with a fixed set of parameters, eliminating the need for memory to grow with the length of the input sequence. This keeps computational costs within reasonable limits and helps control memory consumption.
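
The sketch below (in NumPy) illustrates how such a block could be wired together: masked local attention within each segment, retrieval from a fixed-size compressive memory, a gated combination of the two, and an additive memory update. The class name, the ELU-based nonlinearity, and the scalar gate are simplifications assumed here for illustration; this is not the authors' implementation.

```python
import numpy as np

def elu_plus_one(x):
    # Nonlinearity applied to queries/keys before reading or writing the compressive memory.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class InfiniAttentionSketch:
    """Illustrative single-head sketch: local attention within a segment plus a
    fixed-size compressive memory carried across segments (names are hypothetical)."""

    def __init__(self, d, rng=None):
        rng = rng or np.random.default_rng(0)
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        self.M = np.zeros((d, d))      # compressive memory: fixed size, independent of input length
        self.z = np.full(d, 1e-6)      # normalization term for memory readout
        self.gate = 0.0                # scalar gate (learned in practice; fixed here)

    def forward_segment(self, x):
        Q, K, V = x @ self.Wq, x @ self.Wk, x @ self.Wv
        n, d = Q.shape

        # 1) Masked local (causal) dot-product attention within the current segment.
        scores = Q @ K.T / np.sqrt(d)
        scores += np.triu(np.full((n, n), -1e9), k=1)   # causal mask
        A_local = softmax(scores) @ V

        # 2) Retrieve long-range context from the compressive memory.
        sq = elu_plus_one(Q)
        A_mem = (sq @ self.M) / (sq @ self.z)[:, None]

        # 3) Mix the memory readout with local attention through a sigmoid gate.
        g = 1.0 / (1.0 + np.exp(-self.gate))
        out = g * A_mem + (1.0 - g) * A_local

        # 4) Fold this segment's keys and values into the memory (additive update).
        sk = elu_plus_one(K)
        self.M += sk.T @ V
        self.z += sk.sum(axis=0)
        return out
```

Processing an arbitrarily long input then amounts to splitting it into fixed-size segments and calling forward_segment on each in order; the memory state (M and z) never grows, regardless of how many segments are consumed.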

The team shared that this method has proven effective on a range of tasks, including long-context language modeling benchmarks, passkey context block retrieval for sequences up to 1 million tokens long, and book summarization with 500,000-token input sequences. These tasks were tackled with LLMs ranging from 1 billion to 8 billion parameters.

One of the main advantages of this approach is that its memory parameters are minimally bounded, i.e. the model's memory requirements are limited and predictable. The proposed approach also enables fast streaming inference for LLMs, allowing efficient real-time or near-real-time analysis of sequential input.

The team summarized their main contributions as follows:

  1. The team presented Infini-attention, a novel attention mechanism that combines local causal attention with long-term compressive memory. This method is both practical and efficient, as it effectively captures contextual dependencies at both short and long range.
  1. Infini-attention requires only a slight modification to the standard scaled dot-product attention mechanism (a plausible form of this modification is sketched after this list). This enables plug-and-play continual pre-training and long-context adaptation, and makes integration into existing Transformer architectures simple.
  1. The method conserves bounded memory and computing resources while allowing Transformer-based LLMs to accommodate infinitely long contexts. By processing very long inputs in streaming mode, the approach ensures efficient resource utilization and allows LLMs to perform well in real-world large-scale data applications.
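
To make the "slight modification" in the second point more concrete, one plausible form (assumed here for illustration, based on the article's description of combining masked local attention with a compressive memory) is a learned gate that mixes the memory readout with the ordinary scaled dot-product output:

```latex
A = \operatorname{sigmoid}(\beta)\, A_{\text{mem}}
  + \bigl(1 - \operatorname{sigmoid}(\beta)\bigr)\, A_{\text{dot}}
```

where A_dot is the standard masked local attention output, A_mem is the context retrieved from the compressive memory, and beta is a learned scalar, so the rest of the Transformer block is left untouched.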

In conclusion, this study is an important step forward for LLMs, allowing very long inputs to be handled efficiently in terms of computation and memory usage.


Check out the Paper. All credit for this research goes to the researchers of this project.
