MLNews

StreamingLLM: Large Language Models Can Process Infinite-Length Inputs

In a world where conversations unfold across digital channels without a natural end, there is real demand for models that can keep up. Consider Large Language Models (LLMs) at the center of multi-round dialogues, where exchanges can go on indefinitely.

Enter “StreamingLLM,” an exercise in efficiency and a gateway to effectively endless conversations. The work is a collaboration between the Massachusetts Institute of Technology, Meta AI, and Carnegie Mellon University.

StreamingLLM is about deploying large language models in real-time conversations, such as chatbots, where users can keep talking indefinitely. The researchers tackle two major issues: remembering everything stated earlier in the conversation requires a lot of memory, and long conversations are difficult for these models to handle. Their new method for addressing both is called StreamingLLM.

It enables these models to conduct very long conversations without requiring a large amount of memory or any fine-tuning. The authors tested it with several model families and found it significantly faster than previous methods in live, streaming settings. They also found that adding a dedicated “sink” token improves it even further.

Limitations of prior models

Prior limitations in applying Large Language Models (LLMs) to lengthy texts can be summarized as follows. The first is length extrapolation, which is concerned with allowing language models trained on shorter texts to handle longer ones at test time. Various research initiatives attempted this, such as Rotary Position Embeddings (RoPE) and ALiBi, but they fell short of attaining infinite-length extrapolation. Because of this constraint, no existing LLMs were suitable for streaming applications.

Prior work also tries to enlarge the context window of LLMs so they can process more tokens in a single forward pass. Various strategies, such as system-focused optimizations and approximate-attention approaches, have been investigated. However, these solutions only extend the context window to a limited degree and fail to address the core challenge of dealing with infinite inputs.

Language modeling perplexity on texts with 20K tokens across various LLMs.

A third line of prior work focuses on optimizing LLMs to efficiently capture and exploit content within lengthy contexts, rather than merely accepting them as inputs. However, making effective use of extended contexts in LLMs has remained difficult.

Introduction to the StreamingLLM model

Large Language Models (LLMs) are super-smart language tools used in a variety of applications such as chatbots, document summarization, code completion, and question answering. However, they struggle with really long conversations and documents.

Consider a chatbot that can chat for an entire day without tiring. That’s the dream! But it’s difficult for LLMs because, like students, they only have so much desk space: they can keep only a limited number of things in front of them at once. Researchers have tried to make the desks bigger by training on longer contexts, but there is a limit to how large the desks can get. So, how do we make LLMs work for endless conversations and texts?

This research looks at “LLM streaming applications”: can LLMs be made to work on extremely long inputs without slowing down? When LLMs are used on extremely long inputs, two issues arise:

  1. They must remember everything, which requires a significant amount of memory and time (a rough estimate of how quickly this memory grows follows the figure below).
  2. They begin to make mistakes once the text exceeds the length of their “desk.”

Illustration of StreamingLLM vs. existing methods
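
To put issue 1 in numbers, here is a rough back-of-envelope sketch of how the key-value (KV) cache grows during generation. It assumes a Llama-2-7B-style configuration (32 layers, 32 attention heads, head dimension 128, half-precision values); the exact figures will differ for other models.

```python
# Rough KV-cache memory estimate for a Llama-2-7B-like model (assumed shapes).
N_LAYERS = 32     # transformer layers
N_HEADS = 32      # attention heads per layer
HEAD_DIM = 128    # dimension per head
BYTES_FP16 = 2    # bytes per value in half precision

# Each cached token stores one key and one value vector per head, per layer.
bytes_per_token = N_LAYERS * N_HEADS * HEAD_DIM * 2 * BYTES_FP16  # K and V

for n_tokens in (4_000, 20_000, 4_000_000):
    gib = n_tokens * bytes_per_token / 2**30
    print(f"{n_tokens:>9,} tokens -> ~{gib:,.1f} GiB of KV cache")

# Roughly 0.5 MiB per token: a 4-million-token stream would need on the order
# of 2 TiB if nothing were ever evicted, which is why caching everything is
# not an option.
```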

One possibility is to employ a “sliding window” over the desk: a window that glides along the text so the model only sees what is currently inside it. However, as soon as the text outgrows the window, even this strategy breaks down.
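
A minimal sketch of what plain window attention does to the cache may help: it keeps only the most recent tokens and silently evicts everything older, including the very first tokens. The window size and token IDs below are made up purely for illustration.

```python
from collections import deque

def window_cache(token_stream, window_size):
    """Keep only the most recent `window_size` tokens; older ones are evicted."""
    cache = deque(maxlen=window_size)
    for tok in token_stream:
        cache.append(tok)  # once the deque is full, the oldest token is dropped
        yield list(cache)

# Illustrative run: after a few steps the initial tokens 0 and 1 are gone --
# and that is exactly the point at which window attention's quality collapses.
for step, visible in enumerate(window_cache(range(8), window_size=4)):
    print(f"step {step}: model attends to {visible}")
```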

They discovered something interesting: LLMs pay close attention to the very first tokens they see, even if those tokens are unimportant. These are referred to as “attention sinks.” It’s as if readers can’t stop glancing back at the opening pages of a book, even when they aren’t important to the plot.

As a result, they devised StreamingLLM, a simple but creative approach. Rather than building larger desks, they keep the first few “pages” (the attention sinks) and simply slide a small window along the rest of the text. This lets LLMs handle extremely long streams of 4 million tokens or more, and it is far faster than the alternatives.
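
Conceptually, the fix is tiny: keep a handful of “attention sink” tokens from the very start of the stream permanently, and roll a window over everything else. The sketch below illustrates that eviction policy; the sizes are toy values, although the paper does find that four sink tokens suffice.

```python
def streaming_cache(token_stream, n_sinks=4, recent_size=4):
    """Keep the first `n_sinks` tokens forever, plus the `recent_size` newest ones."""
    sinks, recent = [], []
    for tok in token_stream:
        if len(sinks) < n_sinks:
            sinks.append(tok)              # the earliest tokens become attention sinks
        else:
            recent.append(tok)
            if len(recent) > recent_size:  # evict only from the middle of the stream
                recent.pop(0)
        yield sinks + recent

# Unlike the plain sliding window, tokens 0-3 never leave the cache.
for step, visible in enumerate(streaming_cache(range(12))):
    print(f"step {step}: model attends to {visible}")
```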

They also found that LLMs can be pre-trained to need just one dedicated “attention sink” token, making them even better for streaming. It’s like teaching the model to park its spare attention in one agreed-upon spot instead of drifting back to the first few pages of the book.

Visualization of the average attention logits in Llama-2-7B over 256 sentences, each with a length of 16.

Future scope of StreamingLLM model

The future potential of the insights from this work is broad. It would be worthwhile to investigate how the Rolling KV Cache with Attention Sinks can be integrated seamlessly into existing LLM designs, potentially opening the door to much longer text-processing capabilities.

Furthermore, attention mechanisms and their role in LLMs remain an active research topic, with room to optimize attention computation for a variety of natural language processing applications. As LLMs become increasingly important across domains, addressing the limitations identified in this work and refining the proposed solutions could substantially improve how these models are deployed and how well they perform.

Detailed research study and code accessibility

The research paper and all of its details are available on arXiv. For anyone interested in how StreamingLLM is implemented and how it works, the code is freely available on GitHub. Both the paper and the code are public, so anyone can access them (see the References section below).

Potential applications of StreamingLLM model

In conversational AI, where lengthy dialogues and multi-round interactions are common, the introduction of attention sinks can allow chatbots and virtual assistants to maintain context throughout long conversations without significant performance loss. More natural and engaging interactions, in turn, make conversational AI systems more practical and user-friendly.

LLMs equipped with this technique for document summarization can more effectively understand and distill content from lengthy documents, boosting the quality and relevance of the generated summaries. The ability to manage substantial context in content-creation tasks, such as automated article writing or code completion, can likewise result in more coherent and contextually accurate output.

Furthermore, this approach may be useful in tasks involving large-scale dataset processing or continuously streaming text, such as real-time news analysis, financial market monitoring, and social media sentiment analysis, where LLMs can extract insights from vast, continuous streams of textual data while maintaining high performance and efficiency. Overall, the applications of this breakthrough span a wide range of sectors, furthering the use of language models in practical, data-heavy settings.

Why does window attention fail, and why are attention sinks important?

This section discusses the drawbacks of the “window attention” technique and introduces a new idea known as “attention sinks.” Window attention was designed to improve efficiency during text processing, but it falls short when it comes to maintaining the model’s performance.

The main problem emerges when the length of the text exceeds the cache capacity, causing a sharp drop in the model’s effectiveness. Surprisingly, the initial tokens, which are often far from the tokens currently being predicted, turn out to be crucial for the model’s stability.

Perplexities are evaluated on 400K tokens in the concatenated PG19 test set

To dig deeper into why the model focuses on these early tokens, the authors inspect attention maps across the model’s layers. Beyond the first two layers, the results show that the model consistently assigns heavy attention to the initial tokens in every layer. This emphasis arises from the SoftMax function at the heart of attention computation: the attention weights for each query must be positive and sum to one, so some tokens always receive nonzero weight even when nothing in the context is truly relevant.
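
The mechanism is easy to see with a toy example. SoftMax turns any vector of attention scores into weights that are all positive and sum to one, so even when nothing in the cache is a good match, the surplus probability mass has to land somewhere. The scores below are random illustrative numbers, not taken from a real model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One query's attention scores over six cached tokens; none is a strong match,
# yet SoftMax is still obliged to hand out 100% of the attention.
scores = np.array([0.2, 0.1, 0.0, 0.1, 0.0, 0.1])
weights = softmax(scores)
print(weights, weights.sum())  # the weights always sum to 1.0

# In a trained LLM, this unavoidable mass tends to pile up on the initial
# tokens, since they are visible to every later query -- the "attention sink"
# behaviour the paper observes.
```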

The notion of “attention sinks” is introduced to explain why the model favors early tokens so disproportionately, regardless of their semantic relevance. Attention sinks are tokens that soak up excess attention purely because of how SoftMax behaves. And because the initial tokens are visible to every subsequent token in autoregressive decoding, they are the natural place for this attention to collect, making them perfect candidates for sinks.

The KV cache of StreamingLLM

Furthermore, LLMs tend to use several initial tokens as attention sinks rather than just one. This is because, during pre-training, there was no consistent starting token across all input samples. The authors hypothesize that including a stable, learnable token at the start of all training samples would eliminate the need for multiple initial tokens, a hypothesis they test later in the paper.

In response to these findings, the researchers propose a “Rolling KV Cache with Attention Sinks” as a way to enable LLM streaming without any model fine-tuning. The method keeps the key-value (KV) states of the initial tokens alongside a rolling cache of the most recent tokens in the attention computation, which stabilizes the attention mechanism.
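
In terms of actual tensors, the policy boils down to slicing the KV cache along the sequence dimension whenever it overflows: keep the sink positions, keep the most recent positions, drop the middle. The PyTorch sketch below is an illustration under assumed shapes and names, not the authors’ implementation.

```python
import torch

def evict_kv(keys, values, start_size=4, recent_size=1020):
    """
    keys / values: cached states of shape [batch, heads, seq_len, head_dim].
    Keep the first `start_size` positions (attention sinks) plus the last
    `recent_size` positions; everything in between is dropped.
    """
    seq_len = keys.size(2)
    if seq_len <= start_size + recent_size:
        return keys, values  # cache not yet full, nothing to evict
    keep = lambda t: torch.cat([t[:, :, :start_size], t[:, :, -recent_size:]], dim=2)
    return keep(keys), keep(values)

# Example: a cache holding 2048 positions shrinks back to 4 + 1020 = 1024.
k = torch.randn(1, 32, 2048, 128)
v = torch.randn(1, 32, 2048, 128)
k, v = evict_kv(k, v)
print(k.shape)  # torch.Size([1, 32, 1024, 128])
```

One detail the paper adds on top of this: positional information is assigned according to a token’s position inside the cache rather than in the original text, which keeps relative encodings such as RoPE and ALiBi consistent after eviction.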

Comparing models pre-trained with (left) and without (right) a sink token.

They also advocate a modification to the pre-training procedure for future LLMs: training models with a dedicated “sink token” included in all training samples. According to their experiments, this change considerably stabilizes the attention mechanism and improves the model’s performance in streaming deployments.
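
A minimal sketch of what that change to the data pipeline could look like: prepend a single learnable placeholder token to every training sequence so the model always has the same place to park spare attention. The token ID and tensor handling here are assumptions made for illustration.

```python
import torch

SINK_TOKEN_ID = 0  # assumed ID for a dedicated, learnable placeholder token

def prepend_sink(batch_token_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the sink token to every sequence in a [batch, seq_len] id tensor."""
    batch_size = batch_token_ids.size(0)
    sink = torch.full((batch_size, 1), SINK_TOKEN_ID, dtype=batch_token_ids.dtype)
    return torch.cat([sink, batch_token_ids], dim=1)

batch = torch.randint(5, 1000, (2, 8))  # stand-in for tokenized training samples
print(prepend_sink(batch)[:, :3])       # every row now starts with the sink ID
```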

Experimental results and pre-training with a sink token

The research paper compares StreamingLLM across four current model families: Llama-2, MPT, Pythia, and Falcon, which use different position encoding approaches. The goal is to evaluate StreamingLLM’s language modeling performance on long texts as well as its ability to handle streaming question-answering tasks. Notably, StreamingLLM is compared against established baselines such as dense attention and window attention. The results show that StreamingLLM outperforms these baselines in terms of perplexity, even on extremely long texts of more than 4 million tokens.

Furthermore, the study shows that including a sink token during pre-training has no negative impact on model convergence or on performance across key NLP benchmarks. This implies that StreamingLLM can handle streaming question-answering workloads effectively and complements context-extension approaches, whose larger context windows simply raise StreamingLLM’s maximum cache size.

Comparison of per-token decoding latency and memory usage between the sliding window approach with re-computation baseline and StreamingLLM.

Furthermore, the study runs ablations to establish the appropriate number of initial tokens to keep as attention sinks, concluding that four are sufficient. Experiments with cache size show that a larger cache does not always lead to lower perplexity, pointing to room for future work on making fuller use of extensive contexts.

Finally, StreamingLLM’s efficiency is compared against a sliding window with re-computation baseline, showing that StreamingLLM reduces per-token decoding latency by up to 22.2x while keeping a memory footprint comparable to the baseline. These findings illustrate StreamingLLM’s practicality and efficiency for long texts and streaming NLP workloads.

Conclusion

Deploying LLMs in streaming applications is important, but it is fraught with difficulties due to efficiency constraints and degraded performance on longer texts. Window attention offers a partial solution, yet its performance collapses once the initial tokens are evicted. Recognizing the role of these tokens as “attention sinks,” the researchers developed StreamingLLM, a simple and efficient framework that allows LLMs to handle effectively unlimited text without fine-tuning.

By combining attention sinks with the most recent tokens, StreamingLLM can accurately model texts of up to 4 million tokens. The authors also demonstrate that pre-training models with a dedicated sink token further improves streaming performance. Most importantly, StreamingLLM is the first approach to decouple an LLM’s pre-training window size from its actual text-generation length, paving the way for streaming deployment of LLMs.

References

Code: https://github.com/mit-han-lab/streaming-llm

Paper: https://browse.arxiv.org/pdf/2309.17453v1.pdf

