CMU Researchers Introduce Sequoia: A Scalable, Robust, and Informed Algorithm for Speculative Decoding


Written By Adarsh Shankar Jha

Serving LLMs efficiently becomes more critical as large language models (LLMs) are deployed more widely. Speeding up LLM inference is difficult because generating each new token requires reading all of the model's parameters, so decoding is bound by memory I/O rather than compute. This I/O bottleneck leaves the hardware underutilized in production. Offloading-based inference and small-batch serving setups exacerbate the problem: on current GPUs, generating a single token can take about as long as processing a prompt containing hundreds or thousands of tokens.
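
As a rough back-of-the-envelope illustration of why decoding is I/O-bound (the numbers below are assumptions for illustration, not figures from the paper):

```python
# Rough, illustrative arithmetic (assumed numbers, not from the paper):
# decoding is memory-bandwidth bound because every generated token requires
# streaming all model weights from GPU memory (or host memory when offloading).

params = 70e9              # Llama2-70B parameter count
bytes_per_param = 2        # fp16
weight_bytes = params * bytes_per_param          # ~140 GB of weights

hbm_bandwidth = 2.0e12     # ~2 TB/s (A100-class HBM, assumed)
pcie_bandwidth = 32e9      # ~32 GB/s (PCIe offloading path, assumed)

# Lower bound on per-token latency: the time just to read the weights once.
print(f"on-GPU   : {weight_bytes / hbm_bandwidth * 1e3:.0f} ms/token")
print(f"offloaded: {weight_bytes / pcie_bandwidth:.1f} s/token")
```

Under these assumed numbers, each decoded token costs tens of milliseconds on-GPU and several seconds when weights are offloaded, even before any computation is counted.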

Recent research addresses this difficulty with speculative decoding, which speeds up LLM inference while keeping the LLM's output distribution intact. These approaches use one or more small draft models to predict the LLM's output. The predictions are organized in a token tree, where each node represents a candidate sequence of speculated tokens. A single forward pass of the LLM then verifies the correctness of all these speculated tokens simultaneously. Using a token tree instead of a single sequence increases the number of tokens the LLM can accept per step, since several alternatives are offered for each token slot.
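
To make the tree-verification idea concrete, here is a minimal toy sketch. It is not the paper's algorithm: the "models" are plain Python callables, and acceptance is greedy rather than the stochastic verification that preserves the target distribution; in a real system all tree nodes are scored in one batched forward pass.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    token: int
    children: list = field(default_factory=list)   # alternative continuations

def verify_tree(prefix, root, target_next_token):
    """Walk a speculated token tree and return the tokens the target accepts.
    `target_next_token(seq)` stands in for one slot of the single batched
    forward pass a real system would run over all tree nodes at once."""
    accepted, node = [], root
    while True:
        want = target_next_token(prefix + accepted)   # target's choice here
        match = next((c for c in node.children if c.token == want), None)
        if match is None:
            # No speculated branch matches: emit the target's token and stop.
            accepted.append(want)
            return accepted
        accepted.append(match.token)                  # speculation accepted
        node = match

# Toy usage: the draft proposed two alternatives (7 and 8) after token 5,
# and a further continuation 9 below 7.
tree = Node(token=-1, children=[Node(7, [Node(9)]), Node(8)])
print(verify_tree([5], tree, lambda seq: 7 if seq == [5] else 9))
# -> [7, 9, 9]: the speculated 7 and 9 are accepted, then the target's own token is appended
```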

The researchers show that existing tree-based speculative decoding approaches have significant limitations, despite extensive work on the subject. First, they find that current algorithms for constructing token trees work well for very small trees but fall short for very large ones. Second, existing token tree sampling and verification algorithms do not perform robustly across different configurations of inference hyperparameters. Finally, current systems do not optimize the size and structure of the speculated tree for the hardware they run on. For large speculated trees, the assumption in existing speedup models for speculative decoding that verification time is constant no longer holds, making those models unsuitable for determining optimal tree sizes.

Researchers from Carnegie Mellon University, Meta AI, Together AI, and Yandex formulate token tree construction as a constrained optimization problem and use a dynamic programming method to find the optimal tree structure. Token generation with this tree structure is shown to be unbounded, scaling roughly logarithmically with tree size, both in theory and in practice. According to the team, it is also important to develop a tree sampling and verification method that works well across different inference hyperparameters, does not keep resampling the same wrong tokens, and preserves the target model's output distribution. To this end, they use a sampling strategy that draws from the draft model without replacement, which prevents the draft model from making the same mistake twice while keeping the target model's output distribution intact. This approach builds on the SpecInfer algorithm. They provide empirical evidence that the new sampling and verification method achieves high acceptance rates at both high and low temperatures.
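
As an illustration of the tree-construction objective, here is a simplified sketch, not the paper's exact dynamic program. It assumes a single acceptance-probability vector `p[i]` for the i-th ranked draft candidate at any node (a position-independent simplification). A node's value is then the product of acceptance probabilities along its path from the root, and a best-first expansion over that score builds a high-value tree under a node budget:

```python
import heapq

def build_tree(p, budget):
    """Best-first construction of a speculation tree (simplified sketch).

    p[i]   : assumed probability that the target accepts a node's (i+1)-th
             ranked draft candidate, independent of position.
    budget : maximum number of non-root nodes in the tree.

    Returns the chosen nodes as (path_probability, parent_id, rank) and the
    expected number of accepted tokens for the resulting tree.
    """
    # Each heap entry proposes adding the `rank`-th child under node `parent`.
    heap = [(-p[0], 0, 0)]          # (negative score, parent id, child rank)
    nodes, expected = [], 0.0
    while heap and len(nodes) < budget:
        neg_score, parent, rank = heapq.heappop(heap)
        score = -neg_score
        node_id = len(nodes) + 1    # id 0 is the root
        nodes.append((score, parent, rank))
        expected += score           # node contributes its path probability
        # New candidate expansions: this node's first child, and the parent's
        # next-ranked child. Scores only ever shrink, so best-first is safe.
        heapq.heappush(heap, (-(score * p[0]), node_id, 0))
        if rank + 1 < len(p):
            heapq.heappush(heap, (-(score / p[rank] * p[rank + 1]), parent, rank + 1))
    return nodes, expected

# Toy acceptance probabilities for the top-4 ranked draft candidates.
tree, exp_tokens = build_tree([0.7, 0.2, 0.05, 0.02], budget=16)
print(f"expected accepted tokens per step: {exp_tokens:.2f}")
```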

The team further presents a hardware-aware tree optimizer that measures the hardware-dependent relationship between the number of tokens verified and verification time, and then uses this relationship to determine the best tree size and depth, addressing the final obstacle. Their findings show that this strategy outperforms hardware-agnostic approaches in terms of speed.
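
A hedged sketch of what such an optimizer might look like; the profiling hook and cost model below are assumptions for illustration, not the paper's implementation. The idea: measure how long the target model takes to verify n tree tokens in one forward pass, then pick the (tree size, depth) pair that maximizes expected generated tokens per second.

```python
import time

def profile_verify_time(verify_fn, sizes, repeats=10):
    """Measure average wall-clock time to verify `n` tree tokens at once.
    `verify_fn(n)` is a hypothetical stand-in for running the target model
    over a tree of n nodes; it is not a real library call."""
    timings = {}
    for n in sizes:
        start = time.perf_counter()
        for _ in range(repeats):
            verify_fn(n)
        timings[n] = (time.perf_counter() - start) / repeats
    return timings

def pick_tree_shape(expected_accepted, verify_time, draft_time_per_level):
    """expected_accepted[(size, depth)] : expected tokens accepted per step
       verify_time[size]                : measured target verification time (s)
       draft_time_per_level             : assumed draft-model cost per tree level (s)
    Returns the (size, depth) that maximizes generated tokens per second."""
    def tokens_per_second(shape):
        size, depth = shape
        step_time = verify_time[size] + depth * draft_time_per_level
        return expected_accepted[shape] / step_time
    return max(expected_accepted, key=tokens_per_second)

# Example with made-up measurements:
vt = {64: 0.030, 128: 0.034, 256: 0.050}            # seconds per verification pass
ea = {(64, 6): 3.1, (128, 8): 3.8, (256, 10): 4.2}  # expected accepted tokens
print(pick_tree_shape(ea, vt, draft_time_per_level=0.002))  # -> (128, 8)
```

The design point this illustrates is that a bigger tree is not always better: verification time grows with tree size on real hardware, so the optimal shape depends on the measured timing curve rather than on the acceptance rates alone.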

They demonstrate Sequoia's effectiveness with comprehensive end-to-end experiments and ablation studies. The Sequoia implementation uses CUDA Graphs and is built on top of Hugging Face Transformers (and Accelerate). Sequoia achieves speedups of up to 4.04× for Llama2-7B on a single A100 GPU and up to 10.33× for Llama2-70B in offloading mode on an L40 GPU. In addition, the ablation studies show that:

  1. The Sequoia tree structure is more scalable than k independent sequences (for tree sizes up to 512), generating up to 33% more tokens per decoding step.
  2. The Sequoia sampling and verification algorithm is robust to temperature and top-p, providing speedups of up to 65% and 27% over the SpecInfer and top-k sampling and verification algorithms, respectively.
  3. Sequoia’s hardware-aware tree optimizer can automatically determine the optimal tree size and depth for different hardware.

Check out the Paper. All credit for this research goes to the researchers of this project.




Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and developments in today’s evolving world that make everyone’s life easier.

