University of Washington researchers present Fiddler: A Resource-Efficient Inference Engine for LLMs with CPU-GPU Orchestration

Written By Adarsh Shankar Jha

Mixture of Experts (MoE) models have revolutionized artificial intelligence by dynamically routing tasks to specialized components within larger models. However, a major challenge for the adoption of MoE models is their deployment in environments with limited computational resources. The sheer size of these models often exceeds the memory capacity of standard GPUs, making them difficult to run in low-resource settings. This limitation reduces the models' usefulness and challenges researchers and developers who aim to leverage MoE models for complex computational tasks without access to high-end hardware.

Existing methods for running MoE models in constrained environments typically involve offloading part of the model's parameters to CPU memory. While this approach works around GPU memory limitations, it introduces significant latency because expert weights must be transferred between the CPU and GPU on demand. Furthermore, state-of-the-art MoE models often use non-ReLU activation functions such as SiLU, which makes it difficult to apply sparsity-exploitation strategies directly: pruning activation channels that are not close enough to zero can degrade model quality, so exploiting sparsity requires a more sophisticated approach.
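
To see why transferring expert weights dominates latency, consider a back-of-envelope comparison; all bandwidth and throughput numbers below are illustrative assumptions, not measurements from the paper:

```python
# Back-of-envelope latency comparison for one MoE expert call on one token.
# All numbers are illustrative assumptions, not figures from the paper.

EXPERT_PARAMS = 176e6    # ~176M parameters per Mixtral expert (3 x 4096 x 14336)
BYTES_PER_PARAM = 2      # fp16
PCIE_BW = 16e9           # assumed ~16 GB/s effective PCIe bandwidth
CPU_GFLOPS = 200e9       # assumed sustained CPU matrix-vector throughput

# Option A: copy the expert's weights to the GPU, then compute there.
transfer_s = EXPERT_PARAMS * BYTES_PER_PARAM / PCIE_BW   # ~22 ms just to move weights

# Option B: keep weights on the CPU and run the expert there.
# One token's expert forward pass costs roughly 2 FLOPs per parameter.
cpu_compute_s = 2 * EXPERT_PARAMS / CPU_GFLOPS           # ~1.8 ms of CPU compute

print(f"weight transfer: {transfer_s * 1e3:.1f} ms, "
      f"CPU compute: {cpu_compute_s * 1e3:.1f} ms")
```

Under these assumptions, computing a single token's expert output on the CPU is an order of magnitude cheaper than shipping the expert's weights across the bus, which is the intuition behind Fiddler's design.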


A team of researchers from the University of Washington presents Fiddler, an innovative solution designed to optimize the inference of MoE models by efficiently orchestrating CPU and GPU resources. Fiddler minimizes data transfer overhead by executing expert layers directly on the CPU, so only small activation tensors, rather than large expert weights, move between the CPU and GPU. This approach addresses the limitations of existing methods and makes running large MoE models in resource-constrained environments far more practical.
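
A minimal sketch of this orchestration idea, assuming a per-token top-k router; the names (`experts_gpu`, `experts_cpu`, `router`) are hypothetical, and the authors' actual implementation is available in their GitHub repository:

```python
import torch

def moe_layer_forward(x_gpu, experts_gpu, experts_cpu, router):
    """Sketch of Fiddler-style CPU-GPU orchestration for one MoE layer.

    experts_gpu: dict {expert_id: module on GPU} for experts resident in GPU memory
    experts_cpu: dict {expert_id: module on CPU} for the remaining experts
    Only small activation tensors cross the PCIe bus, never expert weights.
    """
    topk_ids, topk_weights = router(x_gpu)           # choose experts for this token
    out = torch.zeros_like(x_gpu)
    for eid, w in zip(topk_ids.tolist(), topk_weights.unbind(-1)):
        if eid in experts_gpu:
            out += w * experts_gpu[eid](x_gpu)       # expert runs on the GPU
        else:
            x_cpu = x_gpu.to("cpu")                  # move activations, not weights
            y = experts_cpu[eid](x_cpu)              # expert runs on the CPU
            out += w * y.to(x_gpu.device)            # copy the small result back
    return out
```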


Fiddler distinguishes itself by leveraging the computational capabilities of the CPU for expert-layer processing while minimizing the volume of data transferred between the CPU and GPU. This methodology drastically reduces CPU-GPU communication latency, allowing the system to run large MoE models, such as the uncompressed Mixtral-8x7B whose parameters occupy more than 90 GB, efficiently on a single GPU with limited memory. Fiddler's design represents a notable technical advance in MoE model serving.
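
Rough memory arithmetic, assuming fp16 weights and approximate Mixtral-8x7B parameter counts (an illustration, not figures from the paper), shows why the full model cannot reside on a 24 GB GPU and how few experts fit:

```python
# Rough fp16 memory budget for Mixtral-8x7B on a 24 GB GPU.
# Parameter counts are approximate; this is an illustration, not paper data.

n_layers, n_experts = 32, 8
expert_params = 176e6        # per-expert FFN parameters
non_expert_params = 2e9      # attention, embeddings, etc. (rough estimate)

total_gb = (n_layers * n_experts * expert_params + non_expert_params) * 2 / 1e9
print(f"full model: ~{total_gb:.0f} GB fp16")   # ~94 GB, far above 24 GB

gpu_budget_gb = 24 - 4       # leave headroom for activations and the KV cache
experts_on_gpu = int(gpu_budget_gb * 1e9 / (expert_params * 2))
print(f"experts that fit on GPU: ~{experts_on_gpu} of {n_layers * n_experts}")
```

Most experts therefore necessarily live in CPU memory, which is why avoiding weight transfers matters so much.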


Fiddler's effectiveness is underscored by its performance metrics, which show an order-of-magnitude improvement over traditional offloading methods. Performance is measured in tokens generated per second. In tests, Fiddler ran the uncompressed Mixtral-8x7B model at more than three tokens per second on a single 24 GB GPU. Throughput improves with longer output lengths for the same input length, because the latency of the prefill stage is amortized over more generated tokens. On average, Fiddler is 8.2 to 10.1 times faster than the method of Eliseev and Mazur and 19.4 to 22.5 times faster than DeepSpeed-MII, depending on the environment.
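
The prefill-amortization effect can be written down directly; the latency values below are illustrative assumptions, not the paper's measurements:

```python
# End-to-end tokens/sec as a function of output length.
# t_prefill and t_decode are assumed, illustrative values in seconds.

def tokens_per_second(n_out, t_prefill=10.0, t_decode=0.3):
    """The one-time prefill cost is spread over all generated tokens."""
    return n_out / (t_prefill + n_out * t_decode)

for n_out in (16, 64, 256):
    print(n_out, f"{tokens_per_second(n_out):.2f} tok/s")
# Longer outputs approach the decode-only rate of 1 / t_decode ≈ 3.3 tok/s.
```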

In conclusion, Fiddler represents a significant step forward for efficient inference of MoE models in environments with limited computational resources. By intelligently dividing model execution between the CPU and GPU, Fiddler overcomes the challenges faced by traditional offloading methods, offering a practical solution that improves the accessibility of advanced MoE models. This advance could help democratize large-scale artificial intelligence models, paving the way for broader applications and research in artificial intelligence.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.




