University of Washington researchers present Fiddler: A Resource-Efficient Inference Engine for LLMs with CPU-GPU Orchestration

Written By Adarsh Shankar Jha

Mixture of Experts (MoE) models have revolutionized artificial intelligence by dynamically routing tasks to specialized components within larger models. However, a major challenge for the adoption of MoE models is their deployment in environments with limited computational resources. The sheer size of these models often exceeds the memory capacity of standard GPUs, making them difficult to run in low-resource settings. This limitation reduces the models' usefulness and challenges researchers and developers who aim to leverage MoE models for complex computational tasks without access to high-end hardware.

Existing methods for running MoE models in constrained environments typically involve offloading part of the model's parameters to CPU memory. While this approach works around GPU memory limitations, it introduces significant latency because expert weights must be transferred between the CPU and GPU on demand. Furthermore, state-of-the-art MoE models often use non-ReLU activation functions such as SiLU, which makes it difficult to apply sparsity-exploitation strategies directly: pruning activation channels that are not close enough to zero can degrade model quality, so exploiting sparsity requires a more sophisticated approach.
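
To see why transferring expert weights dominates latency, consider a back-of-envelope comparison; all bandwidth and throughput numbers below are illustrative assumptions, not measurements from the paper:

```python
# Back-of-envelope latency comparison for one MoE expert call on one token.
# All numbers are illustrative assumptions, not figures from the paper.

EXPERT_PARAMS = 176e6    # ~176M parameters per Mixtral expert (3 x 4096 x 14336)
BYTES_PER_PARAM = 2      # fp16
PCIE_BW = 16e9           # assumed ~16 GB/s effective PCIe bandwidth
CPU_GFLOPS = 200e9       # assumed sustained CPU matrix-vector throughput

# Option A: copy the expert's weights to the GPU, then compute there.
transfer_s = EXPERT_PARAMS * BYTES_PER_PARAM / PCIE_BW   # ~22 ms just to move weights

# Option B: keep weights on the CPU and run the expert there.
# One token's expert forward pass costs roughly 2 FLOPs per parameter.
cpu_compute_s = 2 * EXPERT_PARAMS / CPU_GFLOPS           # ~1.8 ms of CPU compute

print(f"weight transfer: {transfer_s * 1e3:.1f} ms, "
      f"CPU compute: {cpu_compute_s * 1e3:.1f} ms")
```

Under these assumptions, computing a single token's expert output on the CPU is an order of magnitude cheaper than shipping the expert's weights across the bus, which is the intuition behind Fiddler's design.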


A team of researchers from the University of Washington presents Fiddler, an innovative solution designed to optimize the inference of MoE models by efficiently orchestrating CPU and GPU resources. Fiddler minimizes data transfer overhead by executing expert layers directly on the CPU, so only small activation tensors, rather than large expert weights, move between the CPU and GPU. This approach addresses the limitations of existing methods and makes running large MoE models in resource-constrained environments far more practical.
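
A minimal sketch of this orchestration idea, assuming a per-token top-k router; the names (`experts_gpu`, `experts_cpu`, `router`) are hypothetical, and the authors' actual implementation is available in their GitHub repository:

```python
import torch

def moe_layer_forward(x_gpu, experts_gpu, experts_cpu, router):
    """Sketch of Fiddler-style CPU-GPU orchestration for one MoE layer.

    experts_gpu: dict {expert_id: module on GPU} for experts resident in GPU memory
    experts_cpu: dict {expert_id: module on CPU} for the remaining experts
    Only small activation tensors cross the PCIe bus, never expert weights.
    """
    topk_ids, topk_weights = router(x_gpu)           # choose experts for this token
    out = torch.zeros_like(x_gpu)
    for eid, w in zip(topk_ids.tolist(), topk_weights.unbind(-1)):
        if eid in experts_gpu:
            out += w * experts_gpu[eid](x_gpu)       # expert runs on the GPU
        else:
            x_cpu = x_gpu.to("cpu")                  # move activations, not weights
            y = experts_cpu[eid](x_cpu)              # expert runs on the CPU
            out += w * y.to(x_gpu.device)            # copy the small result back
    return out
```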


Fiddler distinguishes itself by leveraging the computational capabilities of the CPU for expert-layer processing while minimizing the volume of data transferred between the CPU and GPU. This methodology drastically reduces CPU-GPU communication latency, allowing the system to run large MoE models, such as the uncompressed Mixtral-8x7B whose parameters occupy more than 90 GB, efficiently on a single GPU with limited memory. Fiddler's design represents a notable technical advance in MoE model serving.
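
Rough memory arithmetic, assuming fp16 weights and approximate Mixtral-8x7B parameter counts (an illustration, not figures from the paper), shows why the full model cannot reside on a 24 GB GPU and how few experts fit:

```python
# Rough fp16 memory budget for Mixtral-8x7B on a 24 GB GPU.
# Parameter counts are approximate; this is an illustration, not paper data.

n_layers, n_experts = 32, 8
expert_params = 176e6        # per-expert FFN parameters
non_expert_params = 2e9      # attention, embeddings, etc. (rough estimate)

total_gb = (n_layers * n_experts * expert_params + non_expert_params) * 2 / 1e9
print(f"full model: ~{total_gb:.0f} GB fp16")   # ~94 GB, far above 24 GB

gpu_budget_gb = 24 - 4       # leave headroom for activations and the KV cache
experts_on_gpu = int(gpu_budget_gb * 1e9 / (expert_params * 2))
print(f"experts that fit on GPU: ~{experts_on_gpu} of {n_layers * n_experts}")
```

Most experts therefore necessarily live in CPU memory, which is why avoiding weight transfers matters so much.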


Fiddler's effectiveness is underscored by its performance metrics, which show an order-of-magnitude improvement over traditional offloading methods. Performance is measured in tokens generated per second. In tests, Fiddler ran the uncompressed Mixtral-8x7B model at more than three tokens per second on a single 24 GB GPU. Throughput improves with longer output lengths for the same input length, because the latency of the prefill stage is amortized over more generated tokens. On average, Fiddler is 8.2 to 10.1 times faster than the method of Eliseev and Mazur and 19.4 to 22.5 times faster than DeepSpeed-MII, depending on the environment.
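
The prefill-amortization effect can be written down directly; the latency values below are illustrative assumptions, not the paper's measurements:

```python
# End-to-end tokens/sec as a function of output length.
# t_prefill and t_decode are assumed, illustrative values in seconds.

def tokens_per_second(n_out, t_prefill=10.0, t_decode=0.3):
    """The one-time prefill cost is spread over all generated tokens."""
    return n_out / (t_prefill + n_out * t_decode)

for n_out in (16, 64, 256):
    print(n_out, f"{tokens_per_second(n_out):.2f} tok/s")
# Longer outputs approach the decode-only rate of 1 / t_decode ≈ 3.3 tok/s.
```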

In conclusion, Fiddler represents a significant step forward for efficient inference of MoE models in environments with limited computational resources. By intelligently dividing model execution between the CPU and GPU, Fiddler overcomes the challenges faced by traditional offloading methods, offering a practical solution that improves the accessibility of advanced MoE models. This advance could help democratize large-scale artificial intelligence models, paving the way for broader applications and research in artificial intelligence.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.




