arxiv:2408.15237

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Published on Aug 27
Submitted by akhaliq on Aug 28
#3 Paper of the day

Abstract

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall, we show how, with limited computational resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.
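To make the weight-reuse idea in the abstract concrete, here is a minimal PyTorch sketch of initializing a linear-RNN block from an attention layer's Q/K/V/O projections. All module and parameter names below are hypothetical and the recurrence is deliberately simplified; the paper's actual Mamba parameterization and distillation recipe differ.

```python
# Minimal sketch (assumptions, not the paper's exact method): a toy gated
# linear-RNN block whose four projections mirror attention's Q/K/V/O, plus a
# helper that copies a Llama-style attention layer's weights into it.
import torch
import torch.nn as nn


class LinearRNNLayer(nn.Module):
    """Toy linear-RNN block whose projections mirror attention's Q/K/V/O."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)  # readout, analogous to the query
        self.k_proj = nn.Linear(d_model, d_model, bias=False)  # input gate, analogous to the key
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # input, analogous to the value
        self.o_proj = nn.Linear(d_model, d_model, bias=False)  # output projection
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))  # learned per-channel state decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). A plain recurrent loop for clarity, not speed.
        bsz, seq_len, d_model = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        state = x.new_zeros(bsz, d_model)
        outputs = []
        for t in range(seq_len):
            state = self.decay * state + k[:, t] * v[:, t]  # linear recurrence over time
            outputs.append(q[:, t] * state)                 # per-step readout
        return self.o_proj(torch.stack(outputs, dim=1))


@torch.no_grad()
def init_from_attention(rnn: LinearRNNLayer, attn: nn.Module) -> None:
    """Copy an attention block's linear projections into the RNN.
    Assumes plain multi-head attention with matching shapes (grouped-query
    K/V projections would first need to be expanded)."""
    rnn.q_proj.weight.copy_(attn.q_proj.weight)
    rnn.k_proj.weight.copy_(attn.k_proj.weight)
    rnn.v_proj.weight.copy_(attn.v_proj.weight)
    rnn.o_proj.weight.copy_(attn.o_proj.weight)
```

The copy step only covers initialization; per the abstract, the resulting hybrid is then distilled from the original Transformer rather than used as-is.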

Community

Paper submitter

Funny coincidence - I was just tinkering with a similar idea yesterday. Swapped out the attention blocks in Phi-3.5-mini for RNNs. If anyone's curious, you can check out my experiment here: https://github.com/JosefAlbers/Phi-3-Vision-MLX/blob/main/assets/bytephi.py
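For anyone wanting to try the same kind of surgery in PyTorch rather than MLX, a rough structural sketch might look like the one below. The model name, the `model.model.layers[i].self_attn` attribute path, the GRU stand-in, and the keep-every-fourth-layer ratio are all assumptions for illustration; a real drop-in replacement would also have to match the attention module's exact forward signature and cache handling.

```python
# Illustrative sketch (assumptions, not the linked experiment's code): swap
# most attention blocks of a pretrained decoder for a simple RNN stand-in.
import torch.nn as nn
from transformers import AutoModelForCausalLM


class RNNBlock(nn.Module):
    """Toy stand-in for an attention block: a single GRU over the hidden states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, hidden_states, **kwargs):
        # Accept (and ignore) the extra arguments the decoder layer passes to
        # attention; adjust the return tuple to whatever your installed
        # transformers version expects from the attention module.
        out, _ = self.rnn(hidden_states)
        return out, None


model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
d_model = model.config.hidden_size

for i, layer in enumerate(model.model.layers):
    if i % 4 == 0:
        continue  # keep every fourth attention layer, echoing the paper's hybrid ratio
    layer.self_attn = RNNBlock(d_model)  # swap the attention block for the RNN stand-in
```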


Hey, amazing work :)
We've summarised this and a few other papers in our blog. Hope you like it!

  1. KTO: The infamous alignment algorithm
  2. OLMoE: Open Data, Weights, Code Mixture of Experts models
  3. Mamba in the Llama: Distilling from Transformers to Mamba
  4. PlanSearch: Improving Code Generation via Planning

https://datta0.substack.com/p/ai-unplugged-19-kto-for-model-alignment


Models citing this paper: 20


Datasets citing this paper: 1

Spaces citing this paper: 0


Collections including this paper: 8