Tulu3-RAG
Implementation of the paper *Block-Attention for Efficient Prefilling* (ICLR 2025).
We introduce Block-attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches often encode the entire context in an auto-regressive manner. Instead, Block-attention divides retrieved documents into discrete blocks, with each block independently calculating key-value (KV) states except for the final block. In RAG scenarios, by defining each passage as a block, Block-attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing the latency and the computation overhead during inference. The implementation of Block-attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-attention mechanism. Experiments on 11 diverse benchmarks, including RAG, ICL, and general domains, demonstrate that after block fine-tuning, the Block-attention model not only achieves performance comparable to that of full-attention models, but can also seamlessly switch between the block and full attention modes without any performance loss. Notably, Block-attention significantly reduces the time to first token (TTFT) and floating point operations (FLOPs) to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared to the full-attention models, the TTFT and corresponding FLOPs are reduced by 98.7% and 99.8%, respectively.
Additionally, we elaborate on how Block-attention can be applied in Game AI scenarios and the substantial potential benefits it offers there. We strongly encourage researchers in the gaming field not to overlook our work.
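To make the mechanism concrete, here is a minimal, illustrative sketch (our own toy code, not the implementation in this repository or the paper) of the attention pattern that Block-attention induces during prefilling: every retrieved passage forms a block that attends only within itself, so its KV states can be computed independently and cached, while the final block (the user query) attends to the full concatenated sequence. In the actual method, independently computed block KV states are shifted to their absolute positions via position re-encoding; the single-pass mask view below simply uses absolute positions from the start.

```python
import torch

def block_attention_mask(block_lens: list[int]) -> torch.Tensor:
    """Boolean mask of shape (L, L); True means the query position (row)
    may attend to the key position (column)."""
    total = sum(block_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for i, n in enumerate(block_lens):
        end = start + n
        if i < len(block_lens) - 1:
            # Intermediate blocks: causal attention restricted to the block itself,
            # so each block's KV states can be computed independently and reused.
            mask[start:end, start:end] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        else:
            # Final block (the user query): causal attention over everything before it.
            mask[start:end, :end] = torch.tril(torch.ones(n, end, dtype=torch.bool), diagonal=start)
        start = end
    return mask

# Three blocks with made-up toy lengths of 3, 3, and 2 tokens.
print(block_attention_mask([3, 3, 2]).int())
```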
1. Resources
Item | Repository | Train Data |
---|---|---|
Tulu3-Block-FT Model | 🤗 ldsjmdy/Tulu3-Block-FT | 💾 Google Drive |
Tulu3-SFT Model (Baseline) | 🤗 ldsjmdy/Tulu3-SFT | 💾 Google Drive |
Tulu3-RAG Model (Baseline) | 🤗 ldsjmdy/Tulu3-RAG | 💾 Google Drive |
2. Model Usage
Here is an example of how to use the Hugging Face Transformers library to perform inference with our model.
- When running in Full-Attention mode
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ldsjmdy/Tulu3-Block-FT"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
# same as allenai/Llama-3.1-Tulu-3-8B-SFT
tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'system' %}{{ '<|system|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{% if not loop.last %}{{ '<|assistant|>\n' + message['content'] + eos_token + '\n' }}{% else %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% endif %}{% if loop.last and add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}{% endfor %}"
messages = [
{"role": "user", "content": "There is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Based on the characteristic population curves that result from plotting population growth of a species, the most effective means of controlling the mosquito population is to\nA. maintain the population at a point corresponding to the midpoint of its logistic curve\nB. opt for zero population control once the K value of the curve has been reached\nC. reduce the carrying capacity cif the environment to lower the K value\nD. increase the mortality rate\nAnswer:"}
]
prompt = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
- For Block-Attention mode, please refer to the inference code example we provide: `block_generate_server.py`. Running `block_generate_server.py` starts a Flask server; you can then obtain generation results from the Block-Attention method as follows:

```bash
CUDA_VISIBLE_DEVICES=0 python3 block_generate_server.py --model <model_name> --port <port> --dtype bfloat16
```
```python
import json
import requests

blocks = [
"<|user|>\nYou are an intelligent AI assistant. Please answer questions based on the user's instructions. Below are some reference documents that may help you in answering the user's question.\n\n",
"- Title: Polish-Russian War (film)\nPolish-Russian War(Wojna polsko-ruska) is a 2009 Polish film directed by Xawery \u017bu\u0142awski based on the novel Polish-Russian War under the white-red flag by Dorota Mas\u0142owska.\n",
"- Title: Xawery \u017bu\u0142awski\nXawery \u017bu\u0142awski (born 22 December 1971 in Warsaw) is a Polish film director.In 1995 he graduated National Film School in \u0141\u00f3d\u017a.He is the son of actress Ma\u0142gorzata Braunek and director Andrzej \u017bu\u0142awski.His second feature \"Wojna polsko-ruska\" (2009), adapted from the controversial best-selling novel by Dorota Mas\u0142owska, won First Prize in the New Polish Films competition at the 9th Era New Horizons Film Festival in Wroc\u0142aw.In 2013, he stated he intends to direct a Polish novel \"Z\u0142y\" by Leopold Tyrmand.\u017bu\u0142awski and his wife Maria Strzelecka had 2 children together:son Kaj \u017bu\u0142awski (born 2002) and daughter Jagna \u017bu\u0142awska (born 2009).\n",
"- Title: Viktor Yeliseyev\nViktor Petrovich Yeliseyev( born June 9, 1950) is a Russian general, orchestra conductor and music teacher.He is the director of the Ministry of the Interior Ensemble, one of the two Russian Red Army Choirs.\n- Title: Minamoto no Chikako\nShe was the mother of Prince Morinaga.\n- Title: Alice Washburn\nAlice Washburn( 1860- 1929) was an American stage and film actress.She worked at the Edison, Vitagraph and Kalem studios.Her final film Snow White was her only known feature film.She died of heart attack in November 1929.\n",
"Please write a high-quality answer for the given question using only the provided search documents (some of which might be irrelevant).\nQuestion: Who is the mother of the director of film Polish-Russian War (Film)?\n<|assistant|>\n"
]
r = requests.post(
    url="<server url>",
    data=json.dumps({"blocks": blocks}),
    headers={"Content-Type": "application/json"}
)
print(r.json()["generated"])
```
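Because each entry in `blocks` is encoded as an independent block, a follow-up request that shares blocks with an earlier request only needs fresh prefilling for the blocks the server has not seen before (typically just the final question block). Whether previously computed block KV states are actually cached and reused across requests depends on `block_generate_server.py`; the snippet below is only a hypothetical follow-up question illustrating the intended usage pattern, reusing the `blocks` list from above.

```python
# Hypothetical follow-up request: the system-prompt and document blocks are unchanged,
# so only the final (question) block differs from the previous request.
followup_blocks = blocks[:-1] + [
    "Please write a high-quality answer for the given question using only the provided "
    "search documents (some of which might be irrelevant).\n"
    "Question: Who directed the film Polish-Russian War (Film)?\n<|assistant|>\n"
]
r = requests.post(
    url="<server url>",
    data=json.dumps({"blocks": followup_blocks}),
    headers={"Content-Type": "application/json"}
)
print(r.json()["generated"])
```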
3. Performance
Accuracy of different models on four RAG benchmarks: 2WikiMultiHopQA (2wiki), HotpotQA (HQA), Natural Questions (NQ), and TriviaQA (TQA).
Models | 2wiki | HQA | NQ | TQA |
---|---|---|---|---|
Tulu3-SFT | 62.0 | 68.4 | 58.6 | 75.7 |
Tulu3-RAG | 73.2 | 74.8 | 61.5 | 75.8 |
Tulu3-RAG-Superposition | 30.1 | 32.3 | 35.9 | 58.9 |
Tulu3-RAG-promptCache | 32.4 | 31.6 | 44.4 | 61.8 |
Tulu3-block-ft | 72.2 | 72.3 | 60.4 | 75.1 |
Tulu3-block-ft-full | 73.6 | 75.2 | 62.2 | 76.2 |
Tulu3-block-ft-w/o-pos | 68.9 | 69.9 | 59.2 | 74.4 |
Tulu3-block-w/o-ft | 42.9 | 42.1 | 48.3 | 66.5 |
Accuracy of different models on seven general benchmarks.
Models | IFEval (0-shot) | HumanEval (0-shot) | MMLU (0-shot) | GSM8K (4-shot) | MATH (4-shot) | BBH (3-shot) | DROP (3-shot) |
---|---|---|---|---|---|---|---|
Tulu3-SFT | 68.5 | 58.5 | 63.7 | 75.5 | 29.2 | 68.5 | 9.4 |
Tulu3-RAG | 68.3 | 65.2 | 63.6 | 75.6 | 28.6 | 68.5 | 10.4 |
Tulu3-block-ft | 70.0 | 59.1 | 63.0 | 75.7 | 28.8 | 65.3 | 14.4 |
For the first three zero-shot benchmarks, Block-attention falls back to full-attention. For the subsequent four few-shot ICL datasets, each in-context example is treated as an independent block; a k-shot sample is therefore divided into k + 1 blocks, as in the toy example below.
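As a toy illustration (the contents and formatting below are made up and do not come from the evaluation code), a 2-shot prompt would be sent as 2 + 1 = 3 blocks:

```python
# Toy 2-shot ICL prompt split into 2 + 1 = 3 blocks: one block per in-context
# example plus a final block holding the test question (contents are invented).
icl_blocks = [
    "<|user|>\nQ: Paris is the capital of which country?\nA: France\n\n",  # shot 1 -> block 1
    "Q: Rome is the capital of which country?\nA: Italy\n\n",              # shot 2 -> block 2
    "Q: Madrid is the capital of which country?\nA:\n<|assistant|>\n",     # query  -> final block
]
```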
4. Citation
If you find Block-Attention useful for your research, please cite our paper:
```bibtex
@inproceedings{ma2025blockattention,
  title     = {Block-Attention for Efficient Prefilling},
  author    = {Dongyang Ma and Yan Wang and Tian Lan},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=7zNYY1E2fq}
}
```