Tulu3-RAG
Implementation of the paper *Block-Attention for Efficient Prefilling* (ICLR 2025).
We introduce Block-attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches often encode the entire context in an auto-regressive manner. Instead, Block-attention divides retrieved documents into discrete blocks, with each block independently calculating key-value (KV) states except for the final block. In RAG scenarios, by defining each passage as a block, Block-attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing the latency and the computation overhead during inference. The implementation of Block-attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-attention mechanism. Experiments on 11 diverse benchmarks, including RAG, ICL, and general domains, demonstrate that after block fine-tuning, the Block-attention model not only achieves performance comparable to that of full-attention models, but can also seamlessly switch between the block and full attention modes without any performance loss. Notably, Block-attention significantly reduces the time to first token (TTFT) and floating point operations (FLOPs) to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared to the full-attention models, the TTFT and corresponding FLOPs are reduced by 98.7% and 99.8%, respectively.
Additionally, we elaborate on how Block-attention can be applied in Game AI scenarios and the substantial potential benefits it offers there. We strongly encourage researchers in the gaming field not to overlook our work.
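To make the mechanism concrete, here is a minimal, illustrative sketch (our own toy code, not the implementation in this repository or the paper) of the attention pattern that Block-attention induces during prefilling: every retrieved passage forms a block that attends only within itself, so its KV states can be computed independently and cached, while the final block (the user query) attends to the full concatenated sequence. In the actual method, independently computed block KV states are shifted to their absolute positions via position re-encoding; the single-pass mask view below simply uses absolute positions from the start.

```python
import torch

def block_attention_mask(block_lens: list[int]) -> torch.Tensor:
    """Boolean mask of shape (L, L); True means the query position (row)
    may attend to the key position (column)."""
    total = sum(block_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for i, n in enumerate(block_lens):
        end = start + n
        if i < len(block_lens) - 1:
            # Intermediate blocks: causal attention restricted to the block itself,
            # so each block's KV states can be computed independently and reused.
            mask[start:end, start:end] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        else:
            # Final block (the user query): causal attention over everything before it.
            mask[start:end, :end] = torch.tril(torch.ones(n, end, dtype=torch.bool), diagonal=start)
        start = end
    return mask

# Three blocks with made-up toy lengths of 3, 3, and 2 tokens.
print(block_attention_mask([3, 3, 2]).int())
```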
1. Resources
Item | Repository | Train Data |
---|---|---|
Tulu3-Block-FT Model | 🤗 ldsjmdy/Tulu3-Block-FT | 💾 Google Drive |
Tulu3-SFT Model (Baseline) | 🤗 ldsjmdy/Tulu3-SFT | 💾 Google Drive |
Tulu3-RAG Model (Baseline) | 🤗 ldsjmdy/Tulu3-RAG | 💾 Google Drive |
2. Model Usage
Here is an example of how to use the Hugging Face Transformers library to perform inference with our model.
- When running in Full-Attention mode
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ldsjmdy/Tulu3-Block-FT"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
# same as allenai/Llama-3.1-Tulu-3-8B-SFT
tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'system' %}{{ '<|system|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{% if not loop.last %}{{ '<|assistant|>\n' + message['content'] + eos_token + '\n' }}{% else %}{{ '<|assistant|>\n' + message['content'] + eos_token }}{% endif %}{% endif %}{% if loop.last and add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}{% endfor %}"
messages = [
{"role": "user", "content": "There is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: Based on the characteristic population curves that result from plotting population growth of a species, the most effective means of controlling the mosquito population is to\nA. maintain the population at a point corresponding to the midpoint of its logistic curve\nB. opt for zero population control once the K value of the curve has been reached\nC. reduce the carrying capacity cif the environment to lower the K value\nD. increase the mortality rate\nAnswer:"}
]
prompt = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
- For Block-Attention mode, please refer to the inference code example we provide: `block_generate_server.py`. Running `block_generate_server.py` starts a Flask server; you can then obtain generation results from the Block-Attention method as follows:

```bash
CUDA_VISIBLE_DEVICES=0 python3 block_generate_server.py --model <model_name> --port <port> --dtype bfloat16
```
```python
import json
import requests

blocks = [
"<|user|>\nYou are an intelligent AI assistant. Please answer questions based on the user's instructions. Below are some reference documents that may help you in answering the user's question.\n\n",
"- Title: Polish-Russian War (film)\nPolish-Russian War(Wojna polsko-ruska) is a 2009 Polish film directed by Xawery \u017bu\u0142awski based on the novel Polish-Russian War under the white-red flag by Dorota Mas\u0142owska.\n",
"- Title: Xawery \u017bu\u0142awski\nXawery \u017bu\u0142awski (born 22 December 1971 in Warsaw) is a Polish film director.In 1995 he graduated National Film School in \u0141\u00f3d\u017a.He is the son of actress Ma\u0142gorzata Braunek and director Andrzej \u017bu\u0142awski.His second feature \"Wojna polsko-ruska\" (2009), adapted from the controversial best-selling novel by Dorota Mas\u0142owska, won First Prize in the New Polish Films competition at the 9th Era New Horizons Film Festival in Wroc\u0142aw.In 2013, he stated he intends to direct a Polish novel \"Z\u0142y\" by Leopold Tyrmand.\u017bu\u0142awski and his wife Maria Strzelecka had 2 children together:son Kaj \u017bu\u0142awski (born 2002) and daughter Jagna \u017bu\u0142awska (born 2009).\n",
"- Title: Viktor Yeliseyev\nViktor Petrovich Yeliseyev( born June 9, 1950) is a Russian general, orchestra conductor and music teacher.He is the director of the Ministry of the Interior Ensemble, one of the two Russian Red Army Choirs.\n- Title: Minamoto no Chikako\nShe was the mother of Prince Morinaga.\n- Title: Alice Washburn\nAlice Washburn( 1860- 1929) was an American stage and film actress.She worked at the Edison, Vitagraph and Kalem studios.Her final film Snow White was her only known feature film.She died of heart attack in November 1929.\n",
"Please write a high-quality answer for the given question using only the provided search documents (some of which might be irrelevant).\nQuestion: Who is the mother of the director of film Polish-Russian War (Film)?\n<|assistant|>\n"
]
r = requests.post(
    url="<server url>",
    data=json.dumps({"blocks": blocks}),
    headers={"Content-Type": "application/json"}
)
print(r.json()["generated"])
```
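Because each entry in `blocks` is encoded as an independent block, a follow-up request that shares blocks with an earlier request only needs fresh prefilling for the blocks the server has not seen before (typically just the final question block). Whether previously computed block KV states are actually cached and reused across requests depends on `block_generate_server.py`; the snippet below is only a hypothetical follow-up question illustrating the intended usage pattern, reusing the `blocks` list from above.

```python
# Hypothetical follow-up request: the system-prompt and document blocks are unchanged,
# so only the final (question) block differs from the previous request.
followup_blocks = blocks[:-1] + [
    "Please write a high-quality answer for the given question using only the provided "
    "search documents (some of which might be irrelevant).\n"
    "Question: Who directed the film Polish-Russian War (Film)?\n<|assistant|>\n"
]
r = requests.post(
    url="<server url>",
    data=json.dumps({"blocks": followup_blocks}),
    headers={"Content-Type": "application/json"}
)
print(r.json()["generated"])
```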
3. Performance
Accuracy of different models on four RAG benchmarks: 2WikiMultiHopQA (2wiki), HotpotQA (HQA), Natural Questions (NQ), and TriviaQA (TQA).
Models | 2wiki | HQA | NQ | TQA |
---|---|---|---|---|
Tulu3-SFT | 62.0 | 68.4 | 58.6 | 75.7 |
Tulu3-RAG | 73.2 | 74.8 | 61.5 | 75.8 |
Tulu3-RAG-Superposition | 30.1 | 32.3 | 35.9 | 58.9 |
Tulu3-RAG-promptCache | 32.4 | 31.6 | 44.4 | 61.8 |
Tulu3-block-ft | 72.2 | 72.3 | 60.4 | 75.1 |
Tulu3-block-ft-full | 73.6 | 75.2 | 62.2 | 76.2 |
Tulu3-block-ft-w/o-pos | 68.9 | 69.9 | 59.2 | 74.4 |
Tulu3-block-w/o-ft | 42.9 | 42.1 | 48.3 | 66.5 |
Accuracy of different models on seven general benchmarks.
Models | IFEval (0-shot) | HumanEval (0-shot) | MMLU (0-shot) | GSM8K (4-shot) | MATH (4-shot) | BBH (3-shot) | DROP (3-shot) |
---|---|---|---|---|---|---|---|
Tulu3-SFT | 68.5 | 58.5 | 63.7 | 75.5 | 29.2 | 68.5 | 9.4 |
Tulu3-RAG | 68.3 | 65.2 | 63.6 | 75.6 | 28.6 | 68.5 | 10.4 |
Tulu3-block-ft | 70.0 | 59.1 | 63.0 | 75.7 | 28.8 | 65.3 | 14.4 |
For the first three zero-shot benchmarks, Block-attention falls back to full-attention. For the subsequent four few-shot ICL datasets, each in-context example is treated as an independent block; a k-shot sample is therefore divided into k + 1 blocks, as in the toy example below.
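As a toy illustration (the contents and formatting below are made up and do not come from the evaluation code), a 2-shot prompt would be sent as 2 + 1 = 3 blocks:

```python
# Toy 2-shot ICL prompt split into 2 + 1 = 3 blocks: one block per in-context
# example plus a final block holding the test question (contents are invented).
icl_blocks = [
    "<|user|>\nQ: Paris is the capital of which country?\nA: France\n\n",  # shot 1 -> block 1
    "Q: Rome is the capital of which country?\nA: Italy\n\n",              # shot 2 -> block 2
    "Q: Madrid is the capital of which country?\nA:\n<|assistant|>\n",     # query  -> final block
]
```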
4. Citation
If you find Block-Attention useful for your research, please cite our paper:
```bibtex
@inproceedings{ma2025blockattention,
  title     = {Block-Attention for Efficient Prefilling},
  author    = {Dongyang Ma and Yan Wang and Tian Lan},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=7zNYY1E2fq}
}
```