ASR Fuser

You can use this model to combine ASR outputs from different systems at the document level. The LLM is fine-tuned to predict the transcript by understanding the context and using cues from the different systems. The training data is both natural and synthetically generated from the IWSLT training data, so that it covers document-level outputs.

Model Usage

Model Loading

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
token = {your token here for LLM}
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto",attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, token=token)
model.load_adapter("skoneru/iwslt_asr_fuser") 
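
Loading the adapter requires the peft package in addition to transformers and torch. If flash-attn is not installed or your GPU does not support it, a reasonable fallback (an assumption on our part, not part of the original card) is to drop the attn_implementation argument and use the default attention:

# Fallback without FlashAttention-2 (assumption: default attention is acceptable)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, token=token)
model.load_adapter("skoneru/iwslt_asr_fuser")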

Prompt Format

Post-Edit the Automatic Speech Recognition Transcripts from different systems understanding the context.
ASR Transcripts:
System 1:
{system 1 outputs here}
System 2:
{system 2 outputs here}
System 3:
{system 3 outputs here}
System 4:
{system 4 outputs here}
Post-Edited Transcript:
{llm to generate}

Note that during training, we generated data from 4 systems, and the best-performing system was system 4. The model therefore falls back on system 4 when it cannot decide, so your strongest system should be placed in that slot.
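
For convenience, here is a minimal sketch of a helper that assembles this prompt from a list of system hypotheses. The name build_prompt is hypothetical and not part of the released code; it simply reproduces the template above:

def build_prompt(system_outputs):
    # system_outputs: list of transcript strings, ordered so that
    # the most reliable system comes last (system 4 above)
    prompt = ("Post-Edit the Automatic Speech Recognition Transcripts "
              "from different systems understanding the context.\n"
              "ASR Transcripts:\n")
    for i, hyp in enumerate(system_outputs, start=1):
        prompt += "System " + str(i) + ":\n" + hyp + "\n"
    prompt += "Post-Edited Transcript:\n"
    return prompt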

Model Inference

After loading the model and the tokenizer, you can simply use the model with the prompt format as shown below:

system1= "Embeddings such as word to back are very famous"
system2= "Embeddings such as bird to back are very famous"
system3= "Embeddings such as bird to back, are very famous"
system4= "Embeddings such as bird to back are very famous"

prompt = ["Post-Edit the Automatic Speech Recognition Transcripts from different systems understanding the context.\nASR Transcripts:\nSystem 1:\n" + system1 + "\nSystem 2:\n" + system2 + "\nSystem 3:\n" + system3 + "\nSystem 4:\n" + system4 + \nPost-Edited Transcript:\n"]
inputs = tokenizer(prompt, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)
num_beams = 5

output = model.generate(**inputs, num_beams=num_beams, max_new_tokens=2048, return_dict_in_generate=True, early_stopping=True, do_sample=False)
hyps = tokenizer.batch_decode(output.sequences[:,inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(hyps)

You should see "Embeddings such as word2vec are very famous.", which is the correct output given the context and formatting style.
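
Since the tokenizer was configured with left padding above, you can also fuse several segments in one batch. A minimal sketch, assuming the hypothetical build_prompt helper from the Prompt Format section and an illustrative list of segments:

segments = [
    [system1, system2, system3, system4],
    # ... more segments, each a list of 4 system hypotheses
]
prompts = [build_prompt(seg) for seg in segments]
inputs = tokenizer(prompts, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)
output = model.generate(**inputs, num_beams=5, max_new_tokens=2048,
                        return_dict_in_generate=True, early_stopping=True, do_sample=False)
# Left padding makes all prompts the same length, so one slice strips them all
hyps = tokenizer.batch_decode(output.sequences[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)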

πŸ“– Citation

If you use this model in your research, please cite:

@article{koneru2025kit,
  title={KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025},
  author={Koneru, Sai and Z{\"u}fle, Maike and Nguyen, Thai-Binh and Akti, Seymanur and Niehues, Jan and Waibel, Alexander},
  journal={arXiv preprint arXiv:2505.13036},
  year={2025},
  url={https://arxiv.org/abs/2505.13036}
}