Phi-3-MusiX 🎵

Phi-3-MusiX is a LoRA adapter for microsoft/Phi-3-vision-128k-instruct that enables understanding of symbolic music in the form of scanned music sheets, MIDI files, and structured annotations. The adapter equips Phi-3 with the ability to perform symbolic music reasoning and answer questions about sheet music images and MIDI content.


Inference

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
import requests
from io import BytesIO

def load_img(img_dir):
    """Load an RGB image from a local path or an http(s) URL."""
    if img_dir.startswith(('http://', 'https://')):
        response = requests.get(img_dir)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(img_dir).convert('RGB')
    return image


# load the base Phi-3-vision model and attach the Phi-3-MusiX LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-vision-128k-instruct',
    device_map="cuda", trust_remote_code=True, torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model.load_adapter('puar-playground/Phi-3-MusiX')


# example inputs: replace with your own question and image path/URL
question_string = 'What is the key signature of this piece?'
img_dir = 'path/to/music_sheet.png'

prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'

# build the chat message; <|image_1|> marks where the image is inserted
messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]

# load the image from a local path or URL
image = load_img(img_dir)

# apply the Phi-3 chat template and prepare model inputs
prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")

generation_args = {
    "max_new_tokens": 500,
    "do_sample": False,  # greedy decoding; a temperature setting has no effect when sampling is off
}

with torch.no_grad():
    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# strip the prompt tokens and decode only the newly generated answer
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(model_answer)
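As an alternative to model.load_adapter, the adapter can be attached with the peft library and, optionally, merged into the base weights so there is no LoRA overhead at inference time. A minimal sketch, assuming peft is installed and enough memory is available for the merge:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-vision-128k-instruct',
    device_map="cuda", trust_remote_code=True, torch_dtype="auto",
)
model = PeftModel.from_pretrained(base, 'puar-playground/Phi-3-MusiX')
model = model.merge_and_unload()  # optional: fold the LoRA weights into the base model

The merged model behaves the same as the snippet above but skips the adapter's extra matrix multiplications on every forward pass.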

🧪 Training Data

The model is trained on the MusiXQA dataset, which comprises four QA sets. Each entry in the dataset includes:

  • A scanned music sheet image
  • Its structured metadata (metadata.json)
  • A MIDI file
  • QA pairs targeting music understanding
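For quick inspection, the entries can be browsed with the datasets library. A minimal sketch; the repo id, split, and field names here are assumptions, so check the MusiXQA dataset card for the exact schema:

from datasets import load_dataset

# hypothetical repo id and split; see the dataset card for the actual values
ds = load_dataset('puar-playground/MusiXQA', split='train')
sample = ds[0]
print(sample.keys())  # fields should cover the image, metadata, MIDI, and QA pair listed above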

🎓 Reference

If you use this model or the MusiXQA dataset in your work, please cite the following reference:

@article{chen2025musixqa,
  title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
  author={Chen, Jian and Ma, Wenye and Liu, Penghang and Wang, Wei and Song, Tengwei and Li, Ming and Wang, Chenguang and Zhang, Ruiyi and Chen, Changyou},
  journal={arXiv preprint arXiv:2506.23009},
  year={2025}
}