# Phi-3-MusiX 🎵
Phi-3-MusiX is a LoRA adapter for microsoft/Phi-3-vision-128k-instruct that understands symbolic music in the form of scanned music sheets, MIDI files, and structured annotations. The adapter equips Phi-3 with the ability to perform symbolic music reasoning and answer questions about scanned music sheets and MIDI content.
## Inference
```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def load_img(img_dir):
    """Load an image from a local path or an http(s) URL."""
    if img_dir.startswith('http://') or img_dir.startswith('https://'):
        response = requests.get(img_dir)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(img_dir).convert('RGB')
    return image


# Load the base model and attach the Phi-3-MusiX LoRA adapter
# (loading the adapter requires the `peft` package)
model = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct',
                                             device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model.load_adapter('puar-playground/Phi-3-MusiX')

# Placeholders: substitute your own question and music-sheet image path/URL
question_string = 'What is the time signature of this piece?'
img_dir = 'path/to/music_sheet.png'

prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'

# Set up the chat message with one image slot
messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]

# Load the image from a path or URL
image = load_img(img_dir)

prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")

generation_args = {
    "max_new_tokens": 500,
    "do_sample": False,  # greedy decoding; a temperature setting would be ignored here
}

with torch.no_grad():
    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Strip the input tokens, keeping only the generated answer
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(model_answer)
```
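To sanity-check that the LoRA weights are actually active, you can toggle the adapter off and compare against the base model's answer. The sketch below assumes a recent transformers release, where `disable_adapters()` / `enable_adapters()` are provided by the PEFT integration; it reuses `inputs`, `generation_args`, and `model_answer` from the snippet above.

```python
# Temporarily disable the LoRA adapter and rerun the same generation,
# so the output reflects the plain Phi-3-vision backbone.
model.disable_adapters()
with torch.no_grad():
    base_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
base_answer = processor.batch_decode(base_ids[:, inputs['input_ids'].shape[1]:],
                                     skip_special_tokens=True)[0]
model.enable_adapters()  # re-enable the adapter for subsequent calls

print('adapter:', model_answer)
print('base:   ', base_answer)
```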
## 🧪 Training Data
The model is trained on the MusiXQA dataset, which includes four QA sets. Each entry in the dataset includes:
- A scanned music sheet image
- Its structured metadata (`metadata.json`)
- A MIDI file
- QA pairs targeting music understanding
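A minimal sketch for browsing the dataset with the `datasets` library is shown below. The dataset ID and split name are assumptions based on the adapter's namespace; check the Hub for the exact path, and note that the field names are not fixed here, so the snippet only prints the keys of one entry.

```python
from datasets import load_dataset

# Assumed Hub ID and split; verify against the dataset card.
ds = load_dataset("puar-playground/MusiXQA", split="train")

# Inspect the fields of a single entry
# (expected: sheet image, metadata, MIDI reference, QA pairs).
print(ds[0].keys())
```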
## 📖 Reference
If you use this dataset in your work, please cite it using the following reference:
```bibtex
@article{chen2025musixqa,
  title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
  author={Chen, Jian and Ma, Wenye and Liu, Penghang and Wang, Wei and Song, Tengwei and Li, Ming and Wang, Chenguang and Zhang, Ruiyi and Chen, Changyou},
  journal={arXiv preprint arXiv:2506.23009},
  year={2025}
}
```