---
license: mit
language: en
tags:
- llm
- music
- multimodal
- midi
- phi-3
- question-answering
- optical-music-recognition
model-index:
- name: Phi-3-MusiX
  results: []
datasets:
- puar-playground/MusiXQA
pipeline_tag: image-text-to-text
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---

# Phi-3-MusiX 🎵

**Phi-3-MusiX** is a LoRA adapter for [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) that enables understanding of symbolic music in the form of scanned music sheets, MIDI files, and structured annotations. The adapter equips Phi-3 with the ability to perform symbolic music reasoning and answer questions about scanned music sheets and MIDI content.

- Source code: [GitHub](https://github.com/puar-playground/MusiXQA)
- Dataset: [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA)
- Paper: [arXiv](https://arxiv.org/abs/2506.23009)

---

## Inference

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def load_img(img_dir):
    """Load an image from a local path or a URL and convert it to RGB."""
    if img_dir.startswith('http://') or img_dir.startswith('https://'):
        response = requests.get(img_dir)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(img_dir).convert('RGB')
    return image


# Load the base model and processor, then attach the Phi-3-MusiX LoRA adapter.
model = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct',
                                             device_map="cuda",
                                             trust_remote_code=True,
                                             torch_dtype="auto")
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model.load_adapter('puar-playground/Phi-3-MusiX')

# Placeholder inputs: replace with your own music sheet image and question.
img_dir = 'path/to/music_sheet.png'
question_string = 'What is the key signature of this piece?'

prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'

# Set up the chat message; <|image_1|> marks where the image is inserted.
messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]

# Load the image from the local path or URL.
image = load_img(img_dir)

prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")

generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.1,  # ignored when do_sample=False
    "do_sample": False,
}

with torch.no_grad():
    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Remove the input tokens, keeping only the newly generated answer.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(model_answer)
```

## 🧪 Training Data

The model is trained on the [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA) dataset, which includes four QA sets. Each entry in the dataset includes:

- A scanned music sheet image
- Its structured metadata (`metadata.json`)
- A MIDI file
- QA pairs targeting music understanding

---

## 🎓 Reference

If you use this model or the MusiXQA dataset in your work, please cite it using the following reference:

```bibtex
@misc{chen2025musixqaadvancingvisualmusic,
      title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
      author={Jian Chen and Wenye Ma and Penghang Liu and Wei Wang and Tengwei Song and Ming Li and Chenguang Wang and Jiayu Qin and Ruiyi Zhang and Changyou Chen},
      year={2025},
      eprint={2506.23009},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.23009},
}
```
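
## Loading MusiXQA

Below is a minimal, hypothetical sketch of how MusiXQA entries could be paired with the inference code above using the `datasets` library. The split name and the field names (`question`, `image`) are assumptions, not the confirmed schema; consult the [dataset card](https://huggingface.co/datasets/puar-playground/MusiXQA) for the actual configurations and fields.

```python
# Hedged sketch: load one MusiXQA entry and feed it to the inference code above.
# The split and field names below are assumptions; adjust to the real schema.
from datasets import load_dataset

ds = load_dataset("puar-playground/MusiXQA", split="train")  # split name is an assumption

sample = ds[0]
question_string = sample["question"]  # assumed field holding the question text
img_dir = sample["image"]             # assumed field holding the sheet image (or its path/URL)

# ...then build `prompt`, call `load_img` if `img_dir` is a path or URL, and run
# `model.generate(...)` exactly as shown in the Inference section.
```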