arxiv:2504.15071

Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling

Published on Apr 21, 2025

Abstract

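A new large-scale dataset of piano MIDI files is created by transcribing audio recordings of piano performances. A language model selects candidate recordings based on their metadata, and an audio classifier prunes and segments them. The paper also provides a statistical analysis of the pipeline and extracted metadata tags.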

AI-generated summary

We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.
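
As a rough sanity check on the scale quoted above, the two headline figures imply an average transcribed duration per file. The sketch below is only back-of-envelope arithmetic on the rounded numbers from the abstract; the function name and constants are illustrative and are not part of the dataset's tooling.

```python
# Rounded figures taken from the abstract; both are approximate.
NUM_FILES = 1_000_000   # "over one million" files, so this is a lower bound
TOTAL_HOURS = 100_000   # "roughly 100,000 hours" of transcribed audio


def mean_minutes_per_file(total_hours: float, num_files: int) -> float:
    """Average transcribed duration per file, in minutes."""
    return total_hours * 60 / num_files


avg_minutes = mean_minutes_per_file(TOTAL_HOURS, NUM_FILES)
print(f"about {avg_minutes:.1f} minutes per file")
```

Because the file count is stated as a lower bound, the true average is at most about six minutes per file, which is consistent with files corresponding to individual pieces or segments rather than full-length recordings.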