One interesting feature of our model is that, in contrast to some similar recent recurrent neural network approaches, it is not constrained to a constant rhythm. To train it, we took an existing set of MIDI files and applied a few simple preprocessing steps to extract the music as a set of pitches with arbitrary start and end times.


This set is derived from the four sets of MIDI files available here.
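
To illustrate what "pitches with arbitrary start and end times" means in practice, here is a minimal Python sketch that pairs note-on and note-off events into notes. It is illustrative only and is not the code used to build the dataset. The `Note` type, the `events_to_notes` function, and the `(time, kind, pitch, velocity)` event layout are all assumptions made for this example; a real pipeline would first read such events from the MIDI files with a MIDI parsing library.

```python
from collections import defaultdict
from typing import NamedTuple


class Note(NamedTuple):
    """Hypothetical note record: one pitch with free-form timing."""
    pitch: int    # MIDI pitch number, 0-127
    start: float  # onset time in seconds
    end: float    # release time in seconds


def events_to_notes(events):
    """Pair note-on/note-off events into notes with explicit start and end times.

    `events` is an iterable of (time, kind, pitch, velocity) tuples, where
    `kind` is "on" or "off" (an assumed layout, not a library format).
    A note-on with velocity 0 is treated as a note-off, following the
    usual MIDI convention.
    """
    open_notes = defaultdict(list)  # pitch -> onset times still sounding
    notes = []
    for time, kind, pitch, velocity in sorted(events, key=lambda e: e[0]):
        if kind == "on" and velocity > 0:
            open_notes[pitch].append(time)
        elif open_notes[pitch]:
            # Close the earliest unmatched onset for this pitch.
            start = open_notes[pitch].pop(0)
            notes.append(Note(pitch, start, time))
        # A note-off with no matching onset is silently ignored.
    return sorted(notes, key=lambda n: (n.start, n.pitch))


# Times are not snapped to any beat grid, so rhythms stay unconstrained.
events = [
    (0.0, "on", 60, 90),
    (0.37, "on", 64, 80),
    (0.5, "off", 60, 0),
    (1.213, "on", 64, 0),  # velocity-0 note-on acts as a note-off
]
notes = events_to_notes(events)
```

Because each note keeps its own start and end time rather than a position on a fixed grid, nothing in this representation forces a constant rhythm.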


More recently, we collected, cleaned, and preprocessed a larger set of 20,006 MIDI files. Because of the size and complexity of the new dataset, we implemented a number of more sophisticated preprocessing steps.

The source MIDI files were kindly made available by Pierre Schwob of Classical Archives. We are not permitted to distribute the original MIDI files, only our derived dataset.

Data Loader

An easy-to-use loader for Python is available here: