In this paper, researchers introduce Solos, a clean dataset of solo musical performances for training machine learning models on various audio-visual tasks.
(1) Juan F. Montesinos, Department of Information and Communications Technologies Universitat Pompeu Fabra, Barcelona, Spain {[email protected]};

(2) Olga Slizovskaia, Department of Information and Communications Technologies Universitat Pompeu Fabra, Barcelona, Spain {[email protected]};

(3) Gloria Haro, Department of Information and Communications Technologies Universitat Pompeu Fabra, Barcelona, Spain {[email protected]}.


We have presented Solos, a new audio-visual dataset of music recordings of soloists, suitable for different self-supervised learning tasks such as source separation using the mix-and-separate strategy, sound localization, cross-modal generation and finding audio-visual correspondences. The dataset contains 13 different instruments; these are common in chamber orchestras and are the same ones included in the University of Rochester Multi-Modal Music Performance (URMP) dataset [1]. The characteristics of URMP (a small dataset of real performances with ground-truth individual stems) make it suitable for testing purposes but, to the best of our knowledge, there is to date no large-scale dataset with the same instruments as URMP. Two different networks for audio-visual source separation based on the U-Net architecture have been trained on the new dataset and further evaluated on URMP, showing the impact of training on the same set of instruments as the test set. Moreover, Solos provides skeletons and timestamps indicating the video intervals in which hands are sufficiently visible. This information could be useful for training purposes and also for learning to solve the task of sound localization.
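The mix-and-separate strategy mentioned above can be illustrated with a minimal sketch: two solo recordings are summed into an artificial mixture, and the original solos supply the separation targets for free. Everything below is an assumption made for illustration rather than the authors' pipeline: the synthetic sine-wave "solos", the STFT parameters, the helper names `stft_magnitude` and `make_training_example`, and the choice of binary masks as targets.

```python
import numpy as np


def stft_magnitude(wave, n_fft=256, hop=64):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft over each frame, then transpose to (frequency_bins, frames)
    return np.abs(np.fft.rfft(frames, axis=1)).T


def make_training_example(solo_a, solo_b):
    """Build one mix-and-separate example from two solo waveforms.

    Returns the mixture spectrogram (network input) and one binary mask
    per source (illustrative training targets, assumed here).
    """
    mixture = solo_a + solo_b  # artificial mix of two solos
    mag_a = stft_magnitude(solo_a)
    mag_b = stft_magnitude(solo_b)
    mag_mix = stft_magnitude(mixture)
    # A time-frequency bin is assigned to whichever source dominates it.
    mask_a = (mag_a >= mag_b).astype(np.float32)
    mask_b = 1.0 - mask_a
    return mag_mix, mask_a, mask_b


# Hypothetical stand-ins for two solo recordings: pure tones at 220 Hz and 1760 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
solo_a = np.sin(2 * np.pi * 220 * t)
solo_b = np.sin(2 * np.pi * 1760 * t)

mag_mix, mask_a, mask_b = make_training_example(solo_a, solo_b)
```

In an audio-visual setting, a network would receive `mag_mix` together with visual features of each performer and be trained to predict the corresponding mask; no manual annotation is needed because the solos themselves provide the ground truth.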


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.