Dataset
INSTRUMENTS_32CLASS
INSTRUMENTS_32CLASS is a newly collected dataset of instrument performances and a subset of AudioSet. The data consist of 10-second excerpts from YouTube videos.
To obtain a less noisy, well-labeled dataset for zero-shot learning, we manually selected solo instrument performance videos with correct audio-visual correspondence.
Preprocessing
Because the information that identifies an instrument is carried mainly by its timbre and depends little on clip duration, we cropped each audio clip from 10 s to 3 s. For each video, the corresponding intermediate image frame is selected as its representative.
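The cropping step can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the 3 s window is taken from the center of the clip and that "intermediate frame" means the temporally middle frame, neither of which is stated above. The function names `crop_audio` and `middle_frame_index` are hypothetical.

```python
def crop_audio(samples, sample_rate, clip_seconds=3.0):
    """Return a centered window of `clip_seconds` from a 1-D sample sequence.

    Assumption: the crop is centered; the dataset description does not say
    which 3 s of the 10 s excerpt is kept.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len >= len(samples):
        # Clip is already short enough; return it unchanged (as a copy).
        return samples[:]
    start = (len(samples) - clip_len) // 2
    return samples[start:start + clip_len]


def middle_frame_index(num_frames):
    """Return the index of the temporally middle frame of a video."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    return num_frames // 2
```

For example, a 10 s clip sampled at 16 kHz has 160000 samples, and `crop_audio` keeps the 48000 samples in the middle.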
Samples from the dataset are shown below:
Training / Test set division
Training set: 24 instrument categories with 2857 image-audio pairs.
Test set: 8 instrument categories with 747 image-audio pairs.
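Since the 24 training categories and the 8 test categories are separate classes, the split is made by class rather than by sample. The sketch below shows one way to build such a split. The data layout (a list of `(path, label)` pairs), the function name `split_by_class`, and the class names in the usage note are illustrative assumptions, not the dataset's actual file format or category list.

```python
def split_by_class(pairs, test_classes):
    """Split (item, label) pairs so that test classes never appear in training.

    `pairs` is any iterable of (item, label) tuples; `test_classes` is the
    set of labels held out for zero-shot evaluation.
    """
    test_classes = set(test_classes)
    train, test = [], []
    for item, label in pairs:
        # Route every sample of a held-out class to the test split.
        (test if label in test_classes else train).append((item, label))
    return train, test
```

With hypothetical labels, `split_by_class([("a.jpg", "violin"), ("b.jpg", "flute")], {"flute"})` places the violin sample in the training split and the flute sample in the test split.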
Segmentation for sound localization
We provide the instrument mask of each image in the test set, so that the dataset can be used for quantitative evaluation of sound localization.
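One common way to use such masks is to binarize a predicted localization map and compare it with the ground-truth mask by intersection over union (IoU). The sketch below assumes both masks are same-sized 2-D grids of 0/1 values; the choice of IoU and the function name `mask_iou` are assumptions for illustration, and this is not necessarily the evaluation protocol intended by the dataset authors.

```python
def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0
```

A prediction covering the top row of a 2x2 image, compared against a mask covering the left column, shares one pixel out of three covered pixels, giving an IoU of 1/3.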
Detailed Statistics
Detailed statistics of the dataset, together with sample image-audio pairs and image masks, are shown below: