Dataset

INSTRUMENTS_32CLASS


INSTRUMENTS_32CLASS is a newly collected dataset of instrument performances and a subset of AudioSet. The data consists of 10-second excerpts from YouTube videos.

To obtain a less noisy, well-labeled dataset for zero-shot learning, we manually selected solo instrument performance videos with correct audio-visual correspondence.



Preprocessing

Because the identity of an instrument is carried mainly by its timbre and depends little on clip duration, we cropped each audio clip from 10 s to 3 s. For the videos, the corresponding intermediate image frames are selected as their representatives.
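The cropping and frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' script: the text does not say which 3 s window is kept, so the sketch assumes a crop centred in the clip, and it takes "intermediate frame" to mean the frame at the temporal midpoint. The function names `center_crop_audio` and `middle_frame_index` and the 16 kHz sample rate in the usage lines are hypothetical.

```python
import numpy as np


def center_crop_audio(waveform: np.ndarray, sample_rate: int,
                      crop_seconds: float = 3.0) -> np.ndarray:
    """Keep a crop_seconds-long window from the middle of the waveform.

    Assumption: the crop is centred; the source only states that clips
    are shortened from 10 s to 3 s.
    """
    crop_len = int(round(crop_seconds * sample_rate))
    total_len = waveform.shape[-1]
    if total_len < crop_len:
        raise ValueError("clip is shorter than the requested crop")
    start = (total_len - crop_len) // 2
    # Slice along the last axis so mono (T,) and multi-channel (C, T) both work.
    return waveform[..., start:start + crop_len]


def middle_frame_index(num_frames: int) -> int:
    """Index of the frame at the temporal midpoint of a video."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    return num_frames // 2


# Usage with a synthetic 10 s clip at an assumed 16 kHz sample rate.
clip = np.arange(10 * 16000)
cropped = center_crop_audio(clip, sample_rate=16000)
frame_idx = middle_frame_index(250)
```

The frame index can then be passed to whichever video reader is in use to extract the representative image.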


Samples from the dataset are shown below:

Double bass00001 Electric guitar00001 Steel guitar00003 Glockenspiel00006
Didgeridoo00001 French horn00001 Shofar00005 Tabla00001



Training / Test set division

Training set: 24 instrument categories with 2857 image-audio pairs.

Test set: 8 instrument categories with 747 image-audio pairs.
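For zero-shot learning, the training and test sets share no instrument category. The sketch below shows one way to build such a class-disjoint split. It is illustrative only: the source does not state how the 8 test categories were chosen, so the seeded random selection here is an assumption, and `zero_shot_split` and the toy class names are hypothetical.

```python
import random


def zero_shot_split(pairs, num_test_classes, seed=0):
    """Split (sample_id, class_name) pairs so that train and test classes are disjoint.

    Assumption: test classes are drawn at random with a fixed seed; the
    dataset's actual class assignment is not reproduced here.
    """
    classes = sorted({class_name for _, class_name in pairs})
    if num_test_classes > len(classes):
        raise ValueError("more test classes requested than classes available")
    rng = random.Random(seed)
    test_classes = set(rng.sample(classes, num_test_classes))
    train = [p for p in pairs if p[1] not in test_classes]
    test = [p for p in pairs if p[1] in test_classes]
    return train, test


# Toy example: 32 placeholder classes with 3 samples each.
toy_pairs = [(f"class{c:02d}_{i:05d}", f"class{c:02d}")
             for c in range(32) for i in range(3)]
train_pairs, test_pairs = zero_shot_split(toy_pairs, num_test_classes=8)
```

Splitting by class rather than by sample ensures that every test category is unseen during training.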



Segmentation for sound localization

We provide the instrument mask of each image in the test set, so that the dataset can be used for quantitative evaluation of sound localization.
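As an illustration of how the masks could support quantitative evaluation, the sketch below computes intersection-over-union (IoU) between a thresholded localization heatmap and a binary instrument mask. The source does not name an evaluation metric, so IoU, the 0.5 threshold, and the function name `localization_iou` are assumptions chosen for the example.

```python
import numpy as np


def localization_iou(heatmap: np.ndarray, mask: np.ndarray,
                     threshold: float = 0.5) -> float:
    """IoU between the thresholded heatmap and the ground-truth mask.

    heatmap: 2-D array of localization scores, assumed to lie in [0, 1].
    mask:    2-D array of the same shape, non-zero where the instrument is.
    """
    if heatmap.shape != mask.shape:
        raise ValueError("heatmap and mask must have the same shape")
    pred = heatmap >= threshold
    gt = mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Convention chosen here: empty prediction and empty mask score 0.
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


# Toy example: the prediction covers the top-left 2x2 block,
# while the mask covers the top two rows of a 4x4 image.
heatmap = np.full((4, 4), 0.1)
heatmap[:2, :2] = 0.9
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:2, :] = 1
iou = localization_iou(heatmap, mask)
```

Averaging this score over the test images gives a single number for comparing localization methods.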



Detailed Statistics

Detailed statistics of the dataset, along with sample image-audio pairs and image masks, are shown below:

segmentationdataset