Results

LLINet is trained and evaluated on three tasks:

1. Zero-Shot Cross-Modal Retrieval

2. Zero-Shot Sound Localization

3. Zero-Shot Recognition



Zero-Shot Cross-Modal Retrieval

Note that there are many excellent image-text cross-modal retrieval models in the literature. To compare with these image-text models, we replace their text encoders with our audio encoder. For a fair comparison, the input audio for all methods is represented by log Mel filter bank spectrograms, as in our method.
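As a concrete illustration, here is a minimal sketch of this preprocessing step using torchaudio; the parameter values (n_fft, hop_length, n_mels) and the input file name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torchaudio

def log_mel_spectrogram(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Compute a log Mel filter bank spectrogram from a raw waveform."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,      # STFT window size (assumed value)
        hop_length=512,  # frame shift (assumed value)
        n_mels=64,       # number of Mel bands (assumed value)
    )(waveform)
    # Log compression; the epsilon avoids log(0) on silent frames.
    return torch.log(mel + 1e-6)

waveform, sr = torchaudio.load("example.wav")  # hypothetical input file
spec = log_mel_spectrogram(waveform, sr)
```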


Evaluation Metrics: mAP and R@1 in both retrieval directions, image-to-audio (I2A) and audio-to-image (A2I)
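For reference, the sketch below shows how R@1 can be computed in both directions; it assumes L2-normalized image and audio embeddings where row i of each matrix forms a matching pair (all names are illustrative, and mAP is computed analogously by averaging precision over the full ranking).

```python
import numpy as np

def recall_at_1(query_emb: np.ndarray, gallery_emb: np.ndarray) -> float:
    """R@1: fraction of queries whose top-ranked gallery item is the true match."""
    sims = query_emb @ gallery_emb.T   # cosine similarities (inputs assumed normalized)
    top1 = sims.argmax(axis=1)         # best gallery index per query
    return float((top1 == np.arange(len(query_emb))).mean())

# I2A: images as queries, audio clips as gallery; A2I swaps the arguments.
# r1_i2a = recall_at_1(img_emb, aud_emb)
# r1_a2i = recall_at_1(aud_emb, img_emb)
```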


Results:

| Model | mAP (I2A) | R@1 (I2A) | mAP (A2I) | R@1 (A2I) |
|---|---|---|---|---|
| DAR | 13.3 | 13.4 | 13.6 | 15.3 |
| ULSLVC | 13.4 | 8.2 | 20.1 | 21.0 |
| SIR | 21.2 | 20.9 | 15.4 | 12.8 |
| SCAN | 16.5 | 16.7 | 18.2 | 15.0 |
| DSCMR | 14.8 | 20.5 | 18.1 | 24.6 |
| TIMAM | 39.1 | 28.5 | 25.8 | 25.0 |
| CMPM | 39.3 | 28.9 | 26.2 | 25.8 |
| CME | 30.1 | 25.7 | 27.4 | 30.8 |
| LLINet | 49.3 | 41.1 | 31.2 | 38.3 |


Visualization:

*(Figure: cross-modal retrieval examples)*



Zero-Shot Sound Localization

Based on the attention module, the highlighted region of the attention map (the region with larger values) can be taken as the region related to the sound source.
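The sketch below illustrates this step: the attention map is normalized and thresholded to obtain a binary mask of the predicted sound-source region (the threshold value is an assumption, not the paper's setting).

```python
import numpy as np

def localize(attention_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize an attention map into a predicted sound-source mask."""
    a_min, a_max = attention_map.min(), attention_map.max()
    normalized = (attention_map - a_min) / (a_max - a_min + 1e-8)  # scale to [0, 1]
    return normalized >= threshold  # True where attention is high
```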


Compared model: AVOL


Ablation study:

  1. Baseline: LLINet with the attention module removed, trained only with the matching loss
  2. Baseline+AM: the baseline with the introduced attention module added, but still trained only with the matching loss
  3. LLINet: Baseline+AM optimized with both the matching loss and the attention loss


Evaluation Metrics: mIoU (mean intersection over union), AUC (area under the curve)
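A minimal sketch of the IoU underlying the mIoU metric, assuming binary masks for the prediction and the ground-truth annotation; mIoU is this value averaged over the test set.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection / union) if union > 0 else 0.0
```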


Results:

| Model | mIoU | AUC |
|---|---|---|
| AVOL | 26.7 | 27.1 |
| Baseline | 24.1 | 24.6 |
| Baseline+AM | 32.3 | 32.7 |
| LLINet | 36.8 | 37.1 |


Visualization:

1. Success plots: *(figure: success plots)*

2. Localization: *(figure: localization examples)*



Zero-Shot Recognition

Prototype Vector: To perform ZSL, we obtain the prototype vector of a class by averaging all audio features from that class, where the audio features are extracted by the audio encoder of LLINet.
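A minimal sketch of this prototype construction and the resulting nearest-prototype classification, assuming precomputed embeddings (all function and variable names are illustrative):

```python
import numpy as np

def build_prototypes(audio_feats: np.ndarray, labels: np.ndarray) -> dict:
    """Class prototype = mean of that class's audio embeddings."""
    return {c: audio_feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(img_feat: np.ndarray, prototypes: dict):
    """Assign the image to the class whose prototype is most similar (cosine)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(prototypes, key=lambda c: cos(img_feat, prototypes[c]))
```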


GZSL: A classifier is trained on the seen classes. At test time, this classifier estimates whether an image belongs to a seen or an unseen class by comparing its confidence score with a threshold.
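The gating step could look like the sketch below, which routes a test image to the seen-class classifier or to the unseen-class prototypes (via the classify helper above) depending on the classifier's maximum confidence; the threshold is an assumed hyperparameter and the classifier interface follows scikit-learn conventions.

```python
import numpy as np

def gzsl_predict(img_feat: np.ndarray, seen_clf, unseen_prototypes: dict,
                 threshold: float = 0.5):
    """Predict a seen class if confident, otherwise fall back to ZSL."""
    probs = seen_clf.predict_proba(img_feat[None])[0]  # sklearn-style classifier
    if probs.max() >= threshold:
        return seen_clf.classes_[probs.argmax()]       # confident -> seen class
    return classify(img_feat, unseen_prototypes)       # otherwise -> unseen class
```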


Compared models: Since this is the first work on zero-shot image recognition based on audio information, we compare LLINet with several representative ZSL methods. For the compared models, we use VGGish, an audio classification model pre-trained on YouTube-100M, to extract audio features.


Evaluation metrics:

  1. ZSL: accuracy (acc)
  2. GZSL: tr (accuracy on seen classes), ts (accuracy on unseen classes), and hm (the harmonic mean of tr and ts, hm = 2 · tr · ts / (tr + ts))


Results:

| Model | acc | tr | ts | hm |
|---|---|---|---|---|
| DCN | 29.8 | 48.0 | 26.0 | 33.7 |
| LDE | 51.7 | 61.7 | 39.9 | 48.5 |
| Relation Net | 55.4 | 60.8 | 40.0 | 48.0 |
| SAE | 57.7 | 63.7 | 38.7 | 48.1 |
| LLINet | 56.7 | 68.6 | 39.4 | 50.1 |
| LLINet (VGGish) | 62.2 | 68.7 | 41.7 | 51.9 |