Results

LLINet is trained and evaluated on three tasks:

1. Zero-Shot Cross-Modal Retrieval

2. Zero-Shot Sound Localization

3. Zero-Shot Recognition



Zero-Shot Cross-Modal Retrieval

Note that there are many excellent image-text cross-modal retrieval models in the literature. To compare with these image-text models, we replace their text encoders with our audio encoder. For a fair comparison, the input audio for all methods is represented by log Mel filter bank spectrograms, as in our method.
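As a concrete illustration, here is a minimal sketch of this preprocessing step using torchaudio; the parameter values (n_fft, hop_length, n_mels) and the input file name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torchaudio

def log_mel_spectrogram(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Compute a log Mel filter bank spectrogram from a raw waveform."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,      # STFT window size (assumed value)
        hop_length=512,  # frame shift (assumed value)
        n_mels=64,       # number of Mel bands (assumed value)
    )(waveform)
    # Log compression; the epsilon avoids log(0) on silent frames.
    return torch.log(mel + 1e-6)

waveform, sr = torchaudio.load("example.wav")  # hypothetical input file
spec = log_mel_spectrogram(waveform, sr)
```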


Evaluation Metrics: mAP and R@1 in both retrieval directions, image-to-audio (I2A) and audio-to-image (A2I)
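For reference, the sketch below shows how R@1 can be computed in both directions; it assumes L2-normalized image and audio embeddings where row i of each matrix forms a matching pair (all names are illustrative, and mAP is computed analogously by averaging precision over the full ranking).

```python
import numpy as np

def recall_at_1(query_emb: np.ndarray, gallery_emb: np.ndarray) -> float:
    """R@1: fraction of queries whose top-ranked gallery item is the true match."""
    sims = query_emb @ gallery_emb.T   # cosine similarities (inputs assumed normalized)
    top1 = sims.argmax(axis=1)         # best gallery index per query
    return float((top1 == np.arange(len(query_emb))).mean())

# I2A: images as queries, audio clips as gallery; A2I swaps the arguments.
# r1_i2a = recall_at_1(img_emb, aud_emb)
# r1_a2i = recall_at_1(aud_emb, img_emb)
```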


Results:

| Model | mAP (I2A) | R@1 (I2A) | mAP (A2I) | R@1 (A2I) |
|---|---|---|---|---|
| DAR | 13.3 | 13.4 | 13.6 | 15.3 |
| ULSLVC | 13.4 | 8.2 | 20.1 | 21.0 |
| SIR | 21.2 | 20.9 | 15.4 | 12.8 |
| SCAN | 16.5 | 16.7 | 18.2 | 15.0 |
| DSCMR | 14.8 | 20.5 | 18.1 | 24.6 |
| TIMAM | 39.1 | 28.5 | 25.8 | 25.0 |
| CMPM | 39.3 | 28.9 | 26.2 | 25.8 |
| CME | 30.1 | 25.7 | 27.4 | 30.8 |
| LLINet | 49.3 | 41.1 | 31.2 | 38.3 |


Visualization:

*(Figure: cross-modal retrieval examples)*



Zero-Shot Sound Localization

Based on the attention module, the highlighted region of the attention map (the region with larger values) can be taken as the region related to the sound source.
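The sketch below illustrates this step: the attention map is normalized and thresholded to obtain a binary mask of the predicted sound-source region (the threshold value is an assumption, not the paper's setting).

```python
import numpy as np

def localize(attention_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize an attention map into a predicted sound-source mask."""
    a_min, a_max = attention_map.min(), attention_map.max()
    normalized = (attention_map - a_min) / (a_max - a_min + 1e-8)  # scale to [0, 1]
    return normalized >= threshold  # True where attention is high
```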


Compared model: AVOL


Ablation study:

  1. Baseline: LLINet with the attention module removed, trained only with the matching loss
  2. Baseline+AM: the baseline with the introduced attention module added, but still trained only with the matching loss
  3. LLINet: Baseline+AM optimized with both the matching loss and the attention loss


Evaluation Metrics: mIoU (mean intersection over union), AUC (area under the curve)
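A minimal sketch of the IoU underlying the mIoU metric, assuming binary masks for the prediction and the ground-truth annotation; mIoU is this value averaged over the test set.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection / union) if union > 0 else 0.0
```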


Results:

| Model | mIoU | AUC |
|---|---|---|
| AVOL | 26.7 | 27.1 |
| Baseline | 24.1 | 24.6 |
| Baseline+AM | 32.3 | 32.7 |
| LLINet | 36.8 | 37.1 |


Visualization:

1. Success plots: *(figure: success plots)*

2. Localization: *(figure: localization examples)*



Zero-Shot Recognition

Prototype Vector: To perform ZSL, we obtain the prototype vector of a class by averaging all audio features from that class, where the audio features are extracted by the audio encoder of LLINet.
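A minimal sketch of this prototype construction and the resulting nearest-prototype classification, assuming precomputed embeddings (all function and variable names are illustrative):

```python
import numpy as np

def build_prototypes(audio_feats: np.ndarray, labels: np.ndarray) -> dict:
    """Class prototype = mean of that class's audio embeddings."""
    return {c: audio_feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(img_feat: np.ndarray, prototypes: dict):
    """Assign the image to the class whose prototype is most similar (cosine)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(prototypes, key=lambda c: cos(img_feat, prototypes[c]))
```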


GZSL: A classifier is trained on the seen classes. At test time, this classifier estimates whether an image belongs to a seen or an unseen class by comparing its confidence score with a threshold.
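The gating step could look like the sketch below, which routes a test image to the seen-class classifier or to the unseen-class prototypes (via the classify helper above) depending on the classifier's maximum confidence; the threshold is an assumed hyperparameter and the classifier interface follows scikit-learn conventions.

```python
import numpy as np

def gzsl_predict(img_feat: np.ndarray, seen_clf, unseen_prototypes: dict,
                 threshold: float = 0.5):
    """Predict a seen class if confident, otherwise fall back to ZSL."""
    probs = seen_clf.predict_proba(img_feat[None])[0]  # sklearn-style classifier
    if probs.max() >= threshold:
        return seen_clf.classes_[probs.argmax()]       # confident -> seen class
    return classify(img_feat, unseen_prototypes)       # otherwise -> unseen class
```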


Compared models: Since this is the first work on zero-shot image recognition based on audio information, we compare LLINet with several representative ZSL methods. For the compared models, we use VGGish, an audio classification model pre-trained on YouTube-100M, to extract audio features.


Evaluation metrics:

  1. ZSL: accuracy (acc)
  2. GZSL: tr (accuracy on seen classes), ts (accuracy on unseen classes), and hm (the harmonic mean of tr and ts, hm = 2 · tr · ts / (tr + ts))


Results:

| Model | acc | tr | ts | hm |
|---|---|---|---|---|
| DCN | 29.8 | 48.0 | 26.0 | 33.7 |
| LDE | 51.7 | 61.7 | 39.9 | 48.5 |
| Relation Net | 55.4 | 60.8 | 40.0 | 48.0 |
| SAE | 57.7 | 63.7 | 38.7 | 48.1 |
| LLINet | 56.7 | 68.6 | 39.4 | 50.1 |
| LLINet (VGGish) | 62.2 | 68.7 | 41.7 | 51.9 |