Results
LLINet is trained and evaluated on three tasks:
1. Zero-Shot Cross-Modal Retrieval
2. Zero-Shot Sound Localization
3. Zero-Shot Recognition
Zero-Shot Cross-Modal Retrieval
Note that the literature contains many strong image-text cross-modal retrieval models. To compare against them, we replace their text encoders with our audio encoder. For a fair comparison, the input audio for all methods is represented as log Mel filter bank spectrograms, as in our method.
Evaluation Metrics: mAP and R@1 in both retrieval directions, image-to-audio (I2A) and audio-to-image (A2I)
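As a sketch of how the two-direction metrics can be computed, assuming a label-match relevance criterion (the paper's exact pairing protocol may differ):

```python
import numpy as np

def retrieval_metrics(sim, labels_q, labels_g):
    """Compute mAP and R@1 for one retrieval direction.

    sim: (num_queries, num_gallery) similarity matrix.
    labels_q / labels_g: class labels; a gallery item counts as
    relevant to a query when their labels match (an assumption here).
    """
    order = np.argsort(-sim, axis=1)            # rank gallery by similarity
    rel = labels_g[order] == labels_q[:, None]  # relevance of each ranked item
    r_at_1 = rel[:, 0].mean()                   # top-1 result relevant?
    aps = []
    for row in rel:                             # average precision per query
        hits = np.where(row)[0]
        if len(hits) == 0:
            aps.append(0.0)
            continue
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precisions.mean())
    return float(np.mean(aps)), float(r_at_1)
```

For I2A, `sim[i, j]` is the similarity between image `i` and audio `j`; passing `sim.T` (with the label arrays swapped) gives the A2I direction.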
Results:
Model | mAP(I2A) | R@1(I2A) | mAP(A2I) | R@1(A2I) |
---|---|---|---|---|
DAR | 13.3 | 13.4 | 13.6 | 15.3 |
ULSLVC | 13.4 | 8.2 | 20.1 | 21.0 |
SIR | 21.2 | 20.9 | 15.4 | 12.8 |
SCAN | 16.5 | 16.7 | 18.2 | 15.0 |
DSCMR | 14.8 | 20.5 | 18.1 | 24.6 |
TIMAM | 39.1 | 28.5 | 25.8 | 25.0 |
CMPM | 39.3 | 28.9 | 26.2 | 25.8 |
CME | 30.1 | 25.7 | 27.4 | 30.8 |
LLINet | 49.3 | 41.1 | 31.2 | 38.3 |
Visualization:
Zero-Shot Sound Localization
Based on the attention module, the highlighted region of the attention map (the region with larger values) is taken as the region related to the sound source.
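One way to turn an attention map into a localization mask is to normalize and threshold it; the min-max normalization and the 0.5 threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def localize_from_attention(att, threshold=0.5):
    """Binarize an attention map into a predicted sound-source region.
    Normalization scheme and threshold are illustrative assumptions."""
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)  # scale to [0, 1]
    return att >= threshold   # highlighted region = predicted source

def iou(pred_mask, gt_mask):
    """Intersection-over-Union between predicted and ground-truth masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0
```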
Compared model: AVOL
Ablation study:
- Baseline: remove the attention module from LLINet and train the model with only the matching loss
- Baseline+AM: add the introduced attention module, but still train the model with only the matching loss
- LLINet: Baseline+AM optimized with both the matching loss and the attention loss
Evaluation Metrics: mIoU, AUC
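mIoU is the mean IoU over test samples; the localization AUC is commonly read as the area under the success curve, i.e. the fraction of samples whose IoU exceeds each threshold, integrated over thresholds. A sketch under that reading (threshold grid is an assumption):

```python
import numpy as np

def success_curve_auc(ious, thresholds=None):
    """Success rate at each IoU threshold, and the trapezoidal area
    under that curve (one common reading of the localization AUC)."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)   # assumed uniform grid
    ious = np.asarray(ious)
    success = np.array([(ious >= t).mean() for t in thresholds])
    # trapezoidal area under the success curve (uniform spacing)
    mids = (success[1:] + success[:-1]) / 2.0
    auc = float(mids.mean() * (thresholds[-1] - thresholds[0]))
    return success, auc
```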
Results:
Model | mIoU | AUC |
---|---|---|
AVOL | 26.7 | 27.1 |
Baseline | 24.1 | 24.6 |
Baseline+AM | 32.3 | 32.7 |
LLINet | 36.8 | 37.1 |
Visualization:
1. Success plots:
Zero-Shot Recognition
Prototype Vector: To perform ZSL, the prototype vector of a class is obtained by averaging all audio features from that class, where the audio features are extracted by the audio encoder of LLINet.
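The prototype construction and nearest-prototype classification can be sketched as follows; cosine similarity is an assumption here, as the paper may use a different distance:

```python
import numpy as np

def class_prototypes(audio_feats, labels):
    """Prototype of each class = mean of its audio features
    (features assumed to come from LLINet's audio encoder)."""
    classes = np.unique(labels)
    protos = np.stack([audio_feats[labels == c].mean(axis=0) for c in classes])
    return protos, classes

def zsl_predict(img_feat, protos, classes):
    """Assign an image feature to the most similar class prototype
    (cosine similarity; an illustrative choice)."""
    sims = protos @ img_feat / (
        np.linalg.norm(protos, axis=1) * np.linalg.norm(img_feat) + 1e-8)
    return classes[np.argmax(sims)]
```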
GZSL: A classifier is trained on the seen classes. During testing, this classifier estimates whether an image belongs to a seen or an unseen class by comparing its confidence score with a threshold.
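A minimal sketch of this gating scheme, with hypothetical inputs (the seen-class classifier's score vector, prototype similarities over unseen classes) and threshold `tau`:

```python
import numpy as np

def gzsl_predict(seen_scores, seen_classes, unseen_sims, unseen_classes, tau):
    """Route a test image to the seen-class classifier or the unseen-class
    prototype matcher via a confidence threshold (hypothetical sketch)."""
    if np.max(seen_scores) >= tau:                      # confident -> seen class
        return seen_classes[np.argmax(seen_scores)]
    return unseen_classes[np.argmax(unseen_sims)]       # otherwise -> unseen class
```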
Compared models: Since this is the first work on zero-shot image recognition based on audio information, we compare LLINet with several representative ZSL methods. For the compared models, we extract audio features with VGGish, an audio classification model pre-trained on YouTube-100M.
Evaluation metrics:
- ZSL: accuracy
- GZSL: tr (accuracy on seen classes), ts (accuracy on unseen classes), hm (harmonic mean of tr and ts)
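The harmonic mean is the standard GZSL summary metric, and the hm column in the table below can be reproduced from tr and ts:

```python
def gzsl_harmonic_mean(tr, ts):
    """Harmonic mean of seen-class accuracy (tr) and unseen-class
    accuracy (ts): hm = 2 * tr * ts / (tr + ts)."""
    return 2.0 * tr * ts / (tr + ts) if (tr + ts) > 0 else 0.0
```

For example, DCN's tr = 48.0 and ts = 26.0 give hm = 33.7, matching the table.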
Results:
Model | acc | tr | ts | hm |
---|---|---|---|---|
DCN | 29.8 | 48.0 | 26.0 | 33.7 |
LDE | 51.7 | 61.7 | 39.9 | 48.5 |
Relation Net | 55.4 | 60.8 | 40.0 | 48.0 |
SAE | 57.7 | 63.7 | 38.7 | 48.1 |
LLINet | 56.7 | 68.6 | 39.4 | 50.1 |
LLINet(VGGish) | 62.2 | 68.7 | 41.7 | 51.9 |