LLINet

Abstract


In this work, a Look, Listen and Infer Network (LLINet) is proposed, for the first time, to learn a zero-shot model that can infer the relations between visual scenes and sounds from novel categories never seen before.

Figure: Overall pipeline of LLINet.

LLINet is mainly designed for two tasks, i.e., image-audio cross-modal retrieval and sound localization in images. Towards this end, it is built as a two-branch encoding network that constructs a common space for images and audio. Besides, a cross-modal attention mechanism is proposed in LLINet to localize sounding objects. To evaluate LLINet, a new dataset, named INSTRUMENT-32CLASS, is collected in this work.

Besides zero-shot cross-modal retrieval and sound localization, a zero-shot image recognition task based on sounds is also conducted on this dataset. Experimental results on all of these tasks demonstrate the effectiveness of LLINet, indicating that zero-shot learning for visual scenes and sounds is feasible.
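
One way the sound-based zero-shot recognition step could be carried out is by comparing an image embedding against one audio prototype per unseen class in the common space. The snippet below is a minimal sketch under that assumption; the function name `zero_shot_classify` and the prototype construction are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_audio_protos):
    """Assign an image embedding to the unseen class whose audio
    prototype is closest in the common space.

    image_emb:          (D,)   image embedding in the common space
    class_audio_protos: (C, D) one (e.g. mean) audio embedding per unseen class
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), class_audio_protos)  # (C,)
    return sims.argmax().item()  # index of the predicted class
```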



Framework


LLINet utilizes an audio encoder and a visual encoder to embed input audio clips and images, respectively, into a shared common space, so that simple nearest-neighbor methods can be used to perform the cross-modal retrieval task.
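
As a rough sketch of how retrieval works once both modalities live in the common space, the snippet below ranks a gallery of embeddings (e.g. images) by similarity to a query embedding (e.g. an audio clip). The function name `retrieve` and the use of cosine similarity are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, k=5):
    """Return the indices of the top-k gallery embeddings most similar
    to the query embedding in the shared common space."""
    query = F.normalize(query_emb.unsqueeze(0), dim=1)   # (1, D)
    gallery = F.normalize(gallery_embs, dim=1)           # (N, D)
    sims = (query @ gallery.t()).squeeze(0)              # (N,) cosine similarities
    return sims.topk(k).indices                          # top-k nearest neighbors
```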


Figure: Model framework of LLINet.

Audio Encoder: The audio encoder consists of one convolution block, two residually connected blocks and a fully connected layer. The architecture of the convolution block is similar to the one in ResNet-18. The last layer of the audio encoder is a single fully connected layer that maps each audio feature into a 1024-d vector.
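
A minimal PyTorch sketch of such an audio encoder is given below. The channel widths, kernel sizes, and spectrogram input shape are assumptions for illustration, since the exact configuration is not specified here.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block with two 3x3 convolutions (ResNet-18 style)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class AudioEncoder(nn.Module):
    """Conv block -> two residual blocks -> FC to a 1024-d embedding."""
    def __init__(self, in_channels=1, embed_dim=1024):
        super().__init__()
        # Convolution block, similar to the stem of ResNet-18 (assumed widths)
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.res1 = BasicBlock(64)
        self.res2 = BasicBlock(64)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, embed_dim)   # map audio feature to the common space

    def forward(self, x):                    # x: (B, 1, F, T) spectrogram (assumed input)
        x = self.stem(x)
        x = self.res2(self.res1(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)                    # (B, 1024)
```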


Image Encoder: ResNet-101, pre-trained on ImageNet, is adopted as the image feature extractor. On top of it, two linear layers are employed to map visual features into the common space.
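
The image branch can be sketched as follows, assuming the pre-trained ResNet-101 backbone is kept frozen and projected into a 1024-d common space; these choices, and the hidden width of the projection, are illustrative assumptions.

```python
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """ResNet-101 backbone followed by two linear layers that project
    visual features into the shared common space."""
    def __init__(self, embed_dim=1024, hidden_dim=2048):
        super().__init__()
        backbone = models.resnet101(pretrained=True)
        # Drop the classification head; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.backbone.parameters():   # freeze the pre-trained extractor (assumption)
            p.requires_grad = False
        self.proj = nn.Sequential(             # two linear layers into the common space
            nn.Linear(2048, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x):                      # x: (B, 3, H, W)
        feats = self.backbone(x).flatten(1)    # (B, 2048)
        return self.proj(feats)                # (B, embed_dim)
```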


Attention module: The attention module consists of an attention layer and a transformer. Given an image-audio pair, the attention layer and the transformer respectively project the local feature map extracted from the second block of the image encoder and the global audio feature into a common space, so that a similarity score can be computed between each spatial position of the image and the audio feature.
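
One possible reading of this module in PyTorch is sketched below, treating the "attention layer" as a 1x1 convolution on the local image features and the "transformer" as a linear transform of the audio feature; the dimensions, normalization, and module names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Project local image features and the global audio feature into a
    common space and score each spatial position against the audio."""
    def __init__(self, img_channels=512, audio_dim=1024, attn_dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(img_channels, attn_dim, kernel_size=1)  # "attention layer" (assumed 1x1 conv)
        self.aud_proj = nn.Linear(audio_dim, attn_dim)                     # "transformer" (assumed linear transform)

    def forward(self, img_feat, aud_feat):
        # img_feat: (B, C, H, W) local feature map from the image encoder
        # aud_feat: (B, D) global audio embedding
        v = F.normalize(self.img_proj(img_feat), dim=1)      # (B, A, H, W)
        a = F.normalize(self.aud_proj(aud_feat), dim=1)      # (B, A)
        scores = torch.einsum('bahw,ba->bhw', v, a)          # cosine similarity per position
        return scores                                        # (B, H, W) sound-localization map
```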