#5841. Multi-Tier Attention Network using Term-weighted Question Features for Visual Question Answering

July 2026publication date
Proposal available till 17-05-2025
4 total number of authors per manuscript0 $

The title of the journal is available only for the authors who have already paid for
Journal’s subject area:
Computer Vision and Pattern Recognition;
Signal Processing;
Places in the authors’ list:
place 1place 2place 3place 4
FreeFreeFreeFree
2350 $1200 $1050 $900 $
Contract5841.1 Contract5841.2 Contract5841.3 Contract5841.4
1 place - free (for sale)
2 place - free (for sale)
3 place - free (for sale)
4 place - free (for sale)

Abstract:
Visual Question Answering (VQA) is a multi-modal challenging task that accepts an image and a natural language question about that image as inputs and desires to find the correct answer. This AI-complete task necessitates the fine-grained joint understanding of the two input modalities. Inspired by the success of attention mechanism in the task of efficient comprehension of visual-language features for VQA, this paper proposes a Multi-Tier Attention Network (MTAN) with the major component being term-weighted question-guided visual attention. Additionally, we introduce a novel Supervised Term Weighting (STW) scheme named ‘qf.obj.cos’ to semantically weight words utilizing the notion of visual object detection. This can be generalized to other vision-language comprehension tasks like image captioning, text-to-image-retrieval, multi-modal summarization etc. In effect, the proposed system allows the generation of more discriminative visual features from the progressive steps of question guided visual attention where question embedding is indeed guided by semantic term weighting. MTAN is quantitatively and qualitatively evaluated on the benchmark DAQUAR dataset and an extensive set of ablations are studied to demonstrate the individual significance of each of the components of the system.
Keywords:
Attention mechanism; Deep learning; Semantic similarity; Supervised term weighting; Visual Question Answering

Contacts :
0