Towards Robust VQA: Evaluations and Methods

Jyothi Unni, Suraj

Visual Question Answering (VQA) is an increasingly important multi-modal task where models must answer textual questions based on visual image inputs. Numerous VQA datasets have been proposed to train and evaluate models. However, existing benchmarks exhibit a unilateral focus on…

Visual Question Answering (VQA) is an increasingly important multi-modal task where models must answer textual questions based on visual image inputs. Numerous VQA datasets have been proposed to train and evaluate models. However, existing benchmarks exhibit a unilateral focus on textual distribution shifts rather than joint shifts across modalities. This is suboptimal for properly assessing model robustness and generalization. To address this gap, a novel multi-modal VQA benchmark dataset is introduced for the first time. This dataset combines both visual and textual distribution shifts across training and test sets. Using this challenging benchmark exposes vulnerabilities in existing models relying on spurious correlations and overfitting to dataset biases. The novel dataset advances the field by enabling more robust model training and rigorous evaluation of multi-modal distribution shift generalization. In addition, a new few-shot multi-modal prompt fusion model is proposed to better adapt models for downstream VQA tasks. The model incorporates a prompt encoder module and dual-path design to align and fuse image and text prompts. This represents a novel prompt learning approach tailored for multi-modal learning across vision and language. Together, the introduced benchmark dataset and prompt fusion model address key limitations around evaluating and improving VQA model robustness. The work expands the methodology for training models resilient to multi-modal distribution shifts.

Copyright Statement