Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

The video question answering task requires reasoning over higher-level vision-language interactions. Unlike static image question answering, it poses questions not only about the appearance of objects but also about actions and causality.

Standard models struggle to analyze motion because the object detection models they build on lack temporal modeling. To address this, a recent study proposes Motion-Appearance Synergistic Networks (MASN) for video question answering.


The approach consists of three modules: motion, appearance, and motion-appearance fusion. First, the motion and appearance modules each build an object graph with graph convolutional networks (GCNs) to compute the relationships between objects in their visual features. Then, cross-modal grounding is performed between the output of the GCNs and the question features. Finally, the fusion module combines the two grounded representations. Experiments reported in the paper show that the architecture outperforms earlier models on the TGIF-QA and MSVD-QA datasets.
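To make the first two steps more concrete, here is a minimal sketch in plain Python of one graph-convolution step over object features followed by question-based attention pooling. It is an illustration under stated assumptions, not the authors' implementation: the similarity-based adjacency matrix, the single ReLU layer, the dot-product attention, and the helper names `graph_conv` and `ground_on_question` are all hypothetical choices made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(features, weight):
    """One graph-convolution step: ReLU(A @ X @ W).

    Assumption: the adjacency A is the row-wise softmax of X @ X^T,
    so every object attends to every other object by feature similarity.
    """
    transposed = [list(col) for col in zip(*features)]
    scores = matmul(features, transposed)          # pairwise similarities
    adjacency = [softmax(row) for row in scores]   # each row sums to 1
    mixed = matmul(adjacency, features)            # aggregate neighbours
    projected = matmul(mixed, weight)              # linear projection
    return [[max(0.0, v) for v in row] for row in projected]


def ground_on_question(nodes, question):
    """Pool node features with attention weights derived from the question.

    Assumption: plain dot-product attention stands in for the paper's
    cross-modal grounding step.
    """
    scores = [sum(n * q for n, q in zip(node, question)) for node in nodes]
    weights = softmax(scores)
    dim = len(nodes[0])
    pooled = [sum(w * node[d] for w, node in zip(weights, nodes)) for d in range(dim)]
    return pooled, weights


# Toy input: three objects with 2-dimensional features and an identity projection.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
question_vec = [1.0, 0.0]

nodes = graph_conv(objects, identity)
pooled, attention = ground_on_question(nodes, question_vec)
```

In this toy setup the attention weights form a probability distribution over the three objects, and the object whose updated feature aligns best with the question vector receives the largest weight. A real model would learn the projection weights and use learned question embeddings.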

Video Question Answering is a task that requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question’s intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at this https URL.
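The idea of question-guided fusion can be illustrated with a short sketch. The code below is a simplified illustration, not the fusion module described in the paper: it assumes that the motion, appearance, and question representations are already vectors of equal length, and it weights the two streams with scaled dot-product attention against the question. The function name `question_guided_fusion` is hypothetical.

```python
import math


def question_guided_fusion(motion, appearance, question):
    """Blend motion and appearance vectors according to the question.

    Assumption: each stream is scored by its scaled dot product with the
    question vector, and a softmax over the two scores gives the mix.
    """
    scale = math.sqrt(len(question))
    scores = [
        sum(c * q for c, q in zip(candidate, question)) / scale
        for candidate in (motion, appearance)
    ]
    # Softmax over the two stream scores (shifted for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [weights[0] * m + weights[1] * a for m, a in zip(motion, appearance)]
    return fused, weights


# Toy example: the question vector aligns with the motion stream.
fused, weights = question_guided_fusion(
    motion=[1.0, 0.0],
    appearance=[0.0, 1.0],
    question=[2.0, 0.0],
)
```

With this toy input the motion stream receives the larger weight, which mirrors the intuition that an action-oriented question should lean on motion features while an appearance-oriented question should lean on appearance features.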

Research paper: Seo, A., Kang, G.-C., Park, J., and Zhang, B.-T., “Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering”, 2021. Link:
