Video Captioning with Commonsense Knowledge Anchors
Description
A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among those entities. It remains challenging for a video captioning system to generate natural language descriptions that focus on the prominent points of interest and align with latent aspects beyond direct observation. This work presents a Commonsense knowledge Anchored Video cAptioNing (dubbed CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, commonsense knowledge is retrieved for each training caption by querying the generic knowledge atlas ATOMIC, forming a commonsense-caption entailment corpus. A BERT-based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, penalizing it for generating semantically misaligned captions. In extensive empirical evaluations on the MSR-VTT, V2C, and VATEX datasets, CAVAN consistently improves the quality of generated captions and achieves a higher keyword hit rate. Ablation results further validate the effectiveness of CAVAN and reveal that commonsense knowledge contributes to video caption generation.
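To make the discriminator idea concrete, the following is a minimal sketch rather than the authors' released code: a publicly available NLI checkpoint (roberta-large-mnli, an assumption standing in for the BERT-based discriminator CAVAN trains on its own commonsense-caption entailment corpus) scores whether a caption entails an ATOMIC-style inference, and the entailment probability is turned into a penalty that could be weighted into the captioning loss. The helper name commonsense_penalty, the example caption-inference pair, and the weight lambda_cs are all illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: a public NLI checkpoint stands in for CAVAN's own
# discriminator, which is trained on the commonsense-caption corpus.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
discriminator = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
discriminator.eval()

# Locate the entailment label from the checkpoint's own config rather
# than hard-coding a label index.
ENTAIL_IDX = next(i for i, name in discriminator.config.id2label.items()
                  if "entail" in name.lower())

def commonsense_penalty(caption: str, inference: str) -> float:
    """Penalty that grows as the caption fails to entail the commonsense
    inference retrieved for its training caption (hypothetical helper)."""
    inputs = tokenizer(caption, inference, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = discriminator(**inputs).logits
    p_entail = logits.softmax(dim=-1)[0, ENTAIL_IDX].item()
    return 1.0 - p_entail  # high penalty when entailment probability is low

# Hypothetical training-time use: fold the penalty into the captioning loss,
# e.g. loss = xent_loss + lambda_cs * commonsense_penalty(generated, retrieved)
print(commonsense_penalty(
    "a man slices vegetables in a kitchen",
    "as a result, he wants to cook a meal"))
```

Note that in an actual training loop the generated caption is a sequence of discrete tokens, so gradients would not flow through this score directly; a policy-gradient style update or a differentiable relaxation would be needed, and the abstract does not specify which mechanism CAVAN uses.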
Date Created
The date the item was originally created (prior to any relationship with the ASU Digital Repositories).
2022
Agent
- Author (aut): Shao, Huiliang
- Thesis advisor (ths): Yang, Yezhou
- Committee member: Jayasuriya, Suren
- Committee member: Xiao, Chaowei
- Publisher (pbl): Arizona State University