Full metadata
Title
Language Image Transformer
Description
Humans perceive the environment using multiple modalities such as vision, speech (language), touch, taste, and smell. The knowledge obtained from one modality usually complements the others, and learning through several modalities helps in constructing an accurate model of the environment. Most current vision and language models are modality-specific and, in many cases, rely extensively on deep-learning-based attention mechanisms for learning powerful representations. This work discusses the role of attention in associating vision and language to generate a shared representation. The Language Image Transformer (LIT) is proposed for learning multi-modal representations of the environment. It uses a training objective based on Contrastive Predictive Coding (CPC) to maximize the Mutual Information (MI) between the visual and linguistic representations, and it learns the relationship between the modalities using the proposed cross-modal attention layers. It is trained and evaluated on the captioning datasets MS COCO and Conceptual Captions. The results and the analysis offer a perspective on the use of Mutual Information Maximization (MIM) for generating generalizable representations across multiple modalities.
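The CPC-based objective mentioned above is typically realized as an InfoNCE loss: matched image–text pairs in a batch are treated as positives and all other pairings as negatives, which yields a lower bound on the mutual information between the two modalities. The thesis's exact formulation is not reproduced in this record, so the following is only a generic sketch of such a contrastive objective; the function and argument names are illustrative assumptions, not the author's code.

```python
import numpy as np

def info_nce_loss(image_feats, text_feats, temperature=0.07):
    """Generic InfoNCE sketch: row i of image_feats and text_feats
    is a positive pair; every other row in the batch is a negative.
    Minimizing this loss maximizes a lower bound on the MI between
    the image and text representations."""
    # L2-normalize so the pairwise score is cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    # Cross-entropy with the diagonal (matched pair) as the target class.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Under this sketch, aligned pairs drive the loss toward zero while mismatched batches stay near `log(N)`, which is what makes the bound on mutual information tighten as training progresses.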
Date Created
2020
Contributors
- Ramakrishnan, Raghavendran (Author)
- Panchanathan, Sethuraman (Thesis advisor)
- Venkateswara, Hemanth Kumar (Thesis advisor)
- McDaniel, Troy (Committee member)
- Arizona State University (Publisher)
Topical Subject
Resource Type
Extent
72 pages
Language
eng
Copyright Statement
In Copyright
Primary Member of
Peer-reviewed
No
Open Access
No
Handle
https://hdl.handle.net/2286/R.I.57068
Level of coding
minimal
Note
Masters Thesis Computer Engineering 2020
System Created
- 2020-06-01 08:07:03
System Modified
- 2021-08-26 09:47:01