Description
The advancement of Artificial Intelligence (AI) and deep learning networks has opened doors to many applications, especially in the computer vision domain. Image-based techniques now perform at or above human level, thanks to the millions of images available for training and GPU-based computing power. One challenge with these modern approaches is their reliance on vast amounts of labeled data and the prohibitive cost of acquiring suitable training datasets. Although techniques such as transfer learning allow models to be fine-tuned with much smaller datasets, data collection remains a tedious and costly task for many applications. Another challenge is the wide-ranging deployment of such AI systems in human-facing applications combined with the black-box nature of current deep learning techniques: there is a need for greater transparency and for systems designed with explainability in mind. Given the enormous impact AI may have on human lives and livelihoods, AI systems need to build trust with their human users and provide adequate feedback. Considering these inherent challenges of modern AI techniques, this research has focused in previous work on the specific case of gestural language, particularly American Sign Language (ASL). With most of the industry's interest directed at broad public-facing applications or topics such as autonomous cars and large language models (LLMs), there is a need to design frameworks and advance fundamental research for communities and applications that remain vastly underserved. One such community is the Deaf and Hard of Hearing (DHH), which uses gestural languages like ASL to communicate. ASL datasets tend to be small and expensive to collect, even though the language is complex and used by millions of people worldwide. This dissertation presents a gesture comprehension framework that decomposes gestures into conceptual semantic trees, enabling the incorporation of human-level concepts and semantic rules to improve recognition with limited data, synthesize new concepts, and enhance explainability. The framework is evaluated through zero-shot recognition of previously unseen gestures, automated feedback generation for ASL learners, and testing against military, aviation, human activity, ASL, and other gestural language datasets. The results show improved accuracy over some state-of-the-art methods without requiring large datasets, while incorporating human-level concepts, recognizing unseen examples, generating understandable feedback, and enhancing explainability.
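The abstract describes the framework only at a high level. As a purely illustrative sketch (not the dissertation's actual implementation), the snippet below shows one way a gesture could be decomposed into a tree of human-level concepts (e.g., handshape, location, movement, the standard phonological parameters of ASL signs) and matched against a known sign by concept overlap, which is the intuition behind zero-shot recognition of unseen gestures. All names here (`ConceptNode`, `similarity`, the HELLO decomposition) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConceptNode:
    """A node in a conceptual semantic tree for a gesture.

    `label` is a human-level concept (e.g. a handshape or movement),
    and `children` are finer-grained sub-concepts.
    """
    label: str
    children: List["ConceptNode"] = field(default_factory=list)

    def concepts(self) -> set:
        """Flatten the tree into the set of concept labels it contains."""
        out = {self.label}
        for child in self.children:
            out |= child.concepts()
        return out

def similarity(observed: ConceptNode, reference: ConceptNode) -> float:
    """Jaccard overlap between the concept sets of two trees (roots excluded).

    A stand-in for whatever matching rule the framework actually uses:
    an unseen gesture can still be scored as long as its parts map onto
    known concepts, which is the zero-shot idea.
    """
    a = observed.concepts() - {observed.label}
    b = reference.concepts() - {reference.label}
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical decomposition of the ASL sign HELLO into
# handshape / location / movement concepts.
hello_reference = ConceptNode("HELLO", [
    ConceptNode("handshape:flat-B"),
    ConceptNode("location:forehead"),
    ConceptNode("movement:outward-arc"),
])

# An observed, not-yet-labeled gesture decomposed into the same concept space.
observed = ConceptNode("unknown", [
    ConceptNode("handshape:flat-B"),
    ConceptNode("location:forehead"),
    ConceptNode("movement:outward-arc"),
])

print(f"match score vs HELLO: {similarity(observed, hello_reference):.2f}")
```

In this toy version, the score is simply set overlap; the dissertation's framework additionally applies semantic rules over the tree, but those rules are not specified in the abstract, so they are not reproduced here.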
Details
Title
- A Framework for Conceptual Gesture Comprehension Toward Accessible AI for All
Contributors
- Kamzin, Azamat (Author)
- Gupta, Sandeep K.S. (Thesis advisor)
- Banerjee, Ayan (Committee member)
- Lee, Kookjin (Committee member)
- Paudyal, Prajwal (Committee member)
- Arizona State University (Publisher)
Date Created
2024
Note
- Partial requirement for: Ph.D., Arizona State University, 2024
- Field of study: Computer Science