Micro-gesture recognition via visual-text contrastive learning
Onyando, Socrates Waka (2024)
Master's thesis
School of Engineering Science, Computational Engineering
All rights reserved.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2024060342462
Abstract
Emotional artificial intelligence is a growing field that aims to equip machines with the ability to understand and respond to human emotions. Micro-gestures have been shown to be highly indicative of a person's true emotions, in contrast to modalities such as speech, facial expressions, and ordinary gestures, which can be easily manipulated to conceal one's real emotions. Recognizing micro-gestures is therefore crucial for building machines that can accurately detect human emotions. Most existing micro-gesture recognition methods rely on unimodal visual data and overlook accompanying textual information. While text labels can be helpful, they are expensive to acquire and, because they represent high-level categories, may miss the finer details crucial for understanding micro-gestures. This work proposes an approach to micro-gesture recognition that leverages recent advances in pre-trained multimodal models. The approach uses automatic text descriptions, generated by captioning models, to complement the visual data in a visual-text contrastive learning framework applicable in both supervised and self-supervised settings. It achieves state-of-the-art performance on the iMiGUE dataset, demonstrating its viability for recognizing micro-gestures.
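The visual-text contrastive learning mentioned in the abstract is commonly realised as a symmetric InfoNCE objective over paired visual and text embeddings. The sketch below is a minimal NumPy illustration of that general objective, not the thesis implementation: the function name, temperature value, and toy embeddings are all assumptions.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of `video_emb` is assumed to match row i of `text_emb`
    (e.g. a clip paired with its caption-model description).
    """
    # L2-normalise so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(v))           # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the video-to-text and text-to-video directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimising this loss pulls each clip embedding toward its own description while pushing it away from the other descriptions in the batch, which is what lets high-level text complement the fine-grained visual signal.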
