Multimodal large language model on human motion understanding
Li, Xinrui (2025)
Master's thesis
Li, Xinrui
2025
School of Engineering Science, Computational Engineering
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2025062573676
Abstract
The field of human action understanding (HAU) faces the challenge of generating natural and semantically accurate language descriptions from 3D motion data, a difficulty that stems largely from the limited quality of existing dataset annotations and the templated nature of model outputs. This thesis aims to improve the performance of motion-to-text (M2T) models through a framework that focuses on improving the quality of the training data and the controllability of the language generation process. It introduces the Semantically-Driven and Instruction-Tuned Motion-Integrated Language Solver (MILS++), which builds on the Multimodal Iterative LLM Solver (MILS). First, a large language model (LLM) automatically refines the text descriptions in the HumanML3D dataset to create a high-quality corpus. Second, the motion data encoded by a motion tokenizer is combined with the refined text into a unified token sequence. Finally, a pre-trained text-to-text Transformer (T5) model is fine-tuned on this unified data, with natural language instructions guiding the generation process. Experiments on the HumanML3D benchmark show that the proposed MILS++ achieves consistent improvements over the MotionGPT baseline across a range of evaluation metrics, including R-Precision, CIDEr, and BERTScore. The results confirm that improving the quality of text descriptions and adopting instruction-based fine-tuning are effective, low-cost strategies for increasing the accuracy and naturalness of action-based language generation, and they lay a scalable foundation for future research in multimodal action understanding.
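To make the instruction-tuned motion-to-text step concrete, the sketch below shows one plausible way to implement it with the Hugging Face transformers library: discrete motion-tokenizer codes are rendered as placeholder tokens, wrapped in a natural-language instruction, and paired with an LLM-refined caption as the fine-tuning target for T5. This is a minimal illustration only, not the thesis code; the codebook size, the placeholder token names (<motion_i>), the instruction wording, and the example data are all assumptions.

```python
# Minimal sketch of instruction-tuned motion-to-text fine-tuning with T5.
# Assumptions: a motion tokenizer has already produced discrete code IDs,
# the codebook size is 512, and captions have been refined by an LLM.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Register placeholder tokens for the motion codebook and resize embeddings.
motion_vocab = [f"<motion_{i}>" for i in range(512)]
tokenizer.add_tokens(motion_vocab, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

def build_example(motion_ids, refined_caption):
    """Wrap motion-token IDs in an instruction and pair with the target caption."""
    motion_str = " ".join(f"<motion_{i}>" for i in motion_ids)
    prompt = f"Describe the human motion in natural language: {motion_str}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(refined_caption, return_tensors="pt", truncation=True).input_ids
    return inputs, labels

# One illustrative training step with a fabricated motion sequence.
inputs, labels = build_example([17, 42, 42, 301], "a person walks forward and waves")
loss = model(**inputs, labels=labels).loss
loss.backward()  # in practice this sits inside a full training loop or Trainer
```

Framing the motion tokens inside an instruction, rather than feeding them to the model bare, is what allows the same fine-tuned T5 checkpoint to be steered toward different output styles at inference time simply by changing the instruction text.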