CNN-transformer for face image super-resolution
Zhang, Mengfei (2025)
Bachelor's thesis
School of Engineering Science, Information Technology
All rights reserved.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2025051442644
Abstract
Face image super-resolution (FISR) enhances image quality and detail, improving the performance and reliability of face recognition. It is widely used in fields such as video surveillance, biometrics, film, and digital media. Existing CNN-based and Transformer-based methods struggle to jointly optimize detail recovery and global structure preservation in FISR tasks, and to balance computational efficiency against reconstruction accuracy. This thesis introduces CNN-T, a real-time FISR model built on a hybrid architecture that combines a CNN and a Transformer. The goal is to preserve the consistency of facial anatomical structure while enhancing local high-frequency details, and to improve the real-time performance of FISR.
The proposed CNN-T introduces a Local-Global Feature Cooperation Module (LGCM) to extract multi-level features in a coordinated manner: a Facial Structure Attention Unit (FSAU) focuses on the details of facial features, and the Transformer establishes long-range dependencies. Combining the two unifies local details with global structure. Multi-Dconv Head Transposed Attention (MDTA) is introduced into the Transformer, combining depthwise separable convolution and pointwise convolution to reduce computational complexity and achieve real-time performance. To assess the performance of CNN-T, extensive experiments are carried out on the CelebA and Helen datasets, and the model is benchmarked against several previous methods using both quantitative metrics and visual comparisons.
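The complexity argument behind MDTA can be illustrated with a rough operation count. The sketch below is not taken from the thesis. It assumes that transposed attention builds its attention map across channels rather than across pixels, and that a depthwise separable convolution is a per-channel k x k filter followed by a 1 x 1 pointwise convolution. The function names, the 128 x 128 feature map, and the 48 channels are hypothetical values chosen only for illustration, and biases, softmax, and projection layers are ignored.

```python
def spatial_attention_macs(height, width, channels):
    """Multiply-accumulates for attention computed over pixels.

    Q (n x c) times K^T (c x n) gives an n x n map, which is then
    applied to V (n x c). Both products cost n * n * c.
    """
    n = height * width
    return 2 * n * n * channels


def transposed_attention_macs(height, width, channels, heads=1):
    """Multiply-accumulates for attention computed over channels.

    Per head, Q^T (c_h x n) times K (n x c_h) gives a c_h x c_h map,
    which is then applied to V (c_h x n). Both products cost c_h * c_h * n.
    """
    n = height * width
    c_head = channels // heads
    return heads * 2 * c_head * c_head * n


def standard_conv_params(c_in, c_out, k):
    """Weights in a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k plus pointwise 1 x 1 pair (bias ignored)."""
    return c_in * k * k + c_in * c_out


if __name__ == "__main__":
    # Hypothetical feature-map size and width, not values from the thesis.
    h, w, c = 128, 128, 48
    print("spatial attention MACs:   ", spatial_attention_macs(h, w, c))
    print("transposed attention MACs:", transposed_attention_macs(h, w, c))
    print("standard 3x3 conv weights:", standard_conv_params(c, c, 3))
    print("separable 3x3 conv weights:", depthwise_separable_params(c, c, 3))
```

With a single head, the ratio between the two attention costs is the number of pixels divided by the number of channels, so the saving grows with image resolution. This is one plausible reading of why a channel-wise attention design suits real-time super-resolution.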
