Evaluating the accuracy of large language models for text summarization in Finnish
Ataei, Maryam (2025)
Master's thesis
School of Engineering Science, Computational Engineering
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2025052251458
Abstract
This thesis investigates the performance of large language models in Finnish text summarization by comparing three models: Poro, DeepSeek, and OpenAI o3-mini. Given the high degree of morphological inflection and the relatively free word order characteristic of Finnish, conventional evaluation metrics often fail to capture content equivalence. To address this, the study adopts chrF, BERTScore, and cosine similarity, metrics demonstrated to be more reliable for evaluating lexical and semantic similarity in low-resource, morphologically complex languages.
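As a concrete illustration, the sketch below scores a single candidate summary against its reference abstract with all three metrics. The library and model choices (sacrebleu for chrF, bert-score for BERTScore, and a multilingual sentence-transformers encoder for cosine similarity) are assumptions made for this example; the thesis's exact implementation is not specified in this abstract.

```python
# Minimal sketch: score one model-generated summary against a
# human-written abstract with chrF, BERTScore, and cosine similarity.
# Library and encoder choices here are illustrative assumptions.
from sacrebleu.metrics import CHRF
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util

candidate = "Mallin tuottama tiivistelmä ..."    # placeholder text
reference = "Ihmisen kirjoittama abstrakti ..."  # placeholder text

# chrF: character n-gram F-score, robust to Finnish inflection
# because it matches character sequences rather than whole tokens.
chrf = CHRF().sentence_score(candidate, [reference]).score

# BERTScore: token-level similarity in contextual embedding space;
# lang="fi" selects a multilingual encoder under the hood.
_, _, f1 = bertscore([candidate], [reference], lang="fi")

# Cosine similarity between whole-summary sentence embeddings
# (hypothetical choice of multilingual encoder).
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = encoder.encode([candidate, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(f"chrF: {chrf:.1f}  BERTScore-F1: {f1.item():.3f}  cosine: {cosine:.3f}")
```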
The evaluation is conducted on a curated set of Finnish academic articles. Each model-generated summary is compared against the corresponding human-written abstract. Furthermore, the impact of prompt engineering is examined by testing three prompt variations, ranging from generic to domain-specific, including expert-role conditioning. Empirical results show that DeepSeek consistently achieves the highest performance, OpenAI o3-mini performs moderately well, and Poro underperforms across all metrics. The structured prompt incorporating expert-role instructions yields the best results across all models, highlighting the importance of prompt formulation in low-resource contexts.
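The three prompt tiers could look like the hypothetical templates below. The thesis's actual Finnish prompt wording is not reproduced in this abstract, so these templates are placeholders that only show the generic → domain-specific → expert-role progression.

```python
# Hypothetical prompt templates illustrating the three tiers tested
# in the thesis; the actual wording is not given in the abstract.
PROMPTS = {
    # Generic: "Summarize the following text in Finnish."
    "generic": "Tiivistä seuraava teksti suomeksi:\n\n{article}",
    # Domain-specific: "Summarize the following Finnish scientific
    # article into a short abstract."
    "domain_specific": (
        "Tiivistä seuraava suomenkielinen tieteellinen artikkeli "
        "lyhyeksi abstraktiksi:\n\n{article}"
    ),
    # Expert-role: "You are an expert on Finnish scientific
    # publications. Write a structured, abstract-length summary."
    "expert_role": (
        "Olet suomenkielisten tieteellisten julkaisujen asiantuntija. "
        "Kirjoita artikkelista jäsennelty, abstraktin mittainen "
        "tiivistelmä suomeksi:\n\n{article}"
    ),
}

def build_prompt(variant: str, article: str) -> str:
    """Fill the selected template with the article body."""
    return PROMPTS[variant].format(article=article)
```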
These findings suggest that multilingual large language models, when guided with carefully designed prompts, can outperform smaller domain-specific models in Finnish summarization tasks. The study underscores the value of prompt engineering as a cost-effective strategy to enhance the performance of general-purpose models without additional fine-tuning.