Evaluating enterprise product recommendation chatbot using LLM : the case of easy selection
Chakma, Kanak (2025)
Master's thesis
School of Engineering Science, Information Technology
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe20251215119533
Abstract
In recent years, the application of Large Language Models (LLMs) has grown significantly, particularly in evaluating open-domain chatbots. However, the evaluation of task-oriented chatbots has received limited attention, especially across different cognitive levels. Cognitive levels refer to progressively more complex forms of reasoning. This study investigated how an LLM (GPT-4) performs as an evaluator of task-oriented chatbots across three cognitive levels: remember, understand, and evaluate. A product-suggestion chatbot called Easy Selection served as the case study. GPT-4 was used to evaluate effectiveness and coherence, with coherence measured across the three cognitive levels. Agreement between GPT-4 and human raters was quantified using Cohen’s kappa, and score relationships were analysed using Spearman’s rho. The results indicate that GPT-4 aligned well with human evaluations at lower cognitive levels. However, agreement and correlation decreased as cognitive complexity increased, indicating limitations in evaluating higher-level reasoning. In conclusion, LLMs can be effective evaluators at simpler cognitive levels, but they struggle as the complexity of reasoning increases. Further research with a larger sample size, improved chatbot models, or different prompting techniques is needed to identify factors that can improve the effectiveness of LLMs in evaluating task-oriented chatbots across complex cognitive processes.
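For readers unfamiliar with the two statistics named in the abstract, the following is a minimal sketch of how Cohen's kappa and Spearman's rho can be computed for paired ratings. It is not the thesis's implementation: the function names, the three-point rating scale, and the sample ratings are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("scores must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one set of scores is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical coherence ratings on a 1-3 scale (not data from the thesis).
    human = [1, 2, 3, 3, 2, 1, 3, 2]
    gpt4 = [1, 2, 3, 2, 2, 1, 3, 3]
    print("kappa:", round(cohen_kappa(human, gpt4), 3))
    print("rho:", round(spearman_rho(human, gpt4), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, while rho only asks whether the two sets of scores rise and fall together, so the two can diverge. In practice, established library implementations would normally be preferred over hand-written ones.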
