Design and implementation of a full stack automated software testing platform based on large language models
Zhang, Run (2026)
Bachelor's thesis
School of Engineering Science, Information Technology
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2026052856716
Abstract
Context: Software testing is a key step in ensuring software quality, but it remains labour-intensive and difficult to scale. Existing automated testing tools mostly generate test cases from rigid rules, which makes it difficult to capture application-layer behaviour, while current Large Language Model (LLM) approaches often lack execution-driven validation.
Objective: This study designs and implements a full-stack automated software testing platform based on LLMs. The platform closes the loop of test case generation, validation, and repair.
Method: The platform is built on a four-stage multi-agent pipeline. The first stage summarises each source file and groups the files into business-domain clusters. In the second stage, a planner translates each cluster into a detailed test plan. The third stage scores each task to guide model selection. In the final stage, specialised agents generate, execute, and repair test code autonomously. The platform was evaluated on two open-source projects.
Results: On the Flask tutorial application, the pipeline completed in 26 minutes, achieving a 66.7% task success rate and 91% code coverage. A supplementary experiment on Spring PetClinic showed that the task success rate dropped to 12.5% in the more complex Java environment. The primary causes of failure were framework API version mismatches and dependency configuration issues rather than deficiencies in the generation mechanism itself.
Conclusion: The results indicate that staged, multi-agent decomposition combined with execution-driven repair is a viable approach to autonomous test generation.
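As an illustration of the four-stage structure described under Method, the following is a minimal sketch and not the thesis implementation. All names (`TestTask`, `cluster_files`, `plan_tasks`, `pick_model`, `run_task`, `run_pipeline`, and the model labels) are hypothetical, and the LLM-driven steps are replaced by caller-supplied `generate`, `execute`, and `repair` functions. Scoring a task by the size of its cluster and capping repairs at a fixed number of attempts are assumptions made for the example only.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TestTask:
    cluster: str              # business-domain cluster the task belongs to
    description: str          # what the generated test should cover
    score: int = 0            # stage 3: score used for model selection
    model: str = ""           # model label chosen from the score
    passed: bool = False      # stage 4: did the test eventually pass?


def cluster_files(summaries: dict[str, str]) -> dict[str, list[str]]:
    """Stage 1 stand-in: each summary is assumed to already be a domain label."""
    clusters: dict[str, list[str]] = {}
    for path, domain in summaries.items():
        clusters.setdefault(domain, []).append(path)
    return clusters


def plan_tasks(clusters: dict[str, list[str]]) -> list[TestTask]:
    """Stage 2 stand-in: plan one test task per file in each cluster."""
    return [
        TestTask(cluster=domain, description=f"test {path}")
        for domain, paths in clusters.items()
        for path in paths
    ]


def pick_model(score: int) -> str:
    """Stage 3 stand-in: route higher-scoring tasks to a larger model."""
    return "large-model" if score >= 2 else "small-model"


def run_task(
    task: TestTask,
    generate: Callable[[TestTask], str],
    execute: Callable[[str], tuple[bool, str]],
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> bool:
    """Stage 4: generate test code, execute it, and repair it on failure."""
    code = generate(task)
    for attempt in range(max_attempts):
        ok, log = execute(code)
        if ok:
            task.passed = True
            break
        if attempt < max_attempts - 1:
            # Feed the execution log back so the next attempt can fix the error.
            code = repair(code, log)
    return task.passed


def run_pipeline(summaries, generate, execute, repair, max_attempts=3):
    """Run all four stages and return the tasks with the task success rate."""
    clusters = cluster_files(summaries)
    tasks = plan_tasks(clusters)
    for task in tasks:
        task.score = len(clusters[task.cluster])  # assumed scoring rule
        task.model = pick_model(task.score)
        run_task(task, generate, execute, repair, max_attempts)
    if not tasks:
        return tasks, 0.0
    success_rate = sum(t.passed for t in tasks) / len(tasks)
    return tasks, success_rate
```

The task success rate returned here is simply the fraction of planned tasks whose test code passed within the attempt limit, which is one plausible reading of the metric reported under Results.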
