Crowdsourcing for large-scale data labelling
Ha, Duc Thanh Duong (2025)
Bachelor's thesis
School of Engineering Science, Information Technology
All rights reserved.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2025033122327
Abstract
The demand for large-scale labelled datasets has been increasing because of rapid advances in Artificial Intelligence (AI) and Machine Learning (ML). Crowdsourcing has become a widely adopted method for data labelling because it offers benefits such as accessibility, cost-efficiency, and diversity of opinion. However, the method is not without challenges. Because it depends on a large number of untrained workers, quality control and workforce management are its main problems. This study explores the key challenges in crowdsourced data annotation, from variability in task quality to the complexity of managing large-scale workforces. Addressing these issues is essential for future AI development, so the study reviews quality control mechanisms, including redundancy, gold-standard questions, statistical methods, and AI integration. We also evaluate the performance of an existing tool against the identified challenges. The thesis gives readers a comprehensive understanding of the challenges of large-scale crowdsourced data labelling, along with the solutions and tools designed to address them. The findings of the case study reveal the methods that the tool uses to address these challenges, while showing that there remains room for further improvement in crowdsourced data labelling.
