Evaluating automated approaches for detecting privacy regulation non-compliance
Desai, Devarsh (2025)
Master's thesis
School of Engineering Science, Information Technology
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2025062674147
Abstract
With the introduction of privacy regulations such as GDPR and CCPA, organisations' policy documents have become lengthy and legally complex. This raises transparency challenges for users and even for compliance auditors, and creates a need for automated policy compliance solutions. This thesis tests the ability of natural language processing (NLP) models to classify policy segments and to flag high-risk segments. The project uses two datasets, OPP-115 and the MAPP corpus, comprising 179 privacy policies combined, and trains and tests two NLP models on them: TF-IDF + Regression Testing, and BERT.
The core objective of the models was to analyse the policies and classify each policy segment into one of three categories: first-party, third-party, or both. The evaluation also includes cross-validation and risk flagging. The results show that BERT consistently outperforms TF-IDF on all tasks, although it requires significantly more computing time and resources.
The results also show that although a context-aware model like BERT has the potential to support legal and compliance workflows, cross-dataset generalisation remains a major challenge for this project and the selected models.
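To make the TF-IDF baseline mentioned above more concrete, the following is a minimal sketch of how TF-IDF weights can be computed for policy segments. It is an illustration only: the thesis's actual preprocessing, feature settings, and classifier are not described in this abstract, and the function name `tfidf_vectors`, the whitespace tokenisation, the smoothed IDF variant, and the example segments are all assumptions made here.

```python
import math
from collections import Counter


def tfidf_vectors(segments):
    """Return one {term: weight} dict per segment (hypothetical helper).

    Uses relative term frequency and a smoothed inverse document
    frequency: idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [seg.lower().split() for seg in segments]
    n_docs = len(tokenized)

    # Document frequency: number of segments that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Made-up example segments, not taken from OPP-115 or the MAPP corpus.
segments = [
    "we collect your email address",
    "we share your data with advertising partners",
    "we collect and share your location",
]
vectors = tfidf_vectors(segments)
```

In this sketch, a term that occurs in every segment (such as "we") receives a lower weight than a term specific to one segment (such as "advertising"), which is the property a downstream classifier would rely on to separate first-party from third-party segments.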