Natural Language Processing and Rule Extraction for Document Analysis: An Analysis of NLP Techniques for Information Extraction and Python Implementation
This thesis studies an automated letter template generator for KWSOFT. Limited data led to the use of unsupervised methods, while supervised techniques and language models show promise. The programmed rule engine could achieve savings of 7.04%, while the proposed extensions could save around 28.16%.
Leonardo Bollazzi, 2023
Bachelor Thesis, KWSOFT
Supervisor: Stephan Jüngling
Keywords: NLP, AI, Document Management, Association analysis, Clustering
Present situation: Companies using legacy systems struggle to reach their objectives because of limited features and outdated technology. The project's primary aim is therefore to develop an easy-to-use solution that lets KWSOFT's customers generate letter templates. The client wants to automate template creation by using AI techniques to extract rules and patterns from existing letters and their corresponding XML files. The application should be able to differentiate between:
• Variable text
• Input fields
• Rules (components dependent on data properties)
The project comprises a feasibility study and a proof of concept.
The feasibility study involved extensive research on supervised and unsupervised ML methods, generative models, association analysis, and preprocessing techniques. This research aimed to identify the most suitable solution in terms of benefits and risks and to devise risk mitigation measures.
In the proof of concept, Python code was developed to implement the researched techniques, validating the feasibility determined earlier. The proof of concept aimed to demonstrate the practical application of the proposed solution.
The research explored supervised and unsupervised ML methods, as well as generative language models. Limited access to labeled data hindered the use of supervised methods, while language models lacked sufficient training data. Unsupervised methods were employed, but challenges arose in handling outliers and identifying letter variations.
The proof of concept had two stages. The first used K-means clustering to cluster paragraphs, optimizing K with silhouette scores and employing preprocessing techniques. The second stage involved rule extraction using the Apriori method and introduced a novel concept for recognizing categorical values.
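The second stage can be illustrated with a minimal, pure-Python sketch of Apriori-style rule extraction. This is not the thesis code: the item names (data properties such as `customer_type=business` and paragraph cluster labels such as `para:greeting_formal`), the toy letters, and the support and confidence thresholds are all hypothetical and chosen only to show how a rule of the form "data property implies paragraph" could be derived.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that occur often enough.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def extract_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for subset in combinations(itemset, size):
                antecedent = frozenset(subset)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (antecedent, itemset - antecedent, support, confidence)
                    )
    return rules


# Hypothetical letters: each one is the set of its data properties plus
# the labels of the paragraph clusters it contains.
letters = [
    {"customer_type=private", "lang=de", "para:greeting_informal"},
    {"customer_type=private", "lang=de", "para:greeting_informal"},
    {"customer_type=business", "lang=de", "para:greeting_formal"},
    {"customer_type=business", "lang=en", "para:greeting_formal"},
    {"customer_type=private", "lang=en", "para:greeting_informal"},
    {"customer_type=business", "lang=de", "para:greeting_formal"},
]

frequent_itemsets = apriori(letters, min_support=0.3)
rules = extract_rules(frequent_itemsets, min_confidence=0.9)

for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={support:.2f} confidence={confidence:.2f}")
```

In this toy data, every business letter contains the formal greeting, so the sketch reports a rule from `customer_type=business` to `para:greeting_formal`. It also reports the reverse direction, so a real rule engine would additionally need to restrict antecedents to data properties and consequents to paragraphs.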
The economic feasibility analysis showed potential cost savings of 7.04% with the current rule engine. The study presented recommendations for future improvements, including an extended rule engine capable of handling variations, user interface enhancements, evaluation of a language model prototype, and the implementation of supervised techniques. The extended version demonstrated potential cost savings of approximately 28.16%.
The study also proposed exploring parallelization concepts using tools like the MPI4py library for further advancements.
Degree programme: Business Information Technology (Bachelor)
Subject area of the thesis: