End-to-End Table Extraction from Annual Reports using DL and NLP

Enhancing the retrieval of tabular data from PDFs using deep learning techniques and a natural language interface, with a particular focus on annual reports.

Mushkolaj, Rijon, 2024

Type of Thesis: Master Thesis
Client
Supervisor: Hanne, Thomas
Annual reports contain a great deal of important information, some of which is presented in tables. Extracting this table data poses several challenges, including the unstructured nature of PDF documents and the wide variability in how tables are represented. The aim of this master's thesis is to explore an innovative end-to-end solution that lets a user query tabular data in annual reports in PDF format through natural language input. The thesis addresses two main challenges: automatically extracting table data from unstructured PDF documents, and making this data accessible through natural language questions, for example by allowing the user to ask a question about the table content of an annual report such as "What was the profit in 2023?". The goal is to make information retrieval easier and more efficient.
After evaluating several candidate approaches, the thesis proposes an end-to-end process that incorporates recent technologies based on Deep Learning (DL), Machine Learning (ML), and Natural Language Processing (NLP).
The research findings indicate that while the defined process shows significant potential, it requires further refinement and fine-tuning to achieve optimal performance.
Study program: Business Information Systems (Master)
Keywords
Confidentiality: Public
Thesis Language
English
Location
Olten