Information Extraction from Financial Tables: Application and Evaluation of a Machine Learning Approach in Annual Reports

Publicly listed companies are legally obligated to publish financial reports, which are of interest to investors, analysts, regulators, and other stakeholders. The reports contain information about the company’s financial performance, strategy, and outlook. The financial data is typically presented in tables and is fundamental to valuation and decision-making, which requires high data quality. Automating financial data extraction from PDF reports, especially from tables, remains challenging.

Dimmler, Hans-Rudolf, 2025

Type of thesis: Master Thesis

Supervisor: Hanne, Thomas

In recent years, specialized deep-learning models have demonstrated promising results in extracting table information from PDFs. In addition, multi-module solutions have been developed that process complex PDF documents and match the extraction technique to each document component. Furthermore, Large Language Models (LLMs) have shown comprehensive language understanding. However, the performance of these new approaches has not yet been validated in an end-to-end process on a dataset of annual reports.

To explore these possibilities, we created our own dataset of eighty annual reports from large companies in North America and Europe. On this dataset, we evaluated the tasks of table detection, table information extraction, and table understanding in three experiment series. To evaluate the quality of table content extraction, we compared the content and structure of the extracted tables with the ground truth using adjacency relations. To evaluate the LLMs’ table understanding capabilities, we had them answer specific questions about the tables provided. In all experiments, we measured the results using precision, recall, and F1 score.
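The comparison based on adjacency relations can be illustrated with a minimal sketch. It assumes a common formulation in which each non-empty cell is linked to its nearest non-empty neighbour to the right and below, and the resulting relations of the extracted table are matched against those of the ground truth. The function names and the exact matching rule (exact string match after trimming whitespace) are assumptions for illustration, not the thesis’s implementation.

```python
from collections import Counter


def adjacency_relations(table):
    """Return a multiset of (cell, neighbour, direction) relations.

    `table` is a list of rows, each a list of cell strings. Empty cells are
    skipped, so every non-empty cell is linked to its nearest non-empty
    neighbour to the right and to its nearest non-empty neighbour below.
    """
    relations = Counter()
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            text = cell.strip()
            if not text:
                continue
            # Nearest non-empty neighbour to the right in the same row.
            for c2 in range(c + 1, len(row)):
                if row[c2].strip():
                    relations[(text, row[c2].strip(), "horizontal")] += 1
                    break
            # Nearest non-empty neighbour below in the same column.
            for r2 in range(r + 1, len(table)):
                if c < len(table[r2]) and table[r2][c].strip():
                    relations[(text, table[r2][c].strip(), "vertical")] += 1
                    break
    return relations


def precision_recall_f1(predicted, ground_truth):
    """Score a predicted table against the ground truth table."""
    pred = adjacency_relations(predicted)
    gold = adjacency_relations(ground_truth)
    # Multiset intersection counts each relation at most as often as it
    # appears in both tables.
    correct = sum((pred & gold).values())
    precision = correct / sum(pred.values()) if pred else 0.0
    recall = correct / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one wrongly extracted value breaks two relations.
gold = [["Item", "2023", "2022"], ["Revenue", "100", "90"]]
pred = [["Item", "2023", "2022"], ["Revenue", "100", "99"]]
print(precision_recall_f1(pred, gold))
```

In this formulation a single misread cell affects every relation it takes part in, so the score reflects both content errors and structural errors such as shifted columns.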

We found that the quality of table detection plays a crucial role in the subsequent table content extraction. The table content extraction methods applied to our dataset did not yet produce satisfactory results. To improve this, we developed an “LLM-enhanced table content extraction” process. Combining the extracted table with a text copy from the original PDF file and passing both to an LLM in a prompt significantly improved the F1 score of our table information extraction experiment series by 27.3 percentage points, from 63.95% to 91.25%, on our dataset. In the table understanding task, the LLMs still show variability in their answers and do not yet provide satisfactory results for quality-sensitive financial tasks.
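The idea of combining the extracted table with a text copy from the PDF can be sketched as a simple prompt builder. The abstract does not state the prompt that was actually used, so the wording, the function name `build_correction_prompt`, and the pipe-separated table format below are purely illustrative assumptions. The sketch stops at producing the prompt string and makes no call to any LLM.

```python
def build_correction_prompt(extracted_table, pdf_text):
    """Build a prompt asking an LLM to correct an extracted table.

    `extracted_table` is a list of rows (lists of cell strings) produced by
    a table extraction model; `pdf_text` is the plain text copied from the
    same page region of the PDF. Both names are hypothetical.
    """
    # Serialise the table row by row with a simple pipe separator.
    rows = [" | ".join(cell.strip() for cell in row) for row in extracted_table]
    table_block = "\n".join(rows)
    return (
        "The following table was extracted automatically from a PDF and may "
        "contain wrong values or misplaced cells.\n\n"
        "EXTRACTED TABLE:\n"
        f"{table_block}\n\n"
        "TEXT COPIED FROM THE SAME PDF REGION:\n"
        f"{pdf_text.strip()}\n\n"
        "Return the corrected table in the same pipe-separated format, "
        "using only values that appear in the copied text."
    )


# Hypothetical input: the extraction misread one value (99 instead of 90).
prompt = build_correction_prompt(
    [["Item", "2023", "2022"], ["Revenue", "100", "99"]],
    "Item 2023 2022\nRevenue 100 90",
)
print(prompt)
```

The text copy carries the correct characters while the extracted table carries the structure, which is why presenting both together gives the model a basis for repairing either one.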

Study program: Business Information Systems (Master)

Confidentiality: public

Publication year: 2025

Language of the thesis: English

Program location: Olten