Extraction of Table Information from Annual Reports Supported by CNN and Transformer-Based Approaches
Accurate data extraction from financial reports is essential for informed decision-making, regulatory compliance, and operational efficiency, yet automating this process remains technically challenging.
Lüthy, Elian, 2025
Type of work: Master Thesis
Client:
Supervising lecturer: Hanne, Thomas
Financial tables often feature multi-level headers, grouped categories, and implicit semantics that stretch the limits of current extraction pipelines. Existing literature largely focuses on synthetic or academic datasets, leaving a methodological gap between model development and real-world application. This thesis evaluates TFLOP, a state-of-the-art table extraction model, on a curated set of native PDF annual reports from Swiss companies.
The research design combines quantitative and qualitative evaluation: table detection is assessed with standard benchmark metrics (precision, recall, F1-score, and accuracy), while structure recognition is evaluated across 140 tables on header hierarchy, alignment, completeness, and text preservation. A preceding model review informed the selection of TFLOP as the most promising candidate. In table detection, TFLOP achieved perfect precision (1.000) in three of the four documents, with recall ranging from 0.864 to 1.000 and F1-scores between 0.927 and 1.000. Structure recognition showed strong overall performance but revealed weaknesses when layout elements, such as loosely grouped rows, were only partially preserved. A downstream reasoning task using GPT-4 confirmed that structural degradation negatively affects interpretability in aggregation and logic-based queries. Benchmark comparisons suggest that TFLOP performs at a state-of-the-art level under clear layout conditions but still struggles with the domain-specific variability found in financial disclosures. The thesis contributes an applied perspective by highlighting where current models fall short, and it points to layout preservation and structured evaluation as key areas for improvement.
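The detection metrics named above follow the standard definitions, which can be sketched in a few lines of Python. This is an illustrative sketch only, not the evaluation code used in the thesis: the `DetectionCounts` class and the counts in the example (19 detected tables, 0 false detections, 3 missed tables) are hypothetical, chosen to show how a document with perfect precision can still land at the lower end of the reported recall and F1 range. Accuracy is left out because it depends on how true negatives are defined for table detection, which the abstract does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Per-document table detection counts (hypothetical helper type)."""
    true_positives: int   # tables correctly detected
    false_positives: int  # detections that are not tables
    false_negatives: int  # tables the model missed


def precision(c: DetectionCounts) -> float:
    # Share of detections that are real tables.
    denom = c.true_positives + c.false_positives
    return c.true_positives / denom if denom else 0.0


def recall(c: DetectionCounts) -> float:
    # Share of real tables that were detected.
    denom = c.true_positives + c.false_negatives
    return c.true_positives / denom if denom else 0.0


def f1_score(c: DetectionCounts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one document, NOT taken from the thesis.
example = DetectionCounts(true_positives=19, false_positives=0, false_negatives=3)

print(f"precision = {precision(example):.3f}")  # precision = 1.000
print(f"recall    = {recall(example):.3f}")     # recall    = 0.864
print(f"F1        = {f1_score(example):.3f}")   # F1        = 0.927
```

With no false detections, precision stays at 1.000 regardless of how many tables are missed, so recall and F1 are the metrics that separate the documents in this kind of evaluation.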
Degree programme: Business Information Systems (Master)
Keywords
Confidentiality: public