Augmenting LLMs to Securely Retrieve Information for Corporate Real Estate Management

Generative AI has progressed rapidly over the past few years, driven largely by the emergence of the transformer architecture. This innovation has enabled highly capable language models that generate text, summarize content, and translate between languages with impressive accuracy.

Krütli, David, 2024

Type of thesis: Master Thesis
Supervisor: Hanne, Thomas
This master’s thesis explores the enhancement of large language models for secure information retrieval within the context of corporate real estate management. Facility managers often face challenges accessing critical data scattered across various documents, including manuals and operation instructions. This thesis introduces a retrieval-augmented generation system tailored to the dynamic needs of facility management, aiming to provide instant, accurate access to essential information.
The proposed system combines techniques from natural language processing and information retrieval. Specifically, the implementation uses the Mixtral 8x7B model for multilingual processing and the Milvus vector database for efficient storage and retrieval of documents. The implementation process includes steps such as indexing, chunking, enriching, and embedding document texts to support effective retrieval.
The dataset used for this thesis, provided by FHNW, comprises more than 2,500 documents in both structured and unstructured formats. These documents cover 12 different facilities and include images, operation manuals, inspection results, blueprints, technical drawings, and more, in various file formats. This diversity reflects the variety of information encountered in corporate real estate management.
The system’s performance was evaluated with queries related to three of the 12 facilities. The evaluation involved generating 30 question-answer pairs pertinent to facility management tasks and scoring the system’s responses with metrics such as ROUGE, BLEU, and semantic similarity. The methodology combined automated metrics with human-like assessments to gauge the accuracy, relevance, and coherence of the responses. The results showed high semantic similarity, indicating that the system can understand queries and generate relevant content, although lexical precision varied.
This work contributes to the field by addressing the gap between the capabilities of pre-trained language models and the specific confidentiality requirements of corporate data repositories, and it may serve as a benchmark for future applications in similar domains.
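The contrast between high semantic similarity and variable lexical precision can be illustrated with a small sketch of a ROUGE-1 style unigram overlap score. This is a simplified stand-in written for illustration, not the thesis's evaluation code: the whitespace tokenizer, the function names, and the example sentences are all assumptions.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; real ROUGE tooling
    # usually adds stemming and punctuation handling.
    return text.lower().split()


def rouge1(reference: str, candidate: str) -> dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (ROUGE-1 style)."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    # Multiset intersection clips each word at its count in the reference.
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical question-answer pair: same meaning, different wording.
reference = "the filter must be replaced every six months"
candidate = "replace the filter twice a year"
scores = rouge1(reference, candidate)
print(scores)
```

Only "the" and "filter" overlap here, so the lexical scores are low even though both sentences state the same maintenance interval. An embedding-based similarity measure would be expected to rate such a pair much higher, which is one plausible reading of why the two kinds of metric diverge.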
The findings suggest that retrieval-augmented generation systems can significantly enhance operational efficiency by reducing the time and effort required to access information while maintaining high security and data privacy standards.
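The chunking, embedding, and retrieval steps named in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the thesis's implementation: a bag-of-words count vector stands in for a learned embedding model, an in-memory list stands in for the Milvus vector database, and the chunk sizes, function names, and sample texts are hypothetical.

```python
import math
from collections import Counter


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size` that share `overlap` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    # Assumption: word counts as a stand-in for a dense embedding vector.
    return Counter(w.strip(".,;:?!()").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical facility documents, already short enough to be single chunks.
chunks = [
    "The ventilation filter must be replaced every six months.",
    "Fire extinguishers are inspected annually by a certified technician.",
    "The elevator emergency phone is tested monthly.",
]
top = retrieve("How often is the ventilation filter replaced?", chunks, k=1)
print(top[0])
```

In a retrieval-augmented generation setup, the retrieved chunks would then be placed in the language model's prompt as context for answering the query. Keeping both the document store and the model under the organization's own control is one way such a system can meet confidentiality requirements.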
Degree programme: Business Information Systems (Master)
Confidentiality: public
Type of thesis: Master Thesis
Authors: Krütli, David
Supervisor: Hanne, Thomas
Publication year: 2024
Language of thesis: English
Confidentiality: public
Degree programme: Business Information Systems (Master)
Programme location: Olten