SkillsAgent Pipeline

The primary objective of this research is to develop an AI-based pipeline, named "SkillsAgent," that transforms unstructured content from PDF documents into structured skills data, laying the groundwork for future integration into Scrambl’s AI-driven job matching and workforce development system.

Jakob Koller, 2025

Type of thesis: Bachelor Thesis
Client: Scrambl. AG
Supervisors: Richards, Bradley; Pustulka, Elzbieta
Scrambl and the University of Applied Sciences and Arts Northwestern Switzerland (FHNW) launched a joint Innosuisse project to develop AI-driven tools for strategic onboarding and workforce development in regulated sectors. This thesis supports the initiative by building “SkillsAgent,” a pipeline that extracts structured skills data from Swiss vocational training documents, such as those listed on the official BECC platform (https://www.becc.admin.ch/becc/public/bvz/).
The development followed an iterative, prototype-driven approach. A process and system architecture diagram illustrated the pipeline’s structure and target state. Initial prototypes used local LLMs (deepseek-coder-v2:16b, llama3.1:8b), which were later replaced by the managed model GPT-4.1-mini. Core components such as PDF parsing, content chunking, vector embedding, and prompt engineering were analysed in depth. Once the pipeline was stable on individual cases, it was tested with 20 vocations to evaluate scalability and consistency.
Parsing PDF content proved to be a key challenge because document structures are inconsistent. To overcome this, Adobe’s PDF Extract API was used to analyse layout and structure, so that content could be segmented into meaningful, context-aware chunks. These chunks were matched using vector similarity and passed to a Large Language Model (LLM), which was guided by structured prompts to extract relevant skills data accurately and consistently. A comparison of local LLMs (e.g. deepseek-coder-v2:16b, llama3.1:8b) with managed models (GPT-4.1-mini) showed that the tested local models are well suited to early prototyping, whereas managed models offer more reliable reasoning and better multilingual support.
The pipeline was first tested sequentially on 20 vocational training nodes and then successfully executed on 541 vocational training nodes, from which it extracted 11,102 structured skills. Overall, the project confirms the technical feasibility of using Spring Boot and Spring AI to build an agent that extracts structured skills data from unstructured PDFs. The final solution requires only minimal input, such as a PDF URL, to generate structured knowledge for downstream use.
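The step of matching chunks by vector similarity can be illustrated with a minimal sketch. It is written in Python for brevity and is not the thesis implementation, which was built with Spring Boot and Spring AI. The function names (`embed`, `cosine_similarity`, `top_chunks`), the sample chunks, and the query are hypothetical, and the bag-of-words vector is only a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical chunks, as they might come out of a layout-aware PDF segmentation.
chunks = [
    "Professional competences: plans and installs electrical systems",
    "Duration of training: four years including vocational school",
    "Professional competences: documents work and applies safety rules",
]

# Only the best-matching chunks would be forwarded to the LLM prompt.
selected = top_chunks("professional competences and skills", chunks, k=2)
```

In a setup like this, only the selected chunks would be placed into the structured prompt, which keeps the LLM input focused on the passages most likely to contain skills.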
Degree programme: Business Information Technology (Bachelor)
Keywords: Artificial Intelligence, AI Agent, Spring Boot, SpringAI, Large Language Model, Prompt Engineering, PDF Parsing, Content Chunking, Vector Embedding
Confidentiality: public
Client location: Kriens
Language of thesis: English
Location of degree programme: Basel