SkillsAgent Pipeline

The primary objective of this research is to develop an AI-based pipeline, named "SkillsAgent," that transforms unstructured content from PDF documents into structured skills data, laying the groundwork for future integration into Scrambl’s AI-driven job matching and workforce development system.

Koller, Jakob, 2025

Type of thesis: Bachelor Thesis

Client: Scrambl. AG

Supervisors: Richards, Bradley; Pustulka, Elzbieta

Scrambl and the University of Applied Sciences and Arts Northwestern Switzerland (FHNW) launched a joint Innosuisse project to develop AI-driven tools for strategic onboarding and workforce development in regulated sectors. This thesis supports the initiative by building “SkillsAgent,” a pipeline that extracts structured skills data from Swiss vocational training documents, such as those listed on the official BECC platform (https://www.becc.admin.ch/becc/public/bvz/).

The development followed an iterative, prototype-driven approach. A process and system architecture diagram illustrated the pipeline’s structure and target state. Initial prototypes used local LLMs (deepseek-coder-v2:16b, llama3.1:8b), which were later replaced by the managed model GPT-4.1-mini. Core components such as PDF parsing, content chunking, vector embedding, and prompt engineering were analysed in depth. Once the pipeline was stable on individual cases, it was tested with 20 vocations to evaluate scalability and consistency.

The project revealed key challenges in parsing PDF content, caused by inconsistent document structures. To overcome them, Adobe’s PDF Extract API was used to analyse layout and structure, which made it possible to segment the content into meaningful, context-aware chunks. These chunks were matched using vector similarity and processed by a Large Language Model (LLM), which was guided by structured prompts to extract relevant skills data accurately and consistently. A comparison of local LLMs (e.g. deepseek-coder-v2:16b, llama3.1:8b) and managed models (GPT-4.1-mini) showed that, while the tested local models are well suited to early prototyping, managed models offer more reliable reasoning and better multilingual support. The pipeline was first tested sequentially on 20 vocational training nodes and then successfully executed on 541 vocational training nodes, resulting in the extraction of 11,102 structured skills. Overall, the project confirms the technical feasibility of using Spring Boot and Spring AI to build an agent that extracts structured skills data from unstructured PDFs. The final solution requires minimal input, such as a PDF URL, to generate structured knowledge for downstream use.
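The matching of chunks by vector similarity can be illustrated with a minimal sketch in plain Java. It is not the thesis implementation: the abstract does not state which similarity metric or vector store was used, so cosine similarity is an assumption here, and the names `ChunkRanker`, `Chunk`, `cosine`, and `topK` as well as the toy embeddings are hypothetical. No Spring AI or Adobe API calls are used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only: ranks text chunks by
// cosine similarity between their embeddings and a query embedding.
final class ChunkRanker {

    // A chunk of document text together with its embedding vector.
    record Chunk(String text, double[] embedding) {}

    // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the k chunks most similar to the query, most similar first.
    static List<Chunk> topK(double[] query, List<Chunk> chunks, int k) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingDouble((Chunk c) -> cosine(query, c.embedding()))
                .reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional embeddings; a real embedding model would
        // produce vectors with hundreds or thousands of dimensions.
        double[] query = {1.0, 0.0, 0.0};
        List<Chunk> chunks = List.of(
                new Chunk("Table of contents", new double[]{0.0, 1.0, 0.0}),
                new Chunk("Competence: plans work orders", new double[]{0.9, 0.1, 0.0}),
                new Chunk("Competence: maintains tools", new double[]{0.7, 0.3, 0.1}));

        // Only the best-matching chunks would be handed to the LLM prompt.
        for (Chunk c : topK(query, chunks, 2)) {
            System.out.println(c.text());
        }
    }
}
```

In a pipeline of the kind described, the selected chunks would then be placed into a structured prompt for the LLM. Limiting the prompt to the top-ranked chunks is one common way to keep the input focused on relevant content.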

Degree programme: Business Information Technology (Bachelor)

Keywords: Artificial Intelligence, AI Agent, Spring Boot, Spring AI, Large Language Model, Prompt Engineering, PDF Parsing, Content Chunking, Vector Embedding

Confidentiality: public

Client

Scrambl. AG, Kriens

Author

Koller, Jakob

Year of publication

2025

Language of thesis

English

Location of degree programme

Basel