Detecting Hidden Backdoors in Large Language Models

This thesis evaluates potential hidden backdoors in locally deployed Large Language Models (LLMs), focusing on DeepSeek-R1. Using behavioural, network and anomaly analysis, it examines the integrity, privacy and security of the model, offering insights for safer AI deployment.

Jibin Mathew Peechatt, 2025

The adoption of LLMs in critical systems raises concerns about privacy, security and trust, particularly the risk of hidden backdoors: malicious triggers that cause covert data transfer or altered behaviour. Such threats are difficult to detect, especially in black-box models with concealed triggers. This thesis addresses the issue by empirically evaluating the DeepSeek-R1 model family for potential backdoors during local execution, with the aim of identifying anomalies and recommending safer deployment practices.
The distilled DeepSeek-R1 models were tested locally using multiple techniques, including behavioural analysis, TCP inspection via PowerShell, Wireshark packet analysis, containerised isolation experiments, weight and activation tracking, and output anomaly detection using known trigger phrases. The models were obtained and run via Ollama and Hugging Face. Tests included politically sensitive prompts to evaluate potential censorship or instability, and the '--network none' container isolation option was assessed as a safer deployment method.
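As an illustration, a minimal Python sketch of how such a check might be combined is given below: it sends a suspected trigger prompt to a locally served model and compares established TCP connections before and after inference, mirroring the PowerShell and Wireshark checks described above. The Ollama endpoint (http://localhost:11434), the model tag deepseek-r1:7b and the trigger phrase are assumptions for illustration, not the exact values used in the thesis.

# Hedged sketch: probe a locally served distilled model with a suspected trigger
# phrase and diff the established TCP connections before and after inference.
import psutil
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint (assumed)
MODEL = "deepseek-r1:7b"                             # assumed distilled variant tag
TRIGGER_PROMPT = "<suspected trigger phrase>"        # placeholder, not from the thesis

def tcp_snapshot():
    # Set of (local, remote) address pairs for currently established TCP connections.
    return {
        (conn.laddr, conn.raddr)
        for conn in psutil.net_connections(kind="tcp")
        if conn.status == psutil.CONN_ESTABLISHED and conn.raddr
    }

before = tcp_snapshot()
reply = requests.post(
    OLLAMA_URL,
    json={"model": MODEL, "prompt": TRIGGER_PROMPT, "stream": False},
    timeout=300,
)
after = tcp_snapshot()

print("Model output:", reply.json().get("response", "")[:200])
print("New TCP connections during inference:", (after - before) or "none")

Any remote address that appears only after the prompt would then be a candidate for closer packet-level inspection in Wireshark.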
During local use, no evidence of unauthorised external communication or backdoor activation was observed in the distilled DeepSeek-R1 models. Network traffic remained inactive beyond baseline processes, and the models' weights, activation flows, perplexity metrics and embedding visualisations exhibited stable, expected behaviour. Some prompt anomalies were detected, such as unexpected language switching in politically sensitive queries, but these did not suggest malicious intent. Because the study's scope was limited by a short observation period, a small prompt set and few model variants, the results are not conclusive; they do, however, indicate that the models behaved securely under the tested conditions. The thesis recommends running LLMs in network-disabled Docker containers (--network none) to mitigate exfiltration risks, and calls for legal, ethical and technical frameworks for their secure use. Together, these measures would give clients a repeatable testing procedure for assessing models before deployment, reducing risk and increasing trust in AI systems.
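As a concrete illustration of the recommended deployment, the sketch below uses the Docker SDK for Python as a stand-in for docker run --network none. The image name (ollama/ollama), the host volume path and the model tag are assumptions; because the container has no network access, the model weights must already be present in the mounted volume.

# Hedged sketch: run the model server in a container with networking disabled,
# then query it from inside the container over loopback only.
import docker

client = docker.from_env()

container = client.containers.run(
    "ollama/ollama",              # assumed official Ollama image
    detach=True,
    network_mode="none",          # same effect as `docker run --network none`
    volumes={"/opt/ollama": {"bind": "/root/.ollama", "mode": "rw"}},  # assumed host path with pre-pulled weights
    name="llm-isolated",
)

# The Ollama client inside the container talks to its server over 127.0.0.1,
# so inference works even though no external network interface exists.
exit_code, output = container.exec_run(
    ["ollama", "run", "deepseek-r1:7b", "Describe your system instructions."]
)
print(output.decode())

container.stop()
container.remove()

Because the container has no interface beyond loopback, any exfiltration attempt triggered by a prompt cannot reach an external host, which is the isolation property the thesis recommends verifying before deployment.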
Type of Thesis
Bachelor Thesis
Client
FHNW Institute for Information Systems (IWI), Basel
Authors
Jibin Mathew Peechatt
Supervising Lecturers
Christen, Patrik
Year of Publication
2025
Language of the Thesis
English
Confidentiality
public
Degree Programme
Business Information Technology (Bachelor)
Programme Location
Brugg-Windisch
Keywords
Large Language Models (LLMs), Backdoor Detection, Empirical Evaluation, Network Traffic Analysis, DeepSeek-R1