Detecting Hidden Backdoors in Large Language Models

This thesis evaluates potential hidden backdoors in locally deployed Large Language Models (LLMs), focusing on DeepSeek-R1. Using behavioural, network and anomaly analysis, it examines the integrity, privacy and security of the model, offering insights for safer AI deployment.

Jibin Mathew Peechatt, 2025

The adoption of LLMs in critical systems raises concerns about privacy, security and trust, particularly the risk of hidden backdoors: malicious triggers that cause covert data transfer or altered behaviour. Such threats are difficult to detect, especially in black-box models with concealed triggers. This thesis addresses the issue by empirically evaluating the DeepSeek-R1 model family for potential backdoors during local execution, with the aim of identifying anomalies and recommending safer deployment practices.
The distilled DeepSeek-R1 models were tested locally using multiple techniques, including behavioural analysis, TCP inspection via PowerShell, Wireshark packet analysis, containerised isolation experiments, weight and activation tracking, and output anomaly detection using known trigger phrases. The models were obtained and run via Ollama and Hugging Face. Tests included politically sensitive prompts to evaluate potential censorship or instability, and the '--network none' container isolation option was assessed as a safer deployment method.
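As an illustration, a minimal Python sketch of how such a check might be combined is given below: it sends a suspected trigger prompt to a locally served model and compares established TCP connections before and after inference, mirroring the PowerShell and Wireshark checks described above. The Ollama endpoint (http://localhost:11434), the model tag deepseek-r1:7b and the trigger phrase are assumptions for illustration, not the exact values used in the thesis.

# Hedged sketch: probe a locally served distilled model with a suspected trigger
# phrase and diff the established TCP connections before and after inference.
import psutil
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint (assumed)
MODEL = "deepseek-r1:7b"                             # assumed distilled variant tag
TRIGGER_PROMPT = "<suspected trigger phrase>"        # placeholder, not from the thesis

def tcp_snapshot():
    # Set of (local, remote) address pairs for currently established TCP connections.
    return {
        (conn.laddr, conn.raddr)
        for conn in psutil.net_connections(kind="tcp")
        if conn.status == psutil.CONN_ESTABLISHED and conn.raddr
    }

before = tcp_snapshot()
reply = requests.post(
    OLLAMA_URL,
    json={"model": MODEL, "prompt": TRIGGER_PROMPT, "stream": False},
    timeout=300,
)
after = tcp_snapshot()

print("Model output:", reply.json().get("response", "")[:200])
print("New TCP connections during inference:", (after - before) or "none")

Any remote address that appears only after the prompt would then be a candidate for closer packet-level inspection in Wireshark.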
During local use, no evidence of unauthorised external communication or backdoor activation was observed in the distilled DeepSeek-R1 models. Network traffic remained inactive beyond baseline processes, and the models' weights, activation flows, perplexity metrics and embedding visualisations exhibited stable, expected behaviour. Some prompt anomalies were detected, such as unexpected language switching in politically sensitive queries, but these did not suggest malicious intent. Because the study's scope was limited by a short observation period, a small prompt set and few model variants, the results are not conclusive; they do, however, indicate that the models behaved securely under the tested conditions. The thesis recommends running LLMs in network-disabled Docker containers (--network none) to mitigate exfiltration risks, and calls for legal, ethical and technical frameworks for their secure use. Together, these measures would give clients a repeatable testing procedure for assessing models before deployment, reducing risk and increasing trust in AI systems.
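As a concrete illustration of the recommended deployment, the sketch below uses the Docker SDK for Python as a stand-in for docker run --network none. The image name (ollama/ollama), the host volume path and the model tag are assumptions; because the container has no network access, the model weights must already be present in the mounted volume.

# Hedged sketch: run the model server in a container with networking disabled,
# then query it from inside the container over loopback only.
import docker

client = docker.from_env()

container = client.containers.run(
    "ollama/ollama",              # assumed official Ollama image
    detach=True,
    network_mode="none",          # same effect as `docker run --network none`
    volumes={"/opt/ollama": {"bind": "/root/.ollama", "mode": "rw"}},  # assumed host path with pre-pulled weights
    name="llm-isolated",
)

# The Ollama client inside the container talks to its server over 127.0.0.1,
# so inference works even though no external network interface exists.
exit_code, output = container.exec_run(
    ["ollama", "run", "deepseek-r1:7b", "Describe your system instructions."]
)
print(output.decode())

container.stop()
container.remove()

Because the container has no interface beyond loopback, any exfiltration attempt triggered by a prompt cannot reach an external host, which is the isolation property the thesis recommends verifying before deployment.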
Type of Thesis
Bachelor Thesis
Client
FHNW Institute for Information Systems (IWI), Basel
Authors
Jibin Mathew Peechatt
Supervising Lecturers
Christen, Patrik
Year of Publication
2025
Language of the Thesis
English
Confidentiality
public
Degree Programme
Business Information Technology (Bachelor)
Programme Location
Brugg-Windisch
Keywords
Large Language Models (LLMs), Backdoor Detection, Empirical Evaluation, Network Traffic Analysis, DeepSeek-R1