Jailbreak-Eval

18 September 2025

If you deploy a language model in production, you need to know where it breaks. Jailbreak-Eval is a framework I built for systematic adversarial testing of LLMs. It throws structured attacks at a model and measures how well it holds up.

The framework implements five distinct attack strategies. GCG generates adversarial suffixes through gradient-based optimisation. PAIR uses an attacker LLM to iteratively refine prompts that bypass the target model’s safety training. The remaining three strategies use mutation, persona injection, and multi-turn escalation. Each approach targets a different class of vulnerability.

Evaluation uses an ensemble of three methods. A keyword classifier catches obvious failures. A fine-tuned safety classifier handles subtler cases. An LLM judge assesses responses that fall into grey areas where neither automated method is confident. The three scores are combined into a weighted safety rating for each test case.

The multi-agent swarm is the part I find most interesting architecturally. Five specialised agents share a memory pool and coordinate their attacks. One agent might discover that a particular phrasing partially bypasses a safety filter, and the others can build on that discovery in subsequent rounds. This mimics how real adversaries work: iteratively, adaptively, and with shared context.

I built the system to work across multiple LLM providers: OpenAI, Anthropic, Groq, and locally-hosted models through Ollama. The Streamlit dashboard shows results in real time as attacks run, with breakdowns by strategy, severity, and model.

The purpose is defensive. You cannot patch what you have not measured. Running Jailbreak-Eval against a model before deployment surfaces the specific failure modes that need attention. It is not a guarantee of safety, but it is a structured alternative to guessing.