TL;DR

We are facing an AI Safety Crisis. Traditional techniques, post-training and AI firewalls can't stop AI from causing harm. Realm monitors the AI's internal "mind palace" to catch harmful thoughts before they become words. We built "Sherlock"—a challenge to break our mind-reading defense. Can you prove us wrong?

The AI Safety Crisis

Large language models (LLMs) are trained on enormous corpus of data which makes them knowledgeable about a wide-variety of concepts. Advanced models can also combine independent harmless concepts to generate new ones without being explicitly trained on them [1].

This inevitably leads to models learning about harmful concepts which creates opportunities for misuse. Given its potential for causing harm, making AI safe is one of the most important challenges facing society.

Why can’t we stop AI from causing harm?

Once AI has gained harmful knowledge, preventing it from being accessed is a difficult task for current approaches.

Post-training techniques such as instruction-tuning and alignment, have been helpful in improving safety but have critical limitations: they don't remove harmful knowledge acquired from pre-training and aren’t fool-proof in preventing its leakage. Their gaps allow attackers to continuously develop novel jailbreaks to bypass them.

As models become robust to known attacks, attackers simply create new ones. Adding system prompt during inference is also vulnerable to prompt injections and loosely designed safety rules degrade performance on legitimate queries.

AI Firewalls are specialized models designed to block harmful inputs/outputs and have emerged as a popular solution in enterprises. They offer consistent policy enforcement across different models and can be customized for business-specific use cases. Major providers like Meta (LlamaGuard) and NVIDIA (NeMoGuard) now offer these firewalls, making them a standard part of the AI stack.

However, AI firewalls have their own limitations. They're typically small language models (SLMs) with limited accuracy compared to the sophisticated LLMs they're trying to guard. They are also plagued by the same training gaps as the original LLMs' alignment process. Additionally, they add undesirable latency and computational overheads.

These approaches create the same cat-and-mouse dynamic that plagues traditional cybersecurity: attackers adapt faster than defenses can keep up. Their black box nature also makes it hard to know when and why they fail.

Realm: Defending the Mind Palace

Realm is taking a different approach. Instead of trying to analyze what the LLM says, Realm understands how it thinks. Recent studies have shown that LLM’s organize information categorically inside their minds and how this organization structure can be learned by examining the model.

Realm’s defense largely works on this principle. We identify those areas of the LLM where harmful information is stored and keep a watch on when a user’s query makes the model access this part.

Our thesis is that Realm’s defense is hard to bypass as any jailbreak attempting to extract harmful information must trigger that part of the model’s mind. Our defense sets up a monitoring point close to the source of the knowledge allowing us to detect generation of harmful information way before it comes out from the model.

Invitation: Break Our Defenses

We will share more details about our research but in the meantime, we have decided to evaluate our hypothesis with the ultimate test: the community. We've built a CTF ("capture the flag") challenge “Sherlock” where anyone can attempt to bypass our defense: https://sherlock.realmlabs.ai/

We believe the red-teaming community possesses exceptional skills and creativity, and we're eager to see where our current methods succeed, and more importantly, where they fall short.

We're confident in our approach, can you prove us wrong?

Securing AI’s Mind

TL;DR

The AI Safety Crisis

Why can’t we stop AI from causing harm?

Realm: Defending the Mind Palace

Invitation: Break Our Defenses

Book Demo