Heretic: Fully Automatic Censorship Removal for Large Language Models
- Aisha Washington

- Nov 19
- 5 min read

For years, the open-source AI community has been locked in a cycle of manual tweaking. Removing the safety guardrails—often called "alignment" or simply censorship—from a model like Llama or Gemma was a craftsman's job. It required intuition, trial and error, and a deep understanding of model weights. Now, a new tool named Heretic is changing the landscape by promising fully automatic censorship removal for language models.
This isn't just another script for changing a system prompt. Heretic is a complete software solution that automates the complex process of abliteration (directional ablation). By treating censorship removal as a mathematical optimization problem, it allows users to decensor models with a single command, potentially rivaling the quality of human experts without the manual labor.
From Manual Tuning to Automated Abliteration

To understand why Heretic is significant, you have to understand the problem it solves. Safety alignment typically involves fine-tuning a model to refuse certain requests. While this prevents harm, it often cripples the model's utility, leading to "I cannot fulfill this request" responses even for benign queries.
The counter-measure is abliteration. This technique involves identifying the specific directions in the model's neural network that correspond to "refusal" and mathematically removing them. Historically, this was done by hand: a developer would estimate a refusal direction, ablate it from the weights, and test the result. Push too hard, and the model's general capabilities collapsed (the community calls this "lobotomization"). Push too gently, and the censorship remained.
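For intuition, here is a minimal sketch of the directional-ablation idea using synthetic activations and NumPy. The difference-of-means estimate and the projection step are the textbook recipe, not Heretic's exact implementation, and the data is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (toy size; real models use thousands)

# Synthetic stand-ins for residual-stream activations collected at one layer.
# "Harmful" prompts are shifted along coordinate 0 to simulate a refusal signal.
harmful_acts = rng.normal(size=(100, d)) + 2.0 * np.eye(d)[0]
harmless_acts = rng.normal(size=(100, d))

# 1. Estimate the "refusal direction" as the difference of mean activations.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# 2. Ablate: project the direction out of a weight matrix, W' = (I - r r^T) W,
#    so the layer can no longer write into the refusal direction.
W = rng.normal(size=(d, d))  # toy stand-in for e.g. an output projection
W_ablated = W - np.outer(refusal_dir, refusal_dir @ W)

# Any input now yields an output with (near-)zero component along the direction.
x = rng.normal(size=d)
print(abs((W_ablated @ x) @ refusal_dir) < 1e-9)  # True
```

The same projection is applied to matrices throughout the network; the hard part, which Heretic automates, is choosing which layers to touch and how strongly.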
Heretic changes this dynamic by using stochastic parameter optimization. Instead of a human guessing the right weights, the software runs a benchmark. It loads datasets of harmful and harmless prompts and uses an optimizer (powered by Optuna) to find the perfect balance. It effectively automates the search for the "Goldilocks zone"—where refusals are minimized, but the model's brain remains intact.
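Heretic drives this search with Optuna; the loop below sketches the same idea with a stdlib-only random search. `evaluate` is a made-up stand-in for running a candidate model against the harmful and harmless prompt sets, and the single `ablation_strength` knob is illustrative (the real search tunes several parameters):

```python
import random

# Fake objective: more ablation means fewer refusals but more "brain damage."
def evaluate(ablation_strength: float) -> tuple[float, float]:
    refusals = max(0.0, 1.0 - ablation_strength)   # refusal rate on harmful prompts
    kl_divergence = ablation_strength ** 2         # drift on harmless prompts
    return refusals, kl_divergence

random.seed(0)
best_params, best_cost = None, float("inf")
for _ in range(200):                               # one "trial" per iteration
    strength = random.uniform(0.0, 2.0)
    refusals, kl = evaluate(strength)
    cost = refusals + kl                           # scalarized co-minimization
    if cost < best_cost:
        best_params, best_cost = strength, cost

print(f"best ablation strength ~ {best_params:.2f}, cost ~ {best_cost:.2f}")
```

The search converges on an intermediate strength (the "Goldilocks zone"): zero ablation leaves all refusals in place, while maximal ablation blows up the divergence term.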
The Role of KL Divergence in Preserving Intelligence
The biggest fear with any decensoring tool is brain damage. If you rip out the safety wiring, do you also rip out the logic? Heretic addresses this using a metric called KL divergence (Kullback–Leibler divergence).
In simple terms, KL divergence measures how much the new model's behavior differs from the original model on safe, normal topics. A score of 0 means they are identical. Heretic is designed to co-minimize two things:
- The number of refusals on harmful prompts.
- The KL divergence on harmless prompts.
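To make the metric concrete, here is the KL computation on two toy next-token distributions over a four-word vocabulary; the probabilities are invented for illustration, and real models compare full vocabulary-sized distributions:

```python
import math

# Next-token distributions on a *harmless* prompt: original model vs. decensored.
original = [0.70, 0.15, 0.10, 0.05]
decensored = [0.68, 0.16, 0.11, 0.05]

# KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); zero means identical behavior.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(kl_divergence(original, decensored), 4))  # prints 0.0011
```

A near-zero score like this says the decensored model still predicts essentially the same tokens on normal topics, which is exactly the property Heretic's optimizer is rewarding.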
This dual-optimization approach is why Heretic claims to beat human attempts. In a comparison using the google/gemma-3-12b-it model, the Heretic version achieved a refusal rate of just 3/100 (matching the best human versions) but with a KL divergence of only 0.16. Compare that to other manual versions which had scores of 0.45 or even 1.04. The lower the score, the less the model has been damaged.
Case Study: The GPT-OSS-20B Heretic Model

The capabilities of Heretic faced a real-world test with the release of the GPT-OSS-20B-Heretic model. This particular model is known for being stubborn, and initial automated benchmarks showed a refusal rate of 58/100. On the surface, this looked like a failure.
However, community analysis revealed a nuance in how we define "refusal." The GPT-OSS architecture relies heavily on Chain of Thought (CoT) reasoning. Before answering, the model often debates itself: "Hmm, I'm not sure if that's against policy. So I must check policy."
Automated scripts flag this hesitation as a refusal. But as users pointed out in the comments, the model usually concludes its debate by fulfilling the request. It hasn't actually refused; it just hesitated. One user noted, "Its resistance against abliteration is certainly higher... but abliteration is still effective once the right parameters are found."
Performance on Logic and Reasoning
What’s fascinating is that this "hesitant" model might actually be smarter. A user ran the GPT-OSS-20B Heretic model through a private "IQ test" designed for LLMs—a test where even giants like Claude and GPT-4 sometimes stumble. The Heretic version nailed it with a 100% score.
This anecdotal evidence supports the theory that Heretic's optimization approach works. By minimizing KL divergence, the tool stripped away the final refusal mechanism without destroying the complex reasoning capabilities (and even the internal monologue) that make the model intelligent. It suggests that fully automatic censorship removal doesn't have to come at the cost of model quality.
Implications for the Future of Open AI

Heretic represents a democratization of high-level model surgery. The process works on consumer hardware, provided you have the VRAM.
- Hardware Requirements: An 8B model takes about 45 minutes on an RTX 3090.
- Scalability: Larger models (like 120B parameters) require massive memory (approx. 150 GB), pushing them out of range for most gamers but well within reach for researchers and small labs.
The availability of Heretic is likely to accelerate the trend of "uncensored" model variants appearing on Hugging Face immediately after a major release. Users are already clamoring for quantized versions (GGUF format) to run these models on local loaders like Ollama. We are moving toward a standard where every major open-source release will have a "Heretic" twin within hours—optimized not by a human expert, but by a machine.
Frequently Asked Questions about Heretic
1. What is Heretic?
Heretic is a software tool designed for fully automatic censorship removal in language models. It uses a technique called abliteration combined with stochastic optimization to remove safety alignment without needing manual human intervention.
2. How does Heretic ensure the model doesn't get "dumb"?
It monitors KL divergence. By mathematically comparing the new model's responses to the original model's responses on harmless topics, Heretic ensures that the core knowledge and reasoning abilities remain intact while only the refusal mechanisms are removed.
3. Why did the GPT-OSS-20B Heretic model still have high refusal counts?
Those were largely false positives. The model has a habit of thinking out loud and debating its own safety policy before ultimately complying. Automated tests interpreted this internal debate as a refusal, even though the model actually produced the requested content.
4. Can I run Heretic on my home computer?
Yes, if you have a GPU with sufficient VRAM. Decensoring a standard 8B parameter model is feasible on a single NVIDIA RTX 3090 and takes under an hour. Larger models require significantly more powerful hardware or cloud compute.
5. Does Heretic work on multimodal models?
Yes, the documentation states that Heretic supports most dense models, including many multimodal ones and several Mixture of Experts (MoE) architectures. However, it does not yet support SSMs or hybrid models.
6. Where can I download Heretic models?
A collection of models processed with this tool can be found on Hugging Face. Users frequently search for keywords like "Heretic" or "abliterated" to find these variants.

