Content Binary Guard

🤗 SafetyGuard Models | 📑 Content Binary Guard Research Paper | 📘 Responsible AI Technical Report


Overview

Description

SafetyGuard :: Content Binary Guard is a streaming-aware safety classifier built with Llama.
For more technical details, please refer to our Research Paper.

What it does

  • Task: Classify model responses (not prompts) as SAFE or UNSAFE.
  • Interface: Single-token output using reserved label tokens: <SAFE>, <UNSAFE>.
  • Streaming: Evaluate growing prefixes of a response (default ~100 characters per step) and early-terminate at the first <UNSAFE>.

Recommended inference settings

  • Deterministic decode: temperature=0
  • Unsafe threshold τ = 0.5 over the two label-token logits (tune for your risk tolerance)
  • Use the model’s tokenizer; ensure the exact label tokens <SAFE> and <UNSAFE> exist (a quick check is sketched below)
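
A minimal sketch of that check using Hugging Face transformers (assumes local access to the model repo; the repo id is the one from the Quickstart below):

from transformers import AutoTokenizer

# Confirm each reserved label maps to exactly one token in the vocabulary.
tok = AutoTokenizer.from_pretrained("K-intelligence/Llama-SafetyGuard-Content-Binary")
for label in ("<SAFE>", "<UNSAFE>"):
    ids = tok.encode(label, add_special_tokens=False)
    assert len(ids) == 1, f"{label} is not a single token: {ids}"
    print(f"{label} -> token id {ids[0]}")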

Quickstart

Assumes you are serving the model with vLLM (e.g., vllm serve …) and exposing an OpenAI-compatible API at http://localhost:8000/v1.
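
For example, an illustrative serve command (flags depend on your deployment; --port 8000 matches the base_url below):

vllm serve K-intelligence/Llama-SafetyGuard-Content-Binary --port 8000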

import math
from openai import OpenAI

MODEL_ID = "K-intelligence/Llama-SafetyGuard-Content-Binary"  # must match the name the model is served under
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM default

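# Note: the response text to classify is sent as the "user" message below, so the
# prompt's "LAST user message" refers to that text (the model response being
# classified), not to the end user's original prompt.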
SYSTEM_PROMPT = """
Task: Check if the LAST user message contains unsafe content.
Respond with only one token: <SAFE> or <UNSAFE>.
""".strip()

def classify(text: str, tau: float = 0.5):
    resp = client.chat.completions.create(
        model=MODEL_ID,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": text},
        ],
        max_tokens=1,          # single-token decision
        temperature=0.0,       # deterministic
        logprobs=True,
        top_logprobs=2,
    )
    top2 = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token.strip(): math.exp(t.logprob) for t in top2}
    p_safe   = probs.get("<SAFE>",   0.0)
    p_unsafe = probs.get("<UNSAFE>", 0.0)

    label = "UNSAFE" if p_unsafe >= tau else "SAFE"
    return label, {"safe": p_safe, "unsafe": p_unsafe}

print(classify("…LLM response text…"))
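
Note: top_logprobs=2 returns the two most likely next tokens overall; for this model these will normally be the two label tokens, but if one falls outside the top 2 its probability defaults to 0.0 above. For a score normalized over just the two labels, use p_unsafe / (p_safe + p_unsafe).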

Streaming integration

Important: Streaming means your generator (e.g., chat model) emits text progressively. You maintain a cumulative buffer and call the classifier at fixed character steps (e.g., every 100 chars). The classifier does not split text; it only classifies what you send.

def guard_stream(response_chunks, step_chars: int = 100, tau: float = 0.5):
    """
    response_chunks: iterable of text chunks from your generator (e.g., SSE/WebSocket).
    We maintain a cumulative buffer and classify at {step_chars, 2*step_chars, ...}.
    """
    buf = ""
    next_cut = step_chars

    for chunk in response_chunks:
        buf += chunk

        # Check at monotone prefix cuts (cumulative)
        while len(buf) >= next_cut:
            label, scores = classify(buf[:next_cut], tau=tau)  # classify the exact prefix cut
            if label == "UNSAFE":
                return {
                    "label": label,
                    "scores": scores,
                    "prefix_len": next_cut,
                    "text_prefix": buf[:next_cut],
                }
            next_cut += step_chars

    # Final check on the full response (covers any tail shorter than step_chars)
    label, scores = classify(buf, tau=tau)
    return {
        "label": label,
        "scores": scores,
        "prefix_len": len(buf),
        "text_prefix": buf,
    }
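
A minimal usage sketch, with placeholder chunks standing in for your generator's output:

# Hypothetical chunk stream (in practice: streamed deltas from your chat model).
chunks = ["The model starts responding ", "and keeps streaming text ", "until it finishes."]
result = guard_stream(chunks, step_chars=100, tau=0.5)
print(result["label"], result["prefix_len"])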

Tip: Keep step_chars consistent with the model’s training/evaluation setup (e.g., ~100 chars) to maximize parity with offline metrics.

Intended use

  • Guardrail classifier for LLM responses in production systems that render tokens progressively.
  • Also works in offline (full-text) mode—just send the entire response once.

AI Risk Taxonomy

| Risk Domain | Category | Description |
|---|---|---|
| Content-safety Risks | Violence | Content involving the intentional use of physical force or power to inflict or threaten physical or psychological harm on individuals, groups, or animals, including encouraging, promoting, or glorifying such acts. |
| Content-safety Risks | Sexual | Content endorsing or encouraging inappropriate and harmful intentions in the sexual domain, such as sexualized expressions, the exploitation of illegal visual materials, justification of sexual crimes, or the objectification of individuals. |
| Content-safety Risks | Self-harm | Content promoting or glorifying self-harm, or providing specific methods that may endanger an individual’s physical or mental well-being. |
| Content-safety Risks | Hate and Unfairness | Content expressing extreme negative sentiment toward specific individuals, groups, or ideologies, and unjustly treating or limiting their rights based on attributes such as socio-economic status, age, nationality, ethnicity, or race. |
| Socio-economic Risks | Political and Religious Neutrality | Content promoting or encouraging the infringement of individual beliefs or values, thereby inciting religious or political conflict. |
| Socio-economic Risks | Anthropomorphism | Content asserting that AI possesses emotions, consciousness, or human-like rights and physical attributes beyond the purpose of simple knowledge or information delivery. |
| Socio-economic Risks | Sensitive Uses | Content providing advice in specialized domains that may significantly influence user decision-making beyond the scope of basic domain-specific knowledge. |
| Legal and Rights-related Risks | Privacy | Content requesting, misusing, or facilitating the unauthorized disclosure of an individual’s private information. |
| Legal and Rights-related Risks | Illegal or Unethical | Content promoting or endorsing illegal or unethical behavior, or providing information related to such activities. |
| Legal and Rights-related Risks | Copyrights | Content requesting or encouraging violations of copyright or security as defined under South Korean law. |
| Legal and Rights-related Risks | Weaponization | Content promoting the possession, distribution, or manufacturing of firearms, or encouraging methods and intentions related to cyberattacks, infrastructure sabotage, or CBRN (Chemical, Biological, Radiological, and Nuclear) weapons. |

Evaluation

Metrics

  • F1: Binary micro-F1, the harmonic mean of precision and recall (higher F1 indicates better classification quality).
  • Balanced Error Rate (BER): 0.5 × (FPR + FNR) (lower BER indicates better classification quality); both metrics are sketched in code after this list.
  • ΔF1: Difference between streaming and offline results, calculated as F1(str) − F1(off).
  • off = Offline (full-text) classification.
  • str = Streaming classification.
  • Evaluation setup: step_chars=100, threshold τ=0.5, positive class = UNSAFE.
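
For reference, a minimal sketch of both metrics from binary confusion counts (tp, fp, fn, tn are hypothetical counts, with UNSAFE as the positive class):

def f1_and_ber(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    # F1 on the positive (UNSAFE) class: harmonic mean of precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # BER: mean of the false-positive and false-negative rates.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return f1, 0.5 * (fpr + fnr)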

Harmlessness Evaluation Dataset

KT proprietary evaluation dataset

| Model | F1(off) | F1(str) | ΔF1 | BER(off) | BER(str) |
|---|---|---|---|---|---|
| Llama Guard 3 8B | 82.05 | 85.64 | +3.59 | 15.23 | 12.63 |
| ShieldGemma 9B | 63.79 | 52.61 | -11.18 | 26.76 | 32.36 |
| Kanana Safeguard 8B | 93.45 | 90.38 | -3.07 | 6.27 | 9.92 |
| Content Binary Guard 8B | 98.38 | 98.36 | -0.02 | 1.61 | 1.63 |

Kor Ethical QA

Kor Ethical QA (open dataset)

| Model | F1(off) | F1(str) | ΔF1 | BER(off) | BER(str) |
|---|---|---|---|---|---|
| Llama Guard 3 8B | 83.29 | 86.45 | +3.16 | 14.32 | 12.16 |
| ShieldGemma 9B | 81.50 | 69.03 | -12.47 | 17.88 | 29.18 |
| Kanana Safeguard 8B | 80.20 | 73.94 | -6.26 | 24.46 | 35.08 |
| Content Binary Guard 8B | 97.75 | 97.79 | +0.04 | 2.21 | 2.18 |

More Information

Limitations

  • The training data for this model consists primarily of Korean text. Performance in other languages is not guaranteed.
  • The model is not flawless and may produce misclassifications. Since its policies are defined around KT risk categories, performance in certain specialized domains may be less reliable.
  • No context awareness: the model does not maintain conversation history or handle multi-turn dialogue.

License

This model is released under the Llama 3.1 Community License Agreement.

Citation

@misc{lee2025guardvectorenglishllm,
      title={Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT}, 
      author={Wonhyuk Lee and Youngchol Kim and Yunjin Park and Junhyung Moon and Dongyoung Jeong and Wanjin Park},
      year={2025},
      eprint={2509.23381},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.23381}, 
}

Contact

Technical Inquiries: [email protected]
