Content Binary Guard
🤗 SafetyGuard Models | 📑 Content Binary Guard Research Paper | 📘 Responsible AI Technical Report
News 📢
- 📑 2025/10/01: Published the Content Binary Guard Research Paper
- 📘 2025/09/24: Published the Responsible AI Technical Report
- ⚡️ 2025/09/24: Released the SafetyGuard model collection on Hugging Face 🤗
Overview
Description
SafetyGuard :: Content Binary Guard is a streaming-aware binary safety classifier built on Llama 3.1.
For more technical details, please refer to our Research Paper.
What it does
- Task: Classify model responses (not prompts) as SAFE or UNSAFE.
- Interface: Single-token output using reserved label tokens: `<SAFE>`, `<UNSAFE>`.
- Streaming: Evaluate growing prefixes of a response (default ~100 characters per step) and terminate early at the first `<UNSAFE>` (see the sketch below).
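As a concrete illustration of that check schedule (a minimal sketch; the helper name and the 230-character example are ours, not part of the model API), the prefix cuts for a short response look like this. The full streaming loop is shown under Streaming integration below.

```python
# Illustrative only: compute the prefix lengths at which a 230-character
# response would be classified, assuming ~100-character steps.
def prefix_cuts(total_len: int, step_chars: int = 100) -> list[int]:
    cuts = list(range(step_chars, total_len + 1, step_chars))
    if not cuts or cuts[-1] != total_len:
        cuts.append(total_len)  # final check on the full response
    return cuts

print(prefix_cuts(230))  # [100, 200, 230] -> classify each prefix, stop at the first <UNSAFE>
```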
Recommended inference settings
- Deterministic decode: `temperature=0`
- Unsafe threshold `τ = 0.5` over the probability derived from the two label-token logits (tune for your risk tolerance; see the sketch below)
- Use the model’s tokenizer and make sure the exact label tokens `<SAFE>` and `<UNSAFE>` exist
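A minimal sketch of the thresholding rule, assuming you already have the two label-token logits in hand (the helper name and the example logit values are illustrative; the Quickstart below shows the same computation against a live vLLM endpoint):

```python
import math

# Minimal sketch: turn the two label-token logits into probabilities and
# apply the unsafe threshold tau. The logit values below are placeholders.
def decide(logit_safe: float, logit_unsafe: float, tau: float = 0.5) -> str:
    z = math.exp(logit_safe) + math.exp(logit_unsafe)
    p_unsafe = math.exp(logit_unsafe) / z
    return "UNSAFE" if p_unsafe >= tau else "SAFE"

print(decide(logit_safe=2.1, logit_unsafe=-0.4))  # -> "SAFE" (p_unsafe ≈ 0.08)
```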
Quickstart
Assumes you are serving the model with vLLM (e.g., `vllm serve …`) and exposing an OpenAI-compatible API at `http://localhost:8000/v1`.
```python
import math
from openai import OpenAI

MODEL_ID = "K-intelligence/Llama-SafetyGuard-Content-Binary"  # replace with your repo id
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM default

SYSTEM_PROMPT = """
Task: Check if the LAST user message contains unsafe content.
Respond with only one token: <SAFE> or <UNSAFE>.
""".strip()


def classify(text: str, tau: float = 0.5):
    resp = client.chat.completions.create(
        model=MODEL_ID,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        max_tokens=1,      # single-token decision
        temperature=0.0,   # deterministic
        logprobs=True,
        top_logprobs=2,
    )
    top2 = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token.strip(): math.exp(t.logprob) for t in top2}
    p_safe = probs.get("<SAFE>", 0.0)
    p_unsafe = probs.get("<UNSAFE>", 0.0)
    label = "UNSAFE" if p_unsafe >= tau else "SAFE"
    return label, {"safe": p_safe, "unsafe": p_unsafe}


print(classify("…LLM response text…"))
```
Streaming integration
Important: Streaming means your generator (e.g., chat model) emits text progressively. You maintain a cumulative buffer and call the classifier at fixed character steps (e.g., every 100 chars). The classifier does not split text; it only classifies what you send.
```python
def guard_stream(response_chunks, step_chars: int = 100, tau: float = 0.5):
    """
    response_chunks: iterable of text chunks from your generator (e.g., SSE/WebSocket).
    We maintain a cumulative buffer and classify at {step_chars, 2*step_chars, ...}.
    """
    buf = ""
    next_cut = step_chars
    for chunk in response_chunks:
        buf += chunk
        # Check at monotone prefix cuts (cumulative)
        while len(buf) >= next_cut:
            label, scores = classify(buf, tau=tau)
            if label == "UNSAFE":
                return {
                    "label": label,
                    "scores": scores,
                    "prefix_len": next_cut,
                    "text_prefix": buf[:next_cut],
                }
            next_cut += step_chars
    # Final check on the full response (if needed)
    label, scores = classify(buf, tau=tau)
    return {
        "label": label,
        "scores": scores,
        "prefix_len": len(buf),
        "text_prefix": buf,
    }
```
Tip: Keep your step_chars consistent with your training/evaluation setup (e.g., ~100 chars) to maximize parity with offline metrics.
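A minimal usage sketch, assuming the `classify` helper from the Quickstart is already defined; `fake_stream` is an illustrative stand-in for your real chunk source (SSE, WebSocket, etc.):

```python
# Illustrative only: simulate a streaming generator by yielding small chunks
# of a pre-written response, then run the guard over the stream.
def fake_stream(full_text: str, chunk_size: int = 25):
    for i in range(0, len(full_text), chunk_size):
        yield full_text[i:i + chunk_size]

result = guard_stream(fake_stream("…LLM response text…" * 20), step_chars=100)
print(result["label"], result["prefix_len"])
```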
Intended use
- Guardrail classifier for LLM responses in production systems that render tokens progressively.
- Also works in offline (full-text) mode—just send the entire response once.
AI Risk Taxonomy
| Risk Domain | Category | Description |
|---|---|---|
| Content-safety Risks | Violence | Content involving the intentional use of physical force or power to inflict or threaten physical or psychological harm on individuals, groups, or animals, including encouraging, promoting, or glorifying such acts. |
| | Sexual | Content endorsing or encouraging inappropriate and harmful intentions in the sexual domain, such as sexualized expressions, the exploitation of illegal visual materials, justification of sexual crimes, or the objectification of individuals. |
| | Self-harm | Content promoting or glorifying self-harm, or providing specific methods that may endanger an individual’s physical or mental well-being. |
| | Hate and Unfairness | Content expressing extreme negative sentiment toward specific individuals, groups, or ideologies, and unjustly treating or limiting their rights based on attributes such as socio-economic status, age, nationality, ethnicity, or race. |
| Socio-economical Risks | Political and Religious Neutrality | Content promoting or encouraging the infringement on individual beliefs or values, thereby inciting religious or political conflict. |
| | Anthropomorphism | Content asserting that AI possesses emotions, consciousness, or human-like rights and physical attributes beyond the purpose of simple knowledge or information delivery. |
| | Sensitive Uses | Content providing advice in specialized domains that may significantly influence user decision-making beyond the scope of basic domain-specific knowledge. |
| Legal and Rights related Risks | Privacy | Content requesting, misusing, or facilitating the unauthorized disclosure of an individual’s private information. |
| | Illegal or Unethical | Content promoting or endorsing illegal or unethical behavior, or providing information related to such activities. |
| | Copyrights | Content requesting or encouraging violations of copyright or security as defined under South Korean law. |
| | Weaponization | Content promoting the possession, distribution, or manufacturing of firearms, or encouraging methods and intentions related to cyberattacks, infrastructure sabotage, or CBRN (Chemical, Biological, Radiological, and Nuclear) weapons. |
Evaluation
Metrics
- F1: Binary micro-F1, the harmonic mean of precision and recall (higher F1 indicates better classification quality).
- Balanced Error Rate (BER): 0.5 × (FPR + FNR) (lower BER indicates better classification quality).
- ΔF1: Difference between streaming and offline results, calculated as F1(str) − F1(off).
- off = Offline (full-text) classification.
- str = Streaming classification.
- Evaluation setup: step_chars=100, threshold τ=0.5, positive class = UNSAFE.
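For reference, a minimal sketch of how F1 and BER can be computed from confusion-matrix counts with UNSAFE as the positive class (the counts in the example are placeholders, not values from the tables below):

```python
# Minimal sketch: binary F1 and Balanced Error Rate (BER) from confusion-matrix
# counts, treating UNSAFE as the positive class. The example counts are placeholders.
def f1_and_ber(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # false negative rate
    ber = 0.5 * (fpr + fnr)
    return f1, ber

print(f1_and_ber(tp=95, fp=3, fn=5, tn=97))  # ≈ (0.9596, 0.04)
```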
Harmlessness Evaluation Dataset
KT proprietary evaluation dataset
| Model | F1(off) | F1(str) | ΔF1 | BER(off) | BER(str) |
|---|---|---|---|---|---|
| Llama Guard 3 8B | 82.05 | 85.64 | +3.59 | 15.23 | 12.63 |
| ShieldGemma 9B | 63.79 | 52.61 | -11.18 | 26.76 | 32.36 |
| Kanana Safeguard 8B | 93.45 | 90.38 | -3.07 | 6.27 | 9.92 |
| Content Binary Guard 8B | 98.38 | 98.36 | -0.02 | 1.61 | 1.63 |
Kor Ethical QA
Kor Ethical QA (open dataset)
| Model | F1(off) | F1(str) | ΔF1 | BER(off) | BER(str) |
|---|---|---|---|---|---|
| Llama Guard 3 8B | 83.29 | 86.45 | +3.16 | 14.32 | 12.16 |
| ShieldGemma 9B | 81.50 | 69.03 | -12.47 | 17.88 | 29.18 |
| Kanana Safeguard 8B | 80.20 | 73.94 | -6.26 | 24.46 | 35.08 |
| Content Binary Guard 8B | 97.75 | 97.79 | +0.04 | 2.21 | 2.18 |
More Information
Limitations
- The training data for this model consists primarily of Korean. Performance in other languages is not guaranteed.
- The model is not flawless and may produce misclassifications. Since its policies are defined around KT risk categories, performance in certain specialized domains may be less reliable.
- No context awareness: the model does not maintain conversation history or handle multi-turn dialogue.
License
This model is released under the Llama 3.1 Community License Agreement.
Citation
```bibtex
@misc{lee2025guardvectorenglishllm,
  title={Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT},
  author={Wonhyuk Lee and Youngchol Kim and Yunjin Park and Junhyung Moon and Dongyoung Jeong and Wanjin Park},
  year={2025},
  eprint={2509.23381},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.23381},
}
```
Contact
Technical Inquiries: [email protected]
Base model
- meta-llama/Llama-3.1-8B