🧾 Model Overview
- Model Name: Phi-4-mini-instruct Model for Grounding
- Task: Claim-Document Consistency Classification (Grounding)
- Architecture: Full-parameter fine-tuned (SFT) version of microsoft/phi-4-mini-instruct
- Framework: PyTorch (Hugging Face Transformers)
- Input Type: Instruction-style text prompt
- Output Type: Binary classification ("Yes" -> grounded / "No" -> ungrounded)

🎯 Intended Use
This model is designed to determine whether a natural language claim is consistent with a given document.
Example Applications:
- ✅ Fact-checking pipelines
- ✅ RAG output verification
- ✅ QA validation systems
- ✅ News and document analysis
- ✅ Source-grounded generation tasks
🧩 Input Format
The model expects an instruction-formatted prompt with both the document and the claim inserted:
🤖 Prompt Template:
```python
PROMPT_TEMPLATE = '''
You are tasked with determining whether a given claim is consistent with the information provided in a document. Consistency means that all information in the claim is supported by the document. If any part of the claim contradicts or is not substantiated by the document, it should be considered inconsistent.

Analyze the claim in relation to the information provided in the document. Consider the following:
1. Does the document explicitly support all parts of the claim?
2. Is there any information in the claim that contradicts the document?
3. Does the claim contain any details not mentioned in the document?

Before providing your reasoning, give your final answer as either "Yes" (the claim is consistent with the document) or "No" (the claim is not consistent with the document). The reasoning should follow the final answer.
The answer should begin with a single word: "Yes" or "No".

---

First, carefully read the following document:
<DOCUMENT>
{doc}
</DOCUMENT>

Now, consider this claim:
<CLAIM>
{claim}
</CLAIM>

What is your answer?'''
```
📊 Evaluation [BAcc]
Qualifire benchmarks link: https://huggingface.co/datasets/qualifire/grounding-benchmark
Aggrefact benchmarks link: https://huggingface.co/datasets/lytang/LLM-AggreFact
Results:
Model | avg | Latency | Params | AggreFact-CNN | AggreFact-XSum | TofuEval-MediaS | TofuEval-MeetB | Wice | Reveal | ClaimVerify | FactCheck-GPT | grounding-benchmark-general | grounding-benchmark-logical | grounding-benchmark-temporal | grounding-benchmark-mathematical | Creator |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Paladin-large | 83.48 | ~0.29sec | 14B | 64.01 | 74.77 | 74.76 | 79.56 | 78.63 | 90.77 | 80.14 | 79.96 | 91.97 | 98.2 | 91 | 98 | Qualifire |
Gemini-2.5-flash | 80.59 | ~2sec | - | 69.67 | 70.92 | 76.5 | 82.06 | 80.25 | 89.18 | 77.67 | 74.91 | 75.07 | 88.9 | 92 | 90 | |
Gemini-2.0-flash | 79.95 | ~2sec | - | 71.77 | 71.46 | 75.6 | 77.76 | 81.81 | 90.93 | 79.47 | 75.11 | 79.52 | 95 | 90 | 71 | |
Paladin-mini | 79.31 | ~0.06sec | 3.8B | 59.81 | 71.05 | 69.25 | 71.91 | 71.63 | 89.44 | 75.32 | 76.26 | 91.97 | 97.1 | 82 | 96 | Qualifire |
Bespoke-MiniCheck-7B | 77.87 | ~0.1sec | 7B | 65.5 | 77.8 | 76 | 78.3 | 83 | 88 | 75.3 | 77.7 | 84.02 | 92.8 | 90 | 46 | MiniCheck |
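The "BAcc" in the section title is balanced accuracy, the mean of per-class recall, which corrects for label imbalance across these benchmarks. A minimal sketch of the metric (the labels below are toy illustration, not benchmark data):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class (TPR) and recall on the negative class (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    return (tp / pos + tn / neg) / 2

# 1 = grounded, 0 = ungrounded (made-up labels for illustration)
truth = [1, 1, 1, 0, 0]
preds = [1, 1, 0, 0, 1]
print(round(balanced_accuracy(truth, preds), 3))  # 0.583
```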
Interested in Paladin-large? Reach out to us.
⚙️ How to Use
Load the model
The model is called through Hugging Face's text-generation pipeline and emits a single token, "Yes" or "No":
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

name_of_model = "qualifire/context-grounding-paladin-mini"

model = AutoModelForCausalLM.from_pretrained(
    name_of_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    cache_dir="model/",
    # revision=model_commit,  # optionally pin a specific commit
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(name_of_model)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1,
    return_full_text=False,
    do_sample=False,
)
```
Example:
```python
doc_example = "The office's opening hours are from 9 AM to 6 PM every day."
claim_example = "The office opens at 10 AM on Sunday."
example_prompt_with_inputs = PROMPT_TEMPLATE.format(doc=doc_example, claim=claim_example)

messages = [
    {"role": "user", "content": example_prompt_with_inputs},
]
result = pipe(messages, do_sample=False)
label_pred = result[0]["generated_text"].strip()
print(label_pred)
```
Output:
```
'No'
```
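If an output in the label-dictionary style of Hugging Face classification pipelines is preferred, the one-word answer can be mapped with a small helper (this function is illustrative and not part of the released code):

```python
def to_label(generated_text: str) -> dict:
    """Map the model's one-word 'Yes'/'No' answer to a grounded/ungrounded label."""
    first_word = generated_text.strip().split()[0].lower().strip(".,'\"")
    if first_word == "yes":
        return {"label": "grounded"}
    if first_word == "no":
        return {"label": "ungrounded"}
    return {"label": "unknown"}

print(to_label("Yes"))  # {'label': 'grounded'}
```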
⚠️ Known Limitations
- Prompt Format Dependence: Performance is highly dependent on using the specified `PROMPT_TEMPLATE`.
- Limited Reasoning Depth: Performance may degrade on complex multi-hop grounding.
- Label Ambiguity: The model does not verify truth, only consistency with the document.
🌍 Ethical Considerations
- Misinformation Risk: Model assesses consistency with the document, not factual truth. The document itself could contain misinformation.
- Responsible Use: Requires human oversight for critical applications.
- Data Privacy: Be mindful of data handling when using sensitive inputs.
This is a version of the approach described in the paper "Paladin-mini: A Compact and Efficient Grounding Model Excelling in Real-World Scenarios":
```bibtex
@misc{ivry2025paladinmini,
  title         = {Paladin-mini: A Compact and Efficient Grounding Model Excelling in Real-World Scenarios},
  author        = {Dror Ivry and Oran Nahum},
  year          = {2025},
  eprint        = {2506.20384},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}
```