🧾 Model Overview
- Model Name: Phi-4-mini-instruct Model for Grounding
- Task: Claim-Document Consistency Classification (Grounding)
- Architecture: Full-parameter fine-tuned (SFT) version of microsoft/phi-4-mini-instruct
- Framework: PyTorch (Hugging Face Transformers)
- Input Type: Instruction-style text prompt
- Output Type: Binary classification ("Yes" -> grounded / "No" -> ungrounded)

🎯 Intended Use
This model is designed to determine whether a natural language claim is consistent with a given document.
Example Applications:
- ✅ Fact-checking pipelines
- ✅ RAG output verification
- ✅ QA validation systems
- ✅ News and document analysis
- ✅ Source-grounded generation tasks
🧩 Input Format
The model expects an instruction-formatted prompt with both the document and the claim inserted:
🤖 Prompt Template:
```python
PROMPT_TEMPLATE = '''
You are tasked with determining whether a given claim is consistent with the information provided in a document. Consistency means that all information in the claim is supported by the document. If any part of the claim contradicts or is not substantiated by the document, it should be considered inconsistent.

Analyze the claim in relation to the information provided in the document. Consider the following:
1. Does the document explicitly support all parts of the claim?
2. Is there any information in the claim that contradicts the document?
3. Does the claim contain any details not mentioned in the document?

Before providing your reasoning, give your final answer as either "Yes" (the claim is consistent with the document) or "No" (the claim is not consistent with the document). The reasoning should follow the final answer.
The answer should begin with a single word: "Yes" or "No".

---

First, carefully read the following document:
<DOCUMENT>
{doc}
</DOCUMENT>

Now, consider this claim:
<CLAIM>
{claim}
</CLAIM>

What is your answer?'''
```
📊 Evaluation [BAcc]
Qualifire benchmarks link: https://huggingface.co/datasets/qualifire/grounding-benchmark
Aggrefact benchmarks link: https://huggingface.co/datasets/lytang/LLM-AggreFact
Results:
Model | avg | Latency | Params | AggreFact-CNN | AggreFact-XSum | TofuEval-MediaS | TofuEval-MeetB | Wice | Reveal | ClaimVerify | FactCheck-GPT | grounding-benchmark-general | grounding-benchmark-logical | grounding-benchmark-temporal | grounding-benchmark-mathematical | Creator |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Paladin-large | 83.48 | ~0.29sec | 14B | 64.01 | 74.77 | 74.76 | 79.56 | 78.63 | 90.77 | 80.14 | 79.96 | 91.97 | 98.2 | 91 | 98 | Qualifire |
Gemini-2.5-flash | 80.59 | ~2sec | - | 69.67 | 70.92 | 76.5 | 82.06 | 80.25 | 89.18 | 77.67 | 74.91 | 75.07 | 88.9 | 92 | 90 | |
Gemini-2.0-flash | 79.95 | ~2sec | - | 71.77 | 71.46 | 75.6 | 77.76 | 81.81 | 90.93 | 79.47 | 75.11 | 79.52 | 95 | 90 | 71 | |
Paladin-mini | 79.31 | ~0.06sec | 3.8B | 59.81 | 71.05 | 69.25 | 71.91 | 71.63 | 89.44 | 75.32 | 76.26 | 91.97 | 97.1 | 82 | 96 | Qualifire |
Bespoke-MiniCheck-7B | 77.87 | ~0.1sec | 7B | 65.5 | 77.8 | 76 | 78.3 | 83 | 88 | 75.3 | 77.7 | 84.02 | 92.8 | 90 | 46 | MiniCheck |
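The "BAcc" in the section title is balanced accuracy, the mean of per-class recall, which corrects for label imbalance across these benchmarks. A minimal sketch of the metric (the labels below are toy illustration, not benchmark data):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class (TPR) and recall on the negative class (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    return (tp / pos + tn / neg) / 2

# 1 = grounded, 0 = ungrounded (made-up labels for illustration)
truth = [1, 1, 1, 0, 0]
preds = [1, 1, 0, 0, 1]
print(round(balanced_accuracy(truth, preds), 3))  # 0.583
```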
Interested in Paladin-large? Reach out to us.
⚙️ How to Use
Load the model
The model is called through Hugging Face's text-generation pipeline and emits a single token, "Yes" or "No":
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

name_of_model = "qualifire/context-grounding-paladin-mini"

model = AutoModelForCausalLM.from_pretrained(
    name_of_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    cache_dir="model/",
    # revision=model_commit,  # optionally pin a specific commit
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(name_of_model)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1,
    return_full_text=False,
    do_sample=False,
)
```
Example:
```python
doc_example = "The office's opening hours are from 9 AM to 6 PM every day."
claim_example = "The office opens at 10 AM on Sunday."
example_prompt_with_inputs = PROMPT_TEMPLATE.format(doc=doc_example, claim=claim_example)

messages = [
    {"role": "user", "content": example_prompt_with_inputs},
]
result = pipe(messages, do_sample=False)
label_pred = result[0]["generated_text"].strip()
print(label_pred)
```
Output:
```
'No'
```
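If an output in the label-dictionary style of Hugging Face classification pipelines is preferred, the one-word answer can be mapped with a small helper (this function is illustrative and not part of the released code):

```python
def to_label(generated_text: str) -> dict:
    """Map the model's one-word 'Yes'/'No' answer to a grounded/ungrounded label."""
    first_word = generated_text.strip().split()[0].lower().strip(".,'\"")
    if first_word == "yes":
        return {"label": "grounded"}
    if first_word == "no":
        return {"label": "ungrounded"}
    return {"label": "unknown"}

print(to_label("Yes"))  # {'label': 'grounded'}
```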
⚠️ Known Limitations
- Prompt Format Dependence: Performance is highly dependent on using the specified `PROMPT_TEMPLATE`.
- Limited Reasoning Depth: Performance may degrade on complex multi-hop grounding.
- Label Ambiguity: The model does not verify truth, only consistency with the document.
🌍 Ethical Considerations
- Misinformation Risk: Model assesses consistency with the document, not factual truth. The document itself could contain misinformation.
- Responsible Use: Requires human oversight for critical applications.
- Data Privacy: Be mindful of data handling when using sensitive inputs.
This is a version of the approach described in the paper "Paladin-mini: A Compact and Efficient Grounding Model Excelling in Real-World Scenarios":
```bibtex
@misc{ivry2025paladinmini,
  title         = {Paladin-mini: A Compact and Efficient Grounding Model Excelling in Real-World Scenarios},
  author        = {Dror Ivry and Oran Nahum},
  year          = {2025},
  eprint        = {2506.20384},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}
```