Renamed to parm-2.

Please note: the low IFEval score is due to this model always reasoning, which limits its instruction following. This should not matter for most use cases.

GGUF/final version: https://huggingface.co/Pinkstack/PARM-V2-phi-4-16k-CoT-o1-gguf

This model can be merged with phi-4 based LLMs!

Phi-4 Technical Report | superthoughts 14B openllm detailed results

A Phi-4 model that has been tuned to be more advanced at reasoning.

Unlike with other Parm models, we had to optimize our fine-tuning process to ensure accuracy while still being able to release this model.

Training loss: 0.443800

Beats qwen/qwq at MATH, MuSR, and GPQA (MuSR being a reasoning benchmark).

Evaluation: (image)

The model can use this prompt format (a modified phi-4 prompt; adding a system prompt that tells the model to reason before responding gives a similar, if not better, response):

{{ if .System }}<|system|>
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|im_end|>
{{ end }}<|assistant|>{{ .CoT }}<|CoT|>
{{ .Response }}<|FinalAnswer|><|im_end|>

It is recommended to use a system prompt like this one:

You are a helpful ai assistant. Make sure to put your finalanswer at the end.
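
As a rough illustration, the prompt format and system prompt above can be rendered by hand in Python. This is a minimal sketch, assuming the special tokens are written literally as text and that generation begins right after `<|assistant|>`; `build_prompt` is a hypothetical helper, not part of any library.

```python
def build_prompt(user_prompt: str, system_prompt: str = "") -> str:
    """Render the prompt format up to the point where the model starts generating."""
    parts = []
    if system_prompt:
        # Optional system block, mirroring {{ if .System }} ... {{ end }}
        parts.append(f"<|system|>\n{system_prompt}<|im_end|>\n")
    if user_prompt:
        # User block, mirroring {{ if .Prompt }} ... {{ end }}
        parts.append(f"<|user|>\n{user_prompt}<|im_end|>\n")
    # The model continues from here with its reasoning and answer
    parts.append("<|assistant|>")
    return "".join(parts)


prompt = build_prompt(
    "What is 17 * 23?",
    system_prompt="You are a helpful ai assistant. Make sure to put your finalanswer at the end.",
)
print(prompt)
```

Runtimes that apply the template themselves (for example from a GGUF or Modelfile) do not need this step; it is only useful when sending raw text to the model.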

Open LLM Leaderboard Evaluation Results

Detailed results can be found here! Summarized results can be found here!

Metric	Value (%)
Average	31.17
IFEval (0-Shot)	5.15
BBH (3-Shot)	52.85
MATH Lvl 5 (4-Shot)	40.79
GPQA (0-shot)	19.02
MuSR (0-shot)	21.79
MMLU-PRO (5-shot)	47.43
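
The Average row can be checked against the six benchmark scores. This is a small sketch, assuming the average is the unweighted mean of the per-benchmark values in the table:

```python
# Scores copied from the table above (percent)
scores = {
    "IFEval (0-Shot)": 5.15,
    "BBH (3-Shot)": 52.85,
    "MATH Lvl 5 (4-Shot)": 40.79,
    "GPQA (0-shot)": 19.02,
    "MuSR (0-shot)": 21.79,
    "MMLU-PRO (5-shot)": 47.43,
}

# Unweighted mean across the six benchmarks
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 31.17
```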

Other leaderboard

According to https://llm.extractum.io/list/?benchmark=score_elo, this model is in the top 20 on their LMSys ELO score leaderboard.

🧀 Examples:

(q4_k_m, 10GB RTX 3080, 64GB memory, running inside MSTY; all examples use "You are a friendly ai assistant." as the system prompt.)

Examples 1-4: (images)

All generated locally

🧀 Information

⚠️ A low temperature must be used to ensure the model won't fail at reasoning. We use 0.3 - 0.8!
⚠️ Due to the current prompt format, the model may sometimes output <|FinalAnswer|> without providing a final answer at the end. You can ignore this or modify the prompt format.

This is our flagship model, with top-tier reasoning, rivaling gemini-flash-exp-2.0-thinking and o1 mini. Overall, the results are similar to both of them.
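
One way to cope with the `<|FinalAnswer|>` caveat is to split the raw output on the markers and detect an empty answer. This is a minimal sketch, assuming the markers appear literally in the decoded text and in the order given by the prompt format (reasoning, `<|CoT|>`, answer, `<|FinalAnswer|>`); `split_output` is a hypothetical helper.

```python
COT_END = "<|CoT|>"
ANSWER_END = "<|FinalAnswer|>"


def split_output(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, final_answer).

    final_answer is an empty string when the model emitted
    <|FinalAnswer|> without writing an answer before it.
    """
    text = text.replace("<|im_end|>", "")
    reasoning, sep, rest = text.partition(COT_END)
    if not sep:
        # No <|CoT|> marker: treat everything before <|FinalAnswer|> as reasoning
        reasoning, _, _ = text.partition(ANSWER_END)
        return reasoning.strip(), ""
    answer, _, _ = rest.partition(ANSWER_END)
    return reasoning.strip(), answer.strip()


reasoning, answer = split_output(
    "17 * 23 = 17 * 20 + 17 * 3 = 391.<|CoT|>\nThe answer is 391.<|FinalAnswer|><|im_end|>"
)
if not answer:
    # Fall back to the reasoning text when no final answer was given
    answer = reasoning
print(answer)
```

Falling back to the reasoning text is only one possible choice; regenerating the response is another.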

Uploaded model

Developed by: Pinkstack
License: MIT
Finetuned from model: microsoft/phi-4

This phi-4 model was trained with Unsloth and Hugging Face's TRL library.

Pinkstack/Parm-2-CoT-14B-16k-o1-QwQ