---
library_name: transformers
license: apache-2.0
base_model:
- nbeerbower/mistral-nemo-kartoffel-12B
datasets:
- nbeerbower/Schule-DPO
- nbeerbower/Purpura-DPO
- nbeerbower/Arkhaios-DPO
- jondurbin/truthy-dpo-v0.1
- antiven0m/physical-reasoning-dpo
- Atsunori/HelpSteer2-DPO
- GeneralReasoning/GeneralThought-430K
- nvidia/OpenMathReasoning
- nvidia/OpenCodeReasoning
tags:
- orpo
- uncensored
- reasoning
- chain-of-thought
- qlora
- experimental
---
> 🧪 **Experimental Model**
>
> This is one of many experimental iterations I'm sharing publicly while I mess around with training parameters and ideas. It's not a "real" release - just me being transparent about my learning process. Feel free to look under the hood, but don't expect anything production-ready!

# Denker-mistral-nemo-12B
**Denker** is a small, uncensored, reasoning-focused model finetuned using [ORPO and QLoRA](https://huggingface.co/blog/mlabonne/orpo-llama-3) on top of [mistral-nemo-kartoffel-12B](https://huggingface.co/nbeerbower/mistral-nemo-kartoffel-12B).
This run experiments with the Qwen-style chat template and a `<think>...</think>` reasoning structure, without modifying the base vocabulary. All tuning was done via LoRA adapters.
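A minimal usage sketch with 🤗 Transformers is below. The repo id `nbeerbower/Denker-mistral-nemo-12B` is assumed from the model name, and the bundled chat template is assumed to emit the `<think>` tags described above; treat this as illustrative rather than a tested recipe.

```python
# Hedged sketch: repo id assumed from the model name; the chat template
# is assumed to produce Qwen-style <think>...</think> reasoning turns.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nbeerbower/Denker-mistral-nemo-12B"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 23?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# Split the reasoning trace from the final answer, if the model chose to think.
if "</think>" in text:
    thought, answer = text.split("</think>", 1)
    print("reasoning:", thought.replace("<think>", "").strip())
    print("answer:", answer.strip())
else:
    print(text.strip())
```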
## Finetuning Details
- **Method:** ORPO
- **Epochs:** 0.25
- **Learning Rate:** 8e-6, cosine decay w/ 5% warmup
- **Batch Size:** 1 x 64 (64 effective)
- **Max Grad Norm:** 0.5
- **LoRA Rank:** 128
- **Hardware:** 1x NVIDIA RTX A6000
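As a point of reference, the settings above map onto TRL's `ORPOTrainer` roughly as follows. This is a hedged reconstruction, not the actual training script: the LoRA alpha and dropout, the ORPO beta, the quantization details, and anything else not listed above are assumptions.

```python
# Hedged reconstruction of the run's config using TRL + PEFT.
# Only the values listed above come from this card; everything else
# (alpha, dropout, beta, quantization settings) is assumed.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import ORPOConfig, ORPOTrainer

base = "nbeerbower/mistral-nemo-kartoffel-12B"
bnb = BitsAndBytesConfig(               # QLoRA: 4-bit NF4 base weights (assumed)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# One dataset from the mix, as a stand-in; see Dataset Composition below.
dataset = load_dataset("nbeerbower/Schule-DPO", split="train")

peft_config = LoraConfig(
    r=128,              # LoRA rank from the card
    lora_alpha=128,     # assumed
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)

args = ORPOConfig(
    output_dir="denker-orpo",
    num_train_epochs=0.25,             # partial epoch, as listed
    learning_rate=8e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,    # 64 effective batch size
    max_grad_norm=0.5,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```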
## Dataset Composition
### Thinking Disabled
The following preference datasets were used in full, with thinking disabled:
* [nbeerbower/Schule-DPO](https://huggingface.co/datasets/nbeerbower/Schule-DPO)
* [nbeerbower/Purpura-DPO](https://huggingface.co/datasets/nbeerbower/Purpura-DPO)
* [nbeerbower/Arkhaios-DPO](https://huggingface.co/datasets/nbeerbower/Arkhaios-DPO)
* [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1)
* [antiven0m/physical-reasoning-dpo](https://huggingface.co/datasets/antiven0m/physical-reasoning-dpo)
* [Atsunori/HelpSteer2-DPO](https://huggingface.co/datasets/Atsunori/HelpSteer2-DPO)
### Chain of Thought
30,000 samples were drawn from each of the following datasets, with thinking enabled:
* [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
* [nvidia/OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning)
* [nvidia/OpenCodeReasoning](https://huggingface.co/datasets/nvidia/OpenCodeReasoning)
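The sampling described above might look like the following with 🤗 Datasets. This is purely illustrative: the actual preprocessing, split names, column mapping, and seed were not published.

```python
# Hedged sketch of the mixture: full DPO sets with thinking disabled, plus
# 30,000 samples from each reasoning set with thinking enabled. Split names
# and the "thinking" flag are assumptions, not the published pipeline.
from datasets import load_dataset

NO_THINK = [
    "nbeerbower/Schule-DPO",
    "nbeerbower/Purpura-DPO",
    "nbeerbower/Arkhaios-DPO",
    "jondurbin/truthy-dpo-v0.1",
    "antiven0m/physical-reasoning-dpo",
    "Atsunori/HelpSteer2-DPO",
]
THINK = [
    "GeneralReasoning/GeneralThought-430K",
    "nvidia/OpenMathReasoning",
    "nvidia/OpenCodeReasoning",
]

parts = []
for repo in NO_THINK:
    ds = load_dataset(repo, split="train")
    parts.append(ds.add_column("thinking", [False] * len(ds)))
for repo in THINK:
    ds = load_dataset(repo, split="train").shuffle(seed=42).select(range(30_000))
    parts.append(ds.add_column("thinking", [True] * len(ds)))

# The source sets have different schemas, so a real pipeline would normalize
# each to prompt/chosen/rejected columns before concatenating, e.g.:
# from datasets import concatenate_datasets
# mixed = concatenate_datasets(normalized_parts).shuffle(seed=42)
```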
## Results
### Observations
The model will sometimes skip the `<think>` block entirely and answer directly.
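One possible workaround (untested, reusing the names from the usage sketch above) is to pre-seed the assistant turn with an opening `<think>` tag:

```python
# Untested workaround: seed generation with an opening <think> tag so the
# model starts inside its reasoning block rather than answering directly.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompt += "<think>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
```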
### Evals
TBD