---
library_name: transformers
license: apache-2.0
base_model:
- nbeerbower/mistral-nemo-kartoffel-12B
datasets:
- nbeerbower/Schule-DPO
- nbeerbower/Purpura-DPO
- nbeerbower/Arkhaios-DPO
- jondurbin/truthy-dpo-v0.1
- antiven0m/physical-reasoning-dpo
- Atsunori/HelpSteer2-DPO
- GeneralReasoning/GeneralThought-430K
- nvidia/OpenMathReasoning
- nvidia/OpenCodeReasoning
tags:
- orpo
- uncensored
- reasoning
- chain-of-thought
- qlora
- experimental
---

> 🧪 **Experimental Model**
>
> This is one of many experimental iterations I'm sharing publicly while I mess around with training parameters and ideas. It's not a "real" release - just me being transparent about my learning process. Feel free to look under the hood, but don't expect anything production-ready!


![image/png](https://huggingface.co/nbeerbower/Denker-mistral-nemo-12B/resolve/main/denker_cover.png?download=true)

# Denker-mistral-nemo-12B

**Denker** is a small, uncensored, reasoning-focused model finetuned using [ORPO and QLoRA](https://huggingface.co/blog/mlabonne/orpo-llama-3) on top of [mistral-nemo-kartoffel-12B](https://huggingface.co/nbeerbower/mistral-nemo-kartoffel-12B).

This run experiments with a Qwen-style chat template and a `<think>...</think>` reasoning structure, without modifying the base vocabulary. All tuning was done via LoRA.
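
A minimal inference sketch, assuming the stock `transformers` text-generation flow (the prompt and the illustrated reply format below are assumptions, not guaranteed output):

```python
# Minimal inference sketch (assumed workflow, not an official example).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nbeerbower/Denker-mistral-nemo-12B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 23? Think it through."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected shape of a reply (illustrative only):
# <think> 17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391 </think>
# 17 * 23 = 391.
```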

## Finetuning Details

- **Method:** ORPO
- **Epochs:** 0.25
- **Learning Rate:** 8e-6, cosine decay w/ 5% warmup
- **Batch Size:** 1 x 64 (64 effective)
- **Max Grad Norm:** 0.5
- **LoRA Rank:** 128
- **Hardware:** 1x NVIDIA RTX A6000
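
For reference, here is roughly how the hyperparameters above map onto TRL's `ORPOTrainer` with a QLoRA setup. This is a reconstruction sketch, not the actual training script; `lora_alpha`, the ORPO `beta`, and the single dataset shown are assumptions.

```python
# Reconstruction sketch of the ORPO + QLoRA run using TRL.
# Hyperparameters match the card; everything else is an assumption.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

base = "nbeerbower/mistral-nemo-kartoffel-12B"

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# Rank 128 per the card; alpha is an assumption.
peft_config = LoraConfig(r=128, lora_alpha=128, task_type="CAUSAL_LM")

args = ORPOConfig(
    output_dir="denker-orpo",
    num_train_epochs=0.25,           # partial epoch, per the card
    learning_rate=8e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # 5% warmup
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,  # 1 x 64 = 64 effective
    max_grad_norm=0.5,
    bf16=True,
)

# ORPO expects prompt/chosen/rejected columns; one dataset shown for brevity.
train_dataset = load_dataset("nbeerbower/Schule-DPO", split="train")

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in TRL < 0.12
    peft_config=peft_config,
)
trainer.train()
```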

## Dataset Composition

Thinking disabled:

* [nbeerbower/Schule-DPO](https://huggingface.co/datasets/nbeerbower/Schule-DPO)
* [nbeerbower/Purpura-DPO](https://huggingface.co/datasets/nbeerbower/Purpura-DPO)
* [nbeerbower/Arkhaios-DPO](https://huggingface.co/datasets/nbeerbower/Arkhaios-DPO)
* [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1)
* [antiven0m/physical-reasoning-dpo](https://huggingface.co/datasets/antiven0m/physical-reasoning-dpo)
* [Atsunori/HelpSteer2-DPO](https://huggingface.co/datasets/Atsunori/HelpSteer2-DPO)

### Chain of Thought

30,000 samples were drawn from each of the following datasets, with thinking enabled (a subsampling sketch follows the list).

* [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
* [nvidia/OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning)
* [nvidia/OpenCodeReasoning](https://huggingface.co/datasets/nvidia/OpenCodeReasoning)
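
A sketch of the 30,000-sample cap using `datasets` (the shuffle seed and split names are assumptions, and the real pipeline also has to reformat the reasoning traces into the `<think>` template):

```python
# Hypothetical subsampling sketch; seed and split names are assumed.
from datasets import load_dataset

THINKING_SETS = [
    "GeneralReasoning/GeneralThought-430K",
    "nvidia/OpenMathReasoning",
    "nvidia/OpenCodeReasoning",
]

subsets = {}
for name in THINKING_SETS:
    # Split names vary across these datasets; "train" is assumed here.
    ds = load_dataset(name, split="train")
    # Cap each reasoning dataset at 30,000 examples, per the card.
    subsets[name] = ds.shuffle(seed=42).select(range(min(30_000, len(ds))))
```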

## Results

### Observations

The model will sometimes skip the `<think>` block entirely and answer directly.

### Evals

TBD