---
base_model:
- jpacifico/bitnet-dpo-ties-retrained-mirror2
- jpacifico/bitnet-dpo-merged-modelstock2
- jpacifico/bitnet-dpo-merged-modelstock-retrain
- jpacifico/bitnet-dpo-merged-ties2
library_name: transformers
tags:
- mergekit
- merge
license: mit
datasets:
- jpacifico/french-orca-dpo-pairs-revised
- Intel/orca_dpo_pairs
language:
- en
- fr
---
# Model Summary
**Aramis-2B-BitNet** *(2.41B params / context length: 4096 tokens)*
A compact, agent-oriented small language model focused on contextual reasoning, language understanding and multi-turn instruction following.
Built with an iterative post-training recipe: bilingual DPO (FR+EN) + model merging of FR-centric and EN-centric variants.
Runs natively as BitNet 1.58-bit (ternary) and is also available as a 1.58-bit GGUF, converted losslessly from the BF16 checkpoint.
**Why BitNet (and why this model)**
- BitNet b1.58 uses ternary weights (−1, 0, +1) with abs-mean scaling: very low memory and energy use and strong CPU/edge throughput compared with classic FP/INT SLMs (a toy quantization sketch follows this list). For more details on the underlying architecture and efficiency of BitNet, please refer to the official Microsoft Research publication: [BitNet b1.58 2B4T Technical Report](https://arxiv.org/abs/2504.12285)
- Aramis demonstrates that a 2B BitNet can deliver SOTA language understanding in its class without sacrificing efficiency.
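To make the ternary format concrete, here is a toy sketch of the abs-mean quantization described in the BitNet b1.58 report. It is purely illustrative (the function name and epsilon are mine), not the kernel used at inference time:
```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-8):
    """Toy abs-mean quantization: scale by mean |w|, then round
    and clip to the ternary set {-1, 0, +1}."""
    gamma = w.abs().mean()                          # abs-mean scaling factor
    w_q = (w / (gamma + eps)).round().clamp(-1, 1)  # ternary weights
    return w_q, gamma                               # dequantize as w_q * gamma

w = torch.randn(4, 8)                               # a toy full-precision weight matrix
w_q, gamma = absmean_ternary(w)
print(w_q.unique())                                 # subset of {-1., 0., 1.}
```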
**Model Variants**
- jpacifico/Aramis-2B-BitNet-bf16 (this repo): the full-precision BF16 weights, suitable for further training (a loading sketch follows this list)
- [jpacifico/Aramis-2B-BitNet-b1.58-i2s-GGUF](https://huggingface.co/jpacifico/Aramis-2B-BitNet-b1.58-i2s-GGUF): quantized 1.58-bit GGUF version, usable with [bitnet.cpp](https://github.com/microsoft/BitNet)
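A minimal sketch for loading the BF16 checkpoint with the standard `transformers` API is shown below. It assumes a `transformers` release with BitNet architecture support and that the repo ships a chat template; check the base model card for exact version requirements:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jpacifico/Aramis-2B-BitNet-bf16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Build a single-turn chat prompt and generate a short reply.
messages = [{"role": "user", "content": "Explain BitNet ternary weights in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```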
---
# Training Recipe
Base model: [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16)
Post-Training Goal: agent-oriented behavior → better instruction following, contextual disambiguation, and pragmatic reasoning in multi-turn settings.
Iterative DPO + model merging:
- Bilingual DPO (FR+EN) to sharpen preference selection across both languages, using the following datasets (a minimal training sketch follows this list):
[jpacifico/french-orca-dpo-pairs-revised](https://huggingface.co/datasets/jpacifico/french-orca-dpo-pairs-revised)
[Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
- Model merging (ModelStock and TIES methods, via [Mergekit](https://github.com/cg123/mergekit)) to combine the complementary strengths of bilingual models (FR-centric + EN-centric), improving robustness across reasoning and comprehension tasks while maintaining stability.
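For orientation, the sketch below shows the general shape of such a bilingual DPO pass with TRL. Dataset column names, hyperparameters, and the `processing_class` argument are assumptions on my part, not the actual values or code used to train Aramis:
```python
import torch
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "microsoft/bitnet-b1.58-2B-4T-bf16"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Bilingual preference data: French revised Orca pairs + English Orca pairs.
# Column names are assumptions: DPOTrainer expects "prompt"/"chosen"/"rejected",
# and concatenation assumes both datasets share the same schema.
fr = load_dataset("jpacifico/french-orca-dpo-pairs-revised", split="train")
en = load_dataset("Intel/orca_dpo_pairs", split="train")
pairs = concatenate_datasets([fr, en]).shuffle(seed=42)
pairs = pairs.rename_column("question", "prompt")

# Illustrative hyperparameters, not the values used for Aramis.
config = DPOConfig(
    output_dir="aramis-dpo-sketch",
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(model=model, args=config, train_dataset=pairs, processing_class=tokenizer)
trainer.train()
```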
---
# First benchmarks
**Interpretation:** Significant gains on language understanding and pragmatic reasoning (ARC-C/E, WinoGrande, BoolQ, HellaSwag, TriviaQA) with stability on other skills. Math/code are not the optimization target; GSM8K stays essentially stable relative to the bitnet-b1.58-2B-4T quantized baseline (58.38).
All scores are reported in comparison with the original [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16) model.
| Benchmark (metric) | microsoft/bitnet-b1.58-2B-4T-bf16 | jpacifico/Aramis-2B-BitNet-bf16|
|------------------------------------|-----------------------------------|--------------------------------|
| arc_challenge 0 shot | 47.95 | **51.62** |
| arc_easy 0 shot | 73.44 | **75.25** |
| hellaswag 0 shot | 68.27 | **68.52** |
| openbookqa 0 shot | **41.6** | 41.4 |
| boolq 0 shot | **79.39** | 79.33 |
| piqa 0 shot | **77.86** | 77.53 |
| winogrande 0 shot | 70.64 | **72.06** |
| ifeval 0 shot | 41.85 | **44.12** |
| triviaqa 0 shot | 11.95 | **15.06** |
| triviaqa 5 shot EM | 33.51 | 33.51 |
| truthfulqa_mc2 10 shot | 45.89 | **46.52** |
| gsm8k 4 shot EM | **62.4** | 59.67 |
| mmlu 5 shot acc | 52.96 | **53.39** |
| commonsense_qa 10 shot acc | **71.17** | 70.76 |
**ARC-Challenge (zero-shot):** 51.62 — the first reported score ≥ 50 for a 2B-class model (>1.5B, <2.5B), *based on publicly available results*.
| Model | arc_challenge (0 shot) |
|----------------------------------------------------|------------------------|
| Qwen/Qwen3-1.7B | 43 |
| ibm-granite/granite-3.3-2b-base                     | 44.54                  |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B           | 34.9                   |
| openbmb/MiniCPM-2B-dpo-bf16                         | 44.28                  |
| microsoft/bitnet-b1.58-2B-4T-bf16 (base model)      | 47.95                  |
| microsoft/bitnet-b1.58-2B-4T                        | 49.91                  |
| jpacifico/Aramis-2B-BitNet-bf16                     | **51.62**              |
### Reproducibility
All benchmark results reported here were obtained using [LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness).
The following example reproduces the **ARC-Challenge (0-shot)** evaluation for this model:
```bash
HF_ALLOW_CODE_EVAL=1 lm-eval --model hf \
--model_args pretrained=jpacifico/Aramis-2B-BitNet-bf16,dtype=bfloat16 \
--tasks arc_challenge \
--device cuda:0 --batch_size 8 \
--seed 42 \
--num_fewshot 0 \
--confirm_run_unsafe_code \
--trust_remote_code
```
- All results were computed with LM Eval Harness v0.4.9
- Minor variations in results may occur due to randomness (e.g. seeds) and evaluation settings such as batch size
- The same procedure was used to evaluate all tasks presented in the benchmark tables
---
# Usage with `bitnet.cpp`
You can run this model using my demo [Colab notebook](https://github.com/jpacifico/Aramis-BitNet/blob/main/Aramis_BitNet_inference_test.ipynb).
Please refer to the [bitnet.cpp](https://github.com/microsoft/BitNet) GitHub repository for detailed compilation steps, usage examples, and command-line options.
---
# Last checkpoint
### Merge Method
This model was merged with the [Model Stock](https://arxiv.org/abs/2403.19522) merge method, using [jpacifico/bitnet-dpo-merged-modelstock-retrain](https://huggingface.co/jpacifico/bitnet-dpo-merged-modelstock-retrain) as the base.
### Models Merged
The following models were included in the merge:
* [jpacifico/bitnet-dpo-ties-retrained-mirror2](https://huggingface.co/jpacifico/bitnet-dpo-ties-retrained-mirror2)
* [jpacifico/bitnet-dpo-merged-modelstock2](https://huggingface.co/jpacifico/bitnet-dpo-merged-modelstock2)
* [jpacifico/bitnet-dpo-merged-ties2](https://huggingface.co/jpacifico/bitnet-dpo-merged-ties2)
### Configuration
The following YAML configuration was used to produce this model:
```yaml
models:
- model: jpacifico/bitnet-dpo-merged-ties2
- model: jpacifico/bitnet-dpo-merged-modelstock2
- model: jpacifico/bitnet-dpo-ties-retrained-mirror2
- model: jpacifico/bitnet-dpo-merged-modelstock-retrain
merge_method: model_stock
base_model: jpacifico/bitnet-dpo-merged-modelstock-retrain
parameters:
normalize: true
dtype: bfloat16
tokenizer_source: jpacifico/bitnet-dpo-merged-modelstock-retrain
```
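Such a configuration is typically applied with Mergekit's `mergekit-yaml` entrypoint (e.g. `mergekit-yaml config.yaml ./merged-model`); refer to the Mergekit repository for the exact options.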
---
# Limitations
- Not tuned for coding or formal math; prefer specialized variants if those capabilities are critical.
- No explicit chain-of-thought training; improvements come from bilingual DPO + merging.
**Disclaimer**
This model is intended for research and development purposes only and should not be used in commercial or real-world applications without further testing. While the Microsoft Research team has applied SFT and DPO to align the BitNet base model, it may still produce unexpected, biased, or inaccurate outputs. Please use responsibly.
---
- **Developed by:** Jonathan Pacifico, 2025
- **Model type:** LLM
- **Language(s) (NLP):** French, English
- **License:** MIT
Made with ❤️ in France