---
license: apache-2.0
language:
- ru
- en
base_model:
- Qwen/Qwen2.5-7B
---

This is an instruction-following model (based on the Qwen2.5-7B base) optimized for the Russian language.

The model was trained in two phases: SFT (with a training data composition similar to kolibri-mistral-0427) and RLHF.

The current RLHF pipeline leads to degradation on IFEval, but the overall 'vibe' of the model improves significantly.
I am currently investigating the causes of this degradation and exploring methods to further improve instruction-following capabilities.

The model uses the ChatML template. Adding a system prompt will likely improve the model's performance on your tasks (experiment with it).
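
Below is a minimal inference sketch using 🤗 Transformers. The repo id is a placeholder and the system prompt is only an example; adjust both for your setup.

```python
# Minimal inference sketch (assumes a recent transformers release).
# The repo id below is a placeholder -- replace it with the actual
# path of this model on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-namespace/kolibri-qwen2.5-7b-060225-rlhf-1"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A system prompt is optional but tends to help -- experiment with it.
messages = [
    {"role": "system", "content": "Ты — полезный ассистент. Отвечай на русском."},
    # User message: "Explain gradient descent in three sentences."
    {"role": "user", "content": "Объясни градиентный спуск в трёх предложениях."},
]

# apply_chat_template renders the messages with the ChatML template
# stored in the tokenizer config.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```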

## Instruction following evals
The model was tested using the following benchmarks:
- [ruIFEval](https://github.com/NLP-Core-Team/ruIFEval)
- [ifeval](https://github.com/google-research/google-research/tree/master/instruction_following_eval)

| Eval name                   | Strict Acc., % | Loose Acc., % |
|-----------------------------|----------------|---------------|
| Avg.                        | *43.00*        | *49.17*       |
| ifeval-prompt-level         | 38.63          | 46.21         |
| ifeval-instruction-level    | 51.20          | 57.50         |
| ru-ifeval-prompt-level      | 35.30          | 40.48         |
| ru-ifeval-instruction-level | 46.88          | 52.52         |
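
For reference: prompt-level accuracy counts a prompt as correct only if *every* instruction in it is followed, while instruction-level accuracy scores each instruction independently. A short sketch of the two aggregations, assuming per-instruction boolean results are already available (the actual harness additionally distinguishes strict vs. loose response parsing):

```python
# Sketch of the two IFEval aggregation levels. `results` maps each prompt
# to a list of booleans (one per instruction, True = followed).
from typing import Dict, List, Tuple

def ifeval_accuracies(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    # Prompt-level: a prompt counts only if all of its instructions pass.
    prompt_level = sum(all(flags) for flags in results.values()) / len(results)
    # Instruction-level: every instruction is scored independently.
    total = sum(len(flags) for flags in results.values())
    followed = sum(sum(flags) for flags in results.values())
    return prompt_level, followed / total

example = {
    "p1": [True, True],          # both instructions followed
    "p2": [True, False, True],   # one instruction violated
}
print(ifeval_accuracies(example))  # (0.5, 0.8)
```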

## Russian LLM Arena (proxy eval via JINA)

The table below approximates [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb)
scores using the [JINA Judge model](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-041024).
Take it with a grain of salt.

| Model Name                                       | Score  | 95% CI              | Avg Tokens |
|--------------------------------------------------|--------|---------------------|------------|
| gpt-4-1106-preview                               | 82.8   | (-2.8, 2.6)        | 541        |
| gpt-4o-mini                                      | 75.3   | (-2.2, 2.8)        | 448        |
| qwen-2.5-72b-it                                  | 73.1   | (-3.0, 3.1)        | 557        |
| gemma-2-9b-it-sppo-iter3                         | 70.6   | (-3.7, 3.0)        | 509        |
| gemma-2-27b-it                                   | 68.7   | (-2.9, 3.8)        | 472        |
| t-lite-instruct-0.1                              | 67.5   | (-4.2, 2.7)        | 810        |
| gemma-2-9b-it                                    | 67.0   | (-3.0, 3.8)        | 459        |
| suzume-llama-3-8B-multilingual-orpo-borda-half   | 62.4   | (-3.0, 3.3)        | 682        |
| glm-4-9b-chat                                    | 61.5   | (-3.9, 3.3)        | 568        |
| phi-3-medium-4k-instruct                         | 60.4   | (-3.8, 3.6)        | 566        |
| sfr-iterative-dpo-llama-3-8b-r                   | 57.2   | (-3.8, 4.0)        | 516        |
| **kolibri-qwen2.5-7b-060225-rlhf-1**             | 55.4   | (-3.1, 4.4)        | 383        |
| c4ai-command-r-v01                               | 55.0   | (-3.7, 4.4)        | 529        |
| suzume-llama-3-8b-multilingual                   | 51.9   | (-3.1, 3.4)        | 641        |
| mistral-nemo-instruct-2407                       | 51.9   | (-3.0, 3.0)        | 403        |
| yandex_gpt_pro                                   | 50.3   | (-3.5, 3.0)        | 345        |
| gpt-3.5-turbo-0125                               | 50.0   | (0.0, 0.0)         | 220        |
| hermes-2-theta-llama-3-8b                        | 49.3   | (-3.2, 3.7)        | 485        |
| starling-lm-7b-beta                              | 48.3   | (-3.7, 3.9)        | 629        |
| llama-3-8b-saiga-suzume-ties                     | 47.9   | (-3.9, 5.0)        | 763        |
| llama-3-smaug-8b                                 | 47.6   | (-4.3, 2.9)        | 524        |
| **vikhr-it-5.4-fp16-orpo-v2**                    | 46.8   | (-2.4, 2.2)        | 379        |
| aya-23-8b                                        | 46.1   | (-3.3, 3.6)        | 554        |
| **saiga_llama3_8b_v6**                           | 44.8   | (-2.9, 3.2)        | 471        |
| qwen2-7b-instruct                                | 43.6   | (-3.5, 3.0)        | 340        |
| vikhr-it-5.2-fp16-cp                             | 43.6   | (-3.6, 3.3)        | 543        |
| openchat-3.5-0106                                | 42.8   | (-2.5, 3.8)        | 492        |
| **kolibri-mistral-0427-upd**                     | 42.3   | (-4.1, 4.0)        | 551        |
| paralex-llama-3-8b-sft                           | 41.8   | (-3.7, 3.9)        | 688        |
| llama-3-instruct-8b-sppo-iter3                   | 41.7   | (-4.0, 3.6)        | 502        |
| gpt-3.5-turbo-1106                               | 41.5   | (-2.7, 2.5)        | 191        |
| mistral-7b-instruct-v0.3                         | 41.1   | (-4.1, 2.9)        | 469        |
| gigachat_pro                                     | 40.9   | (-3.2, 2.8)        | 294        |
| openchat-3.6-8b-20240522                         | 39.1   | (-2.9, 3.8)        | 428        |
| vikhr-it-5.3-fp16-32k                            | 38.8   | (-3.2, 3.3)        | 519        |
| hermes-2-pro-llama-3-8b                          | 38.4   | (-3.9, 3.9)        | 463        |
| kolibri-vikhr-mistral-0427                       | 34.5   | (-2.9, 3.1)        | 489        |
| vikhr-it-5.3-fp16                                | 33.5   | (-3.0, 3.8)        | 523        |
| llama-3-instruct-8b-simpo                        | 32.7   | (-3.2, 2.7)        | 417        |
| meta-llama-3-8b-instruct                         | 32.1   | (-3.6, 4.2)        | 450        |
| neural-chat-7b-v3-3                              | 25.9   | (-3.1, 3.2)        | 927        |
| gigachat_lite                                    | 25.4   | (-3.5, 2.7)        | 276        |
| snorkel-mistral-pairrm-dpo                       | 10.3   | (-2.3, 2.6)        | 773        |
| storm-7b                                         | 3.7    | (-1.9, 1.7)        | 419        |
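
For context, arena-style scores of this kind are essentially judged win rates against a fixed baseline (here gpt-3.5-turbo-0125, pinned at 50.0 with a (0.0, 0.0) interval), with the 95% CI obtained by bootstrapping over prompts. A rough sketch of that aggregation is below; the real leaderboard uses a Bradley-Terry-style fit, and the verdict encoding here is an assumption, not the JINA judge's actual output format.

```python
# Simplified aggregation sketch: score = judged win rate vs. the baseline
# (ties count as 0.5), with a percentile-bootstrap 95% CI over prompts,
# reported as offsets from the point estimate as in the table above.
import random
from typing import List, Tuple

def arena_score(verdicts: List[float], n_boot: int = 2000,
                seed: int = 0) -> Tuple[float, Tuple[float, float]]:
    rng = random.Random(seed)
    point = 100.0 * sum(verdicts) / len(verdicts)
    boots = sorted(
        100.0 * sum(s) / len(s)
        for s in (rng.choices(verdicts, k=len(verdicts)) for _ in range(n_boot))
    )
    lo = boots[int(0.025 * n_boot)] - point
    hi = boots[int(0.975 * n_boot) - 1] - point
    return round(point, 1), (round(lo, 1), round(hi, 1))

# e.g. 60 wins, 10 ties, 30 losses out of 100 judged prompts
verdicts = [1.0] * 60 + [0.5] * 10 + [0.0] * 30
print(arena_score(verdicts))  # roughly (65.0, (-9.5, 9.0))
```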