---
license: apache-2.0
base_model:
- DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-12B-DARKNESS
base_model_relation: quantized
pipeline_tag: text2text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Elastic model: MN-GRAND-Gutenberg-Lyra4-Lyra-12B-DARKNESS. The fastest and most flexible models for self-hosting.

Elastic models are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA lets you control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation of less than 1.5%.

* __S__: The fastest model, with accuracy degradation of less than 2%.


__Goals of elastic models:__

* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code change
* Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
* Provide the best models and service for self-hosting.

> Note that the actual quality degradation varies from model to model. For instance, an S model may show as little as 0.5% degradation.
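
In practice, moving along that quality/speed slider is a single argument at load time. A minimal sketch (see the Inference section below for the full example, including the required HF token and dtype arguments):

```python
from elastic_models.transformers import AutoModelForCausalLM

# Pick the speed/quality trade-off: 'XL' (lossless), 'L', 'M' or 'S' (fastest).
model = AutoModelForCausalLM.from_pretrained(
    "DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-12B-DARKNESS",
    mode="S",
)
```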


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/rpnDEjE8wLtFg__eBJtd3.png)

-----

## Inference

> Compiled versions are currently available only for batch sizes 1, 2 and 4; other batch sizes are not yet supported. Stay tuned for updates!

To run inference with our models, simply replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Currently we require your HF token, since we use the original
# weights for some layers as well as the original model configuration
model_name = "DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-12B-DARKNESS"
hf_token = ''
device = torch.device("cuda")

# Create tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs = inputs.to(device)
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the question and answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
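
Since compiled versions also cover batch sizes 2 and 4, you can generate for several prompts at once. A minimal sketch, assuming the `tokenizer`, `model` and `device` from the example above (left padding is needed for batched generation with decoder-only models):

```python
# Hypothetical batch of two prompts (batch size 2 is a supported compiled size)
prompts = [
    "Describe basics of DNNs quantization.",
    "What is post-training quantization?",
]

tokenizer.pad_token = tokenizer.eos_token  # set a pad token if the tokenizer has none
tokenizer.padding_side = "left"            # pad on the left so generation continues the prompt

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
batch.pop("token_type_ids", None)  # as in the example above
with torch.inference_mode():
    out = model.generate(**batch, max_new_tokens=200)

for answer in tokenizer.batch_decode(
    out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True
):
    print(answer)
```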

__System requirements:__
* GPUs: Nvidia GeForce RTX 4090, Nvidia GeForce RTX 5090
* CPU: AMD, Intel
* Python: 3.10-3.12


To work with our models, run these lines in your terminal:

```shell
pip install thestage
pip install elastic_models[nvidia]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple
pip install flash_attn==2.7.3 --no-build-isolation

# or for blackwell support
pip install elastic_models[blackwell]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# please download the appropriate version of Wheels for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers, using the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!
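
For intuition, the `W8A8, int8` baseline corresponds to uniformly quantizing every linear layer's weights and activations to 8-bit integers. A minimal, illustrative sketch of symmetric W8A8 quantization (not TheStage AI's actual kernels, which additionally use calibration data and ANNA's per-layer sensitivity analysis):

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|x|, max|x|] onto [-127, 127].
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Quantize activations (A8) and weights (W8), multiply in the integer
    # domain (emulated in float here for portability), then rescale.
    xq, sx = quantize_int8(x)
    wq, sw = quantize_int8(weight)
    return (xq @ wq.t()) * (sx * sw)

x, w = torch.randn(4, 64), torch.randn(32, 64)
print((w8a8_linear(x, w) - x @ w.t()).abs().max())  # quantization error vs fp32
```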

### Quality benchmarks

| Metric/Model  | S | M | L | XL | Original | W8A8, int8 |
|---------------|---|---|---|----|----------|------------|
| arc_challenge | 56.20 | 55.88 | 56.57 | 57.80 | 57.80 | 53.10 |
| mmlu | 65.60 | 66.74 | 67.01 | 66.80 | 66.80 | 62.40 |
| piqa | 80.60 | 81.28 | 81.12 | 81.30 | 81.30 | 79.00 |
| winogrande | 74.40 | 74.27 | 75.61 | 76.00 | 76.00 | 71.00 |



* **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
* **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
* **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
* **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.


### Performance by Context Size

The tables below show performance (tokens per second) for different input context sizes across different GPU models and batch sizes:

> **Note:** Dash marks (`-`) in the table indicate that the data did not fit on the device.
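
The exact benchmarking harness is not published; below is a minimal sketch of how such numbers can be reproduced (generate a fixed 100 new tokens, as in the tables, and divide by wall-clock time):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt, device, new_tokens=100):
    # Hypothetical helper: measures decode throughput for one prompt.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        out = model.generate(
            **inputs,
            max_new_tokens=new_tokens,
            min_new_tokens=new_tokens,  # force exactly `new_tokens` tokens
            do_sample=False,
        )
    torch.cuda.synchronize()
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    return generated / (time.perf_counter() - start)
```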

**RTX 4090:**

*Batch Size 1:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 64.4 | 55.4 | - | - | 34.2 |
| Medium | 1024 | 63.7 | 54.9 | - | - | - |
| Large | 4096 | 61.0 | 52.9 | - | - | - |

*Batch Size 2:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 63.6 | 54.9 | - | - | 32.2 |
| Medium | 1024 | 62.5 | 54.0 | - | - | - |
| Large | 4096 | 58.2 | - | - | - | - |

*Batch Size 4:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 62.4 | 53.9 | - | - | - |
| Medium | 1024 | 60.0 | 52.1 | - | - | - |
| Large | 4096 | 52.5 | - | - | - | - |


**RTX 5090:**

*Batch Size 1:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 100.2 | 88.8 | 81.3 | - | 48.7 |
| Medium | 1024 | 99.4 | 88.3 | 80.7 | - | 47.2 |
| Large | 4096 | 94.9 | 84.6 | 77.7 | - | 41.1 |

*Batch Size 2:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 99.6 | 88.4 | 80.7 | - | 44.8 |
| Medium | 1024 | 97.9 | 86.8 | 79.4 | - | 41.8 |
| Large | 4096 | 92.3 | 82.3 | 75.6 | - | 33.2 |

*Batch Size 4:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 97.4 | 86.6 | 79.0 | - | 43.1 |
| Medium | 1024 | 94.7 | 84.1 | 77.0 | - | 38.2 |
| Large | 4096 | 81.1 | 73.3 | 67.8 | - | 24.5 |




*Note: Results show tokens per second (TPS) for text generation with 100 new output tokens. Performance varies with GPU model, context size, and batch size.*


## Links

* Platform: [app.thestage.ai](https://app.thestage.ai/)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Contact email__: [email protected]