psynote123 committed · Commit e571264 · verified · 1 Parent(s): 8689c91

Upload README.md with huggingface_hub

Files changed (1): README.md (+242 -3)
---
license: apache-2.0
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
base_model_relation: quantized
pipeline_tag: text2text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Elastic model: Mistral-Small-3.1-24B-Instruct-2503. Fastest and most flexible models for self-hosting.

Elastic models are models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation of less than 1.5%.

* __S__: The fastest model, with accuracy degradation of less than 2%.


__Goals of elastic models:__

* Provide flexibility in cost vs. quality selection for inference
* Provide clear quality and latency benchmarks
* Provide an interface to HF libraries (transformers and diffusers) with a single line of code
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting

> It's important to note that the actual quality degradation can vary from model to model. For instance, an S model can also show as little as 0.5% degradation.

-----

## Inference

To run inference with our models, you just need to replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Currently we require your HF token, as we use the original weights
# for part of the layers and the model configuration as well
model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe the basics of DNN quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs = inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the question and answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
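
The `mode` argument in `from_pretrained` is what selects the elastic variant. Assuming the other sizes use the same string names as in the list above (XL, L, M, S), switching to a different speed/quality point is a one-line change; a minimal sketch reusing the variables defined above:

```python
# Same call as before; only the elastic mode changes ('XL', 'L', 'M' or 'S').
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='L',  # near-lossless variant, <1% degradation per the list above
).to(device)
```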

__System requirements:__
* GPUs: H100, L40S
* CPU: AMD, Intel
* Python: 3.10-3.12

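Before installing, you can quickly confirm that your environment matches these requirements; a minimal sketch using only standard `sys` and `torch` APIs (it assumes PyTorch is already installed):

```python
import sys
import torch

# Python 3.10-3.12 is supported per the requirements above
assert (3, 10) <= sys.version_info[:2] <= (3, 12), "Unsupported Python version"

# A supported NVIDIA GPU (H100 or L40S) must be visible to PyTorch
assert torch.cuda.is_available(), "No CUDA device found"
print("GPU:", torch.cuda.get_device_name(0))
```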

To work with our models, just run these lines in your terminal:

```shell
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token on your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically the same speed but much higher quality, because ANNA knows how to improve quantization quality on sensitive layers!
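
For context, below is a generic sketch of the kind of W8A8 int8 scheme the baseline column refers to: symmetric int8 quantization of both the weights and the activations of a linear layer. This is an illustration only, not TheStage's actual calibration pipeline or ANNA itself; all names are ours, and the per-channel/per-tensor choices are assumptions:

```python
import torch

def w8a8_linear_reference(weight: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Illustrative symmetric int8 quantization of one linear layer (W8A8)."""
    # Per-output-channel weight scales; weight has shape (out_features, in_features)
    w_scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = (weight / w_scale).round().clamp(-127, 127)

    # Per-tensor activation scale; in practice it comes from calibration data
    x_scale = x.abs().amax() / 127.0
    x_int8 = (x / x_scale).round().clamp(-127, 127)

    # int8 GEMM (emulated in float here for portability), then dequantize
    return (x_int8 @ w_int8.t()) * (x_scale * w_scale.t())

# Compare against the full-precision result for a random layer
w, x = torch.randn(64, 128), torch.randn(4, 128)
print((w8a8_linear_reference(w, x) - x @ w.t()).abs().max())
```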

### Quality benchmarks

| Metric/Model  | S | M | L | XL | Original | W8A8, int8 |
|---------------|---|---|---|----|----------|------------|
| arc_challenge | 65.30 | 66.30 | 66.70 | 66.80 | 66.80 | 65.30 |
| gsm8k | 87.70 | 87.80 | 88.00 | - | - | 87.70 |
| mmlu | 79.00 | 79.40 | 79.70 | 80.20 | 80.20 | 79.00 |
| piqa | 82.90 | 83.10 | 82.60 | 83.00 | 83.00 | 82.90 |
| winogrande | 78.20 | 79.40 | 79.30 | 79.50 | 79.50 | 78.20 |


* **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
* **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
* **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
* **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's capability to understand context and resolve ambiguity.
* **GSM8K**: Evaluates grade-school math word problems requiring multi-step arithmetic reasoning. Shows the model's ability to carry out basic mathematical reasoning.

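These metrics correspond to standard tasks from common evaluation harnesses. A sketch of how one might reproduce comparable numbers with lm-evaluation-harness follows; the choice of tooling is our assumption, and the exact few-shot settings and prompts used for this card are not specified here:

```python
# pip install lm_eval
import lm_eval

# Hypothetical reproduction sketch for the original checkpoint; few-shot
# settings and prompt formats may differ from those used for this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-Small-3.1-24B-Instruct-2503,dtype=bfloat16",
    tasks=["arc_challenge", "gsm8k", "mmlu", "piqa", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```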

### Latency benchmarks

__100 input / 300 output tokens; tok/s:__

| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|-----------|---|---|---|----|----------|------------|
| H100 | 90 | - | - | - | - | - |
| L40S | - | - | - | - | - | - |


### Performance by Context Size

The tables below show performance (tokens per second) for different input context sizes across different GPU models and batch sizes:

**H100:**

*Batch Size 1:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|--------------|---|---|---|----|----------|
| Small | 93 | 90.3 | - | - | - | - |
| Medium | 1024 | 89.6 | - | - | - | - |
| Large | 4096 | 87.5 | - | - | - | - |

*Batch Size 8:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|--------------|---|---|---|----|----------|
| Small | 93 | 87.3 | - | - | - | - |
| Medium | 1024 | 79.9 | - | - | - | - |
| Large | 4096 | 63.2 | - | - | - | - |

*Batch Size 16:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|--------------|---|---|---|----|----------|
| Small | 93 | 85.8 | - | - | - | - |
| Medium | 1024 | 79.0 | - | - | - | - |
| Large | 4096 | 62.2 | - | - | - | - |


**L40S:**

*Batch Size 1:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|--------------|---|---|---|----|----------|
| Small | 93 | - | - | - | - | - |
| Medium | 1024 | - | - | - | - | - |
| Large | 4096 | - | - | - | - | - |

*Batch Size 8:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|--------------|---|---|---|----|----------|
| Small | 93 | - | - | - | - | - |
| Medium | 1024 | - | - | - | - | - |
| Large | 4096 | - | - | - | - | - |

*Batch Size 16:*

| Context | Input Tokens | S | M | L | XL | Original |
|---------|--------------|---|---|---|----|----------|
| Small | 93 | - | - | - | - | - |
| Medium | 1024 | - | - | - | - | - |
| Large | 4096 | - | - | - | - | - |


*Note: Results show tokens per second (TPS) for text generation with 100 new tokens output. Performance varies based on GPU model, context size, and batch size.*
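
For reference, below is a rough sketch of how a tokens-per-second figure like those above can be measured with the model loaded in the Inference section. It is illustrative only and is not claimed to match the exact benchmarking harness behind these tables:

```python
import time
import torch

def measure_tps(model, tokenizer, device, input_tokens=1024, new_tokens=100):
    """Rough decode throughput: generated tokens / wall-clock seconds."""
    # Synthetic prompt of roughly `input_tokens` tokens (content does not matter here)
    prompt_ids = torch.randint(
        0, tokenizer.vocab_size, (1, input_tokens), device=device
    )
    with torch.inference_mode():
        # Warm-up run to exclude one-time allocation costs
        model.generate(prompt_ids, max_new_tokens=8, do_sample=False,
                       pad_token_id=tokenizer.eos_token_id)
        torch.cuda.synchronize()
        start = time.perf_counter()
        out = model.generate(prompt_ids, max_new_tokens=new_tokens,
                             min_new_tokens=new_tokens, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (out.shape[1] - prompt_ids.shape[1]) / elapsed

# Example: print(f"{measure_tps(model, tokenizer, device):.1f} tok/s")
```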

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai/)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Contact email__: [email protected]