---
license: apache-2.0
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
base_model_relation: quantized
pipeline_tag: text2text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---
# Elastic model: Mistral-Small-3.1-24B-Instruct-2503. The fastest and most flexible models for self-hosting.
Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:
* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
* __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
* __M__: Faster model, with accuracy degradation less than 1.5%.
* __S__: The fastest model, with accuracy degradation less than 2%.
__Goals of elastic models:__
* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks
* Provide a single-line-of-code interface to HF libraries: transformers and diffusers
* Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
* Provide the best models and service for self-hosting.
> Note that the actual quality degradation varies from model to model; an S model, for instance, may show as little as 0.5% degradation.
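
In practice, the only thing that changes between the four variants is the `mode` argument passed to `from_pretrained` (the full example is in the Inference section below). A minimal sketch, assuming the other tiers use the same letter codes as listed above:

```python
from elastic_models.transformers import AutoModelForCausalLM

# Pick a point on the speed/quality slider: 'XL', 'L', 'M' or 'S'
# (token, dtype and device setup omitted; see the Inference section)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    mode="S",
)
```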

-----
## Inference
> Compiled versions are currently available only for batch sizes 1, 8 and 16. Other versions are not yet accessible. Stay tuned for updates!
To infer our models, you just need to replace the `transformers` import with `elastic_models.transformers`:
```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Currently we require your HF token, since the original weights are used
# for part of the layers as well as for the model configuration.
model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]
chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Validate the answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
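
Since compiled versions currently ship for batch sizes 1, 8 and 16 (see the note above), batched generation works the same way through the standard `transformers` API. Below is a minimal batch-size-8 sketch that reuses `model`, `tokenizer` and `device` from the example above; the left-padding setup is our assumption, not part of the official example:

```python
# Batched generation (batch size 8), reusing `model`, `tokenizer`, `device`.
# Left padding keeps the generated tokens contiguous at the end of each row.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = [f"Give me fact #{i} about DNN quantization." for i in range(8)]
chats = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        add_generation_prompt=True,
        tokenize=False,
    )
    for p in prompts
]

batch = tokenizer(chats, return_tensors="pt", padding=True).to(device)
with torch.inference_mode():
    out = model.generate(**batch, max_new_tokens=100)

answers = tokenizer.batch_decode(
    out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True
)
for q, a in zip(prompts, answers):
    print(f"# Q:\n{q}\n# A:\n{a}\n")
```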
__System requirements:__
* GPUs: H100, L40s
* CPU: AMD, Intel
* Python: 3.10-3.12
To work with our models, just run these lines in your terminal:
```shell
pip install thestage
pip install 'elastic_models[nvidia]' \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall -y apex
```
Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:
```shell
thestage config set --api-token <YOUR_API_TOKEN>
```
Congrats, now you can use accelerated models!
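
As a quick sanity check that the installation worked, the drop-in import from the example above should resolve and CUDA should be visible (this snippet is only an illustration, assuming the packages above installed correctly):

```python
# Minimal installation check for the elastic_models transformers interface
import torch
from elastic_models.transformers import AutoModelForCausalLM

print("CUDA available:", torch.cuda.is_available())  # expect True on H100 / L40S
print("Elastic drop-in class:", AutoModelForCausalLM)
```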
----
## Benchmarks
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!
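
For reference, W8A8 means that both the weights (W) and the activations (A) of the linear layers are kept in 8-bit integers for the matrix multiplication. Below is a minimal sketch of the idea using plain symmetric per-tensor quantization; the actual baseline and ANNA recipes (calibration, per-channel scales, sensitive-layer handling) may differ:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns int8 values and a float scale."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """W8A8 linear: quantize weight and activation to int8, matmul, dequantize."""
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(weight)
    # Real kernels run the matmul in int8 with int32 accumulation;
    # emulated here in float for portability.
    acc = qx.float() @ qw.float().t()
    return acc * (sx * sw)

x = torch.randn(4, 64)
w = torch.randn(128, 64)
err = (torch.nn.functional.linear(x, w) - w8a8_linear(x, w)).abs().max()
print(f"max abs error vs. fp32 linear: {err:.4f}")
```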
### Quality benchmarks
| Metric/Model | S | M | L | XL | Original | W8A8, int8 |
|---------------|---|---|---|----|----------|------------|
| arc_challenge | 65.30 | 66.30 | 66.70 | 66.80 | 66.80 | 51.10 |
| gsm8k | 87.70 | 88.40 | 87.70 | 88.86 | 88.86 | 13.49 |
| mmlu | 79.00 | 79.40 | 79.70 | 80.20 | 80.20 | 60.45 |
| piqa | 82.90 | 83.10 | 82.60 | 83.00 | 83.00 | 75.35 |
| winogrande | 78.20 | 79.40 | 79.30 | 79.50 | 79.50 | 71.19 |
* **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
* **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
* **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
* **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.
* **GSM8K**: GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems.
### Latency benchmarks
#### Performance by Context Size
The tables below show performance (tokens per second) for different input context sizes across different GPU models and batch sizes:
**H100:**
*Batch Size 1:*
| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 90.3 | 82.5 | 72.2 | 54.4 | 41.2 |
| Medium | 1024 | 90.1 | 82.2 | 71.8 | - | 38.8 |
| Large | 4096 | 88.2 | 81.0 | 70.4 | - | 33.8 |
*Batch Size 8:*
| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 86.5 | 79.9 | 69.1 | - | 36.7 |
| Medium | 1024 | 80.3 | 74.9 | 65.1 | - | 29.0 |
| Large | 4096 | 63.3 | 59.5 | 53.1 | - | 15.5 |
*Batch Size 16:*
| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 84.7 | 78.1 | 68.0 | - | 32.2 |
| Medium | 1024 | 79.8 | 73.3 | 64.1 | - | 21.8 |
| Large | 4096 | 62.5 | 58.1 | 52.7 | - | 9.7 |
**L40S:**
*Batch Size 1:*
| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 26.0 | 24.0 | 21.0 | - | - |
| Medium | 1024 | 25.8 | 23.8 | 20.9 | - | - |
| Large | 4096 | 25.1 | 23.3 | 20.5 | - | - |
*Batch Size 8:*
| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 25.2 | 23.2 | 20.4 | - | - |
| Medium | 1024 | 24.3 | 22.4 | 19.8 | - | - |
| Large | 4096 | - | - | - | - | - |
*Batch Size 16:*
| Context | Input Tokens | S | M | L | XL | Original |
|---------|-------------|---|---|---|----|---------|
| Small | 256 | 24.5 | 22.6 | 19.9 | - | - |
| Medium | 1024 | 22.8 | 20.9 | - | - | - |
| Large | 4096 | - | - | - | - | - |
*Note: Results show tokens per second (TPS) for text generation with 100 new tokens output. Performance varies based on GPU model, context size, and batch size.*
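
For context, the following is a minimal sketch of how a tokens-per-second number like those above can be measured (our own illustration, reusing `model`, `tokenizer` and `device` from the Inference section; the official benchmarking setup may differ):

```python
import time
import torch

def measure_tps(model, tokenizer, device, context_len=1024, new_tokens=100):
    """Rough decode throughput: generated tokens / wall-clock seconds."""
    # Synthetic prompt of roughly `context_len` tokens
    prompt_ids = torch.randint(
        low=10, high=tokenizer.vocab_size, size=(1, context_len), device=device
    )
    attn = torch.ones_like(prompt_ids)

    with torch.inference_mode():
        # Warm-up run, then a timed run generating `new_tokens` tokens
        model.generate(input_ids=prompt_ids, attention_mask=attn,
                       max_new_tokens=8, min_new_tokens=8, do_sample=False)
        torch.cuda.synchronize()
        start = time.perf_counter()
        out = model.generate(input_ids=prompt_ids, attention_mask=attn,
                             max_new_tokens=new_tokens, min_new_tokens=new_tokens,
                             do_sample=False)
        torch.cuda.synchronize()

    generated = out.shape[1] - prompt_ids.shape[1]
    return generated / (time.perf_counter() - start)

print(f"TPS: {measure_tps(model, tokenizer, device):.1f}")
```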
## Links
* Platform: [app.thestage.ai](https://app.thestage.ai/)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Contact email__: [email protected]