---
library_name: transformers
license: apache-2.0
---

# Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF layers, 15 dense FF layers

## Model Details

### Model Description

This model is an MoE layer-pruning experiment derived from Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations.
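
Prompts follow the standard ChatML layout (the same format used in the generation example under Uses). A minimal sketch of building such a prompt by hand, where the system message and wording are only illustrative placeholders:

```python
# Minimal ChatML prompt construction; the system turn is optional and the
# messages here are illustrative placeholders.
system_msg = "You are a helpful assistant."
user_msg = "How are you? Write a story for me please"
prompt = (
    f"<|im_start|>system\n{system_msg}<|im_end|>\n"
    f"<|im_start|>user\n{user_msg}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```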

15 of the MoE layers are merged into plain feed-forward layers (17 of the 32 layers remain MoE), which reduces the total parameter count from 47B to 14B.

The indices of the pruned layers are as follows:

```
[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
```
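
To double-check which decoder layers kept their MoE block in the loaded model, you can print the class of each layer's feed-forward module. This is only a sketch: it assumes `model` was loaded as shown under Uses and that MoE layers still expose Mixtral's `block_sparse_moe` attribute, while the merged dense layers may use a different attribute name (guessed here as `mlp`).

```python
# Sketch: report the feed-forward block type of every decoder layer.
# Assumes `model` is the loaded CustomMixtralForCausalLM from the Uses example;
# the attribute names follow the stock Mixtral implementation and may differ
# in the custom modeling code.
for idx, layer in enumerate(model.model.layers):
    ff = getattr(layer, "block_sparse_moe", None) or getattr(layer, "mlp", None)
    print(f"layer {idx:2d}: {type(ff).__name__}")
```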

- **Developed by:** MistralAI, NousResearch, theblackcat
- **Model type:** Modified Mixtral architecture for dynamic MoE
- **License:** apache-2.0

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]

## Uses

This model is still at the experimental stage; we are still looking for the sweet spot that runs in just under 24 GB of memory with a 4-bit quantization config.

```python
import torch
from transformers import AutoTokenizer

# `model_path` is a local checkout of this repository or its Hugging Face repo id;
# `CustomMixtralForCausalLM` comes from the custom modeling code shipped with the repo.
model_path = "path/to/Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    trust_remote_code=True,
)

# Parameter count in billions (~14B after pruning, down from 47B).
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

# ChatML-formatted prompt, as expected by the Nous-Hermes-2 chat models.
max_length = 100
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|><|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))

output = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
print(tokenizer.decode(output[0]))
```
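
The `load_in_4bit=True` shortcut above goes through bitsandbytes with its default settings. If you prefer spelling out the 4-bit quantization config explicitly (here using NF4 with bfloat16 compute rather than the bitsandbytes defaults), a sketch assuming a recent `transformers` that provides `BitsAndBytesConfig` looks like this:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, aimed at fitting under ~24 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```

After generation, `torch.cuda.max_memory_allocated()` gives a quick read on whether the model actually stays under the 24 GB budget.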