---
library_name: transformers
license: apache-2.0
---

# Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF layers, 15 dense FF layers

## Model Details

### Model Description

This is a MoE layer-pruning experiment derived from Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations.
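
For reference, a ChatML prompt looks like this (the system turn is optional):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
```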

In 15 of the 32 layers, the MoE block is merged into a single dense feed-forward layer (17/32 layers remain MoE), reducing the total parameter count from 47B to 14B.

The indices of the pruned layers are as follows:

```
[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
```
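
How the merge is performed is not documented in this card; the following is a minimal sketch of one plausible approach, assuming the stock `transformers` Mixtral modules and a simple uniform average of the eight experts' weights. The actual recipe used for this checkpoint may differ.

```python
# Hypothetical sketch only: collapse selected MoE blocks into dense FF layers by
# uniformly averaging the experts' weights. The real checkpoint may have been
# built differently (e.g. usage-weighted averaging or additional fine-tuning).
import torch
import torch.nn as nn
from transformers import MixtralForCausalLM

PRUNED_LAYERS = [3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]


class DenseFFN(nn.Module):
    """Drop-in replacement for a Mixtral MoE block that wraps a single expert MLP."""

    def __init__(self, expert):
        super().__init__()
        self.ffn = expert

    def forward(self, hidden_states):
        # The decoder layer expects (hidden_states, router_logits); routing is
        # gone, so return None for the logits.
        return self.ffn(hidden_states), None


def merge_moe_into_dense(model: MixtralForCausalLM, layer_indices):
    for idx in layer_indices:
        moe = model.model.layers[idx].block_sparse_moe
        dense = moe.experts[0]  # reuse one expert module as the dense FF
        with torch.no_grad():
            for name in ("w1", "w2", "w3"):
                stacked = torch.stack([getattr(e, name).weight for e in moe.experts])
                getattr(dense, name).weight.copy_(stacked.mean(dim=0))
        # Replace the whole MoE block; the unused experts and router are freed.
        model.model.layers[idx].block_sparse_moe = DenseFFN(dense)
    return model
```

A model modified this way no longer matches the stock Mixtral architecture, which is presumably why the checkpoint ships its own `CustomMixtralForCausalLM` class (see the usage example below).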

- **Developed by:** MistralAI, NousResearch, theblackcat
- **Model type:** Modified Mixtral Architecture for dynamic MoE
- **License:** apache-2.0

## Uses

This model is still at an experimental stage; the goal is to find the sweet spot for running in just under 24 GB of GPU memory with a 4-bit quantization config.

```python
import torch
from transformers import AutoTokenizer

# CustomMixtralForCausalLM is the modified Mixtral class that ships with this
# repo's remote code; import it from the repo's modeling file. model_path is a
# placeholder for the local or Hub path of this checkpoint.
model_path = "path/to/this/repo"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    trust_remote_code=True,
)

# Report the total parameter count (in billions) to confirm the pruned size.
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

# Generate from a ChatML-formatted prompt.
max_length = 100
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|><|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))
output = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
print(tokenizer.decode(output[0]))
```
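
To check the memory budget mentioned above, the peak GPU allocation can be printed after generation (an illustrative addition, not part of the original script):

```python
# Peak GPU memory observed by PyTorch during the run, in GB.
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```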