---
library_name: transformers
license: apache-2.0
---

# Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF layers, 15 dense FF layers

## Model Details

### Model Description

This model is an MoE layer-pruning experiment derived from Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations.
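
Prompts follow the standard ChatML layout (the same format used in the generation example under Uses). A minimal sketch of building such a prompt by hand, where the system message and wording are only illustrative placeholders:

```python
# Minimal ChatML prompt construction; the system turn is optional and the
# messages here are illustrative placeholders.
system_msg = "You are a helpful assistant."
user_msg = "How are you? Write a story for me please"
prompt = (
    f"<|im_start|>system\n{system_msg}<|im_end|>\n"
    f"<|im_start|>user\n{user_msg}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```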

15 of the MoE layers are merged into plain feed-forward layers (17 of the 32 layers remain MoE), which reduces the total parameter count from 47B to 14B.

The indices of the pruned layers are as follows:

```
[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
```
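
To double-check which decoder layers kept their MoE block in the loaded model, you can print the class of each layer's feed-forward module. This is only a sketch: it assumes `model` was loaded as shown under Uses and that MoE layers still expose Mixtral's `block_sparse_moe` attribute, while the merged dense layers may use a different attribute name (guessed here as `mlp`).

```python
# Sketch: report the feed-forward block type of every decoder layer.
# Assumes `model` is the loaded CustomMixtralForCausalLM from the Uses example;
# the attribute names follow the stock Mixtral implementation and may differ
# in the custom modeling code.
for idx, layer in enumerate(model.model.layers):
    ff = getattr(layer, "block_sparse_moe", None) or getattr(layer, "mlp", None)
    print(f"layer {idx:2d}: {type(ff).__name__}")
```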

- **Developed by:** MistralAI, NousResearch, theblackcat
- **Model type:** Modified Mixtral architecture for dynamic MoE
- **License:** apache-2.0

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]

## Uses

This model is still at the experimental stage; we are still looking for the sweet spot that runs in just under 24 GB of memory with a 4-bit quantization config.

```python
import torch
from transformers import AutoTokenizer

# `model_path` is a local checkout of this repository or its Hugging Face repo id;
# `CustomMixtralForCausalLM` comes from the custom modeling code shipped with the repo.
model_path = "path/to/Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    trust_remote_code=True,
)

# Parameter count in billions (~14B after pruning, down from 47B).
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

# ChatML-formatted prompt, as expected by the Nous-Hermes-2 chat models.
max_length = 100
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|><|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))

output = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
print(tokenizer.decode(output[0]))
```
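
The `load_in_4bit=True` shortcut above goes through bitsandbytes with its default settings. If you prefer spelling out the 4-bit quantization config explicitly (here using NF4 with bfloat16 compute rather than the bitsandbytes defaults), a sketch assuming a recent `transformers` that provides `BitsAndBytesConfig` looks like this:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, aimed at fitting under ~24 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```

After generation, `torch.cuda.max_memory_allocated()` gives a quick read on whether the model actually stays under the 24 GB budget.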