This is a dumb experiment - don't expect it to be good!
I merged a few Mixtral models together, then tuned only the routing parameters. The loss dropped pretty steeply with only a bit of training, from ~0.99 to ~0.7 over about ten million tokens.
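For anyone curious what "tuning only the routing parameters" means in practice, here's a minimal sketch, assuming the merged model follows the standard Hugging Face Mixtral layout (per-layer `block_sparse_moe.gate` linears). The model path and dtype are placeholders, not the actual settings used for this checkpoint:

```python
# Sketch: freeze everything except the MoE router (gate) weights before training.
# Assumes the merged model uses the stock transformers Mixtral parameter naming.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-mixtral",      # hypothetical path to the merged model
    torch_dtype=torch.bfloat16,
)

# Only parameters belonging to the per-layer routers stay trainable.
for name, param in model.named_parameters():
    param.requires_grad = ".block_sparse_moe.gate." in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable router params: {trainable:,}")
```

The router weights are a tiny fraction of the total parameter count, which is why a noticeable loss drop over only ~10M tokens is plausible here.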
I'm hoping this after-the-fact balancing will have reduced some of the nasty behavior typical of current tunes. But maybe it just made it even dumber! We'll see.
Uses ChatML format.
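For reference, a ChatML prompt looks like this (the system message is just an illustration, not a recommended one for this model):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```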
Will update with more details if it turns out promising.