
Gemma 3 darkest muse?

#9
by Mispu - opened

darkest muse is my absolute favorite creative writing model. As a GPU-poor single-GPU user, it's small enough to run well, and the quality of its prose is significantly better than that of other models I've tried.

One limitation I run into frequently, though, is the 8192-token context window. I saw that the model is based on gemma2, and the context window limitation seems to come from there. Do you have any plans to make a gemma3 version? I've dabbled a little with making my own QLoRAs for an abliterated gemma3-12b, but the results have been mixed; I'm new to the whole thing, so I probably made some mistakes along the way.

Anyway, a gemma3 12b model trained in the same way as darkest muse could potentially be quite good. The longer context window alone would be a great boost, but with the extra 3b weights, maybe it would retain a bit more reasoning as well...

Are there any plans for this? If not, could you perhaps share more details on how darkest muse was trained? I can infer a fair bit from the model card, but I wonder what the exact training parameters were.

Thanks!

I would love to make a gemma 3 version, but I'm also GPU poor. If I can get hold of some cloud GPUs, I absolutely will.

Meanwhile, I'll see if I can find some time to write up the training process; I'm pretty sure I saved all the commands & params.

Thank you for your reply! Naturally, I fully understand about being GPU poor. I have no idea what the approximate cost of training would be, but I can definitely kick a few tens of dollars your way if you have PayPal or something. It's better spent on someone who knows what they're doing!

This is completely off the cuff, but Mistral Small 3.1 or Mistral Nemo could also be interesting candidates instead of gemma3. I've been daily-driving UnslopNemo for a while because it works well on the single 3080 Ti I have. UnslopNemo is what I use as a drafter and for longer contexts; darkest muse is for the more literary prose.

In my experience, the Mistral models, when abliterated, have fewer refusals than abliterated Gemma. For creative writing, the less censorship, the better (well, most of the time anyway, haha).

Gemma3 is probably the more natural evolution given darkest muse's gemma2 history, but the Mistral models really do seem to punch above their weight. So maybe down the line, if you're willing to share your training steps (no rush), I might have a stab at a Mistral Small darkest muse.

Thanks, I appreciate the offer. Last time, training cost on the order of a few hundred dollars, since it was a full fine-tune on several H100s, including a lot of trial & error. It can probably be done cheaper, but the training process was very finicky. So I'll probably wait till I get access to some free GPUs (or enough people offer to fund it).

I did actually try the same recipe on Nemo at the time. It degraded into very unpretty repetition & collapse before it got any good. It seems the training method (SimPO) has a very narrow window from no effect --> sweet spot --> model collapse, and some models are more robust to it (i.e. tolerating longer training before collapse) than others.

The reason this happens is that, unlike DPO & similar algorithms, SimPO doesn't have any mechanism (such as a reference-model term) to keep the weights from straying too far from the originals. So as you keep training, they keep diverging in the direction of your "preferred" pairs. That causes model collapse fairly easily, but it also has a profound effect in pushing the model away from its default style.
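To make the "no anchor to the original weights" point concrete, here is a rough sketch of the SimPO pairwise loss (as described in the SimPO paper), written as a scalar function. This is an illustrative re-implementation, not code from the training run; the function name and interface are my own. Note there is no reference-policy log-ratio anywhere, unlike DPO:

```python
import math

def simpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
               beta: float = 10.0, gamma_beta_ratio: float = 0.5) -> float:
    """SimPO loss for a single preference pair (illustrative sketch).

    avg_logp_*: the policy's length-averaged log-probability of the chosen /
    rejected completion. Unlike DPO, there is no reference-model term, so
    nothing pulls the policy back toward the original weights as the
    chosen/rejected margin keeps growing.
    """
    gamma = gamma_beta_ratio * beta  # target reward margin
    margin = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With beta=10 and gamma_beta_ratio=0.5 (the values from the config below), even a modest per-token log-prob gap gets amplified by beta, which fits the observed narrow window between "no effect" and collapse.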

The actual training process was:

  1. ORPO using the gutenberg3 DPO set (fairly mild training)
  2. SimPO using the same dataset

Here are some training config details that I saved:

# ORPO config:

from trl import ORPOConfig

base_model = "unsloth/gemma-2-9b-it"
# dataset: sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo

orpo_args = ORPOConfig(
    learning_rate=4e-6,
    lr_scheduler_type="linear",
    beta=0.1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    bf16 = True,
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.02,
    logging_steps=1,
    warmup_steps=10,
    output_dir="./results/",
)


# SIMPO config:

dataset: https://huggingface.co/datasets/sam-paech/gutenbergs_1_2_3_antislop-dpo
max_grad_norm: 0.0008
learning_rate: 0.58e-7
gradient_accumulation_steps: 2 (effective batch size 16: 8 GPUs x 1 per device x 2 accumulation steps)


# Model arguments
model_name_or_path: unsloth/gemma-2-9b-it
torch_dtype: null
attn_implementation: eager

# Data training arguments
chat_template: "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
dataset_mixer:
  sam-paech/gutenbergs_1_2_3_antislop-dpo: 1.0
dataset_splits:
- train
- test
preprocessing_num_workers: 1

# SimPOTrainer arguments
bf16: true
beta: 10
gamma_beta_ratio: 0.5
do_eval: true
evaluation_strategy: steps
eval_steps: 400
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
hub_model_id: simpo-exps
learning_rate: 0.58e-7
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 4096
max_prompt_length: 512
num_train_epochs: 1
optim: adamw_torch
output_dir: outputs/gemma-2-9b-it-gutenberg3
run_name: gemma-2-9b-it-gutenberg3
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
push_to_hub: false
save_strategy: "steps"
save_steps: 1000000
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
auto_insert_empty_system_msg: False
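
For anyone puzzled by that long chat_template string: it's Gemma's standard turn format, which prepends the BOS token, rejects a leading system message, and renames the assistant role to model. A manual Python re-implementation of the template's logic (illustrative only; the function name is hypothetical, and in practice the tokenizer applies the Jinja template itself) behaves like this:

```python
def render_gemma_chat(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Mirror the Gemma chat template above (illustrative re-implementation).

    bos_token defaults to Gemma's "<bos>"; in real code it comes from the
    tokenizer. Like the template, only a *leading* system message is rejected.
    """
    if messages and messages[0]["role"] == "system":
        raise ValueError("System role not supported")
    out = bos_token
    for m in messages:
        # Gemma uses 'model' where most chat formats use 'assistant'.
        role = "model" if m["role"] == "assistant" else m["role"]
        out += f"<start_of_turn>{role}\n{m['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        out += "<start_of_turn>model\n"
    return out
```

So a one-exchange conversation renders as `<bos><start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model\n...<end_of_turn>\n`, plus a trailing open model turn when prompting for a reply.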


8x H100s took ~30 min to train.
