---
tags:
- chat
- roleplay
- storywriting
- qwen3
- finetune
language:
- en
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-235B-A22B
---

[English](./non-lore-README.md) | [Français](./French-README.md)

It's an SFT on top of the largest Qwen, which nobody seems to have done yet, trained with the usual collection of Austral datasets (books, RP logs, LNs, etc.). I don't totally endorse the model yet and I think there's much work to be done in making a decensored, well-writing finetune of this model, but I released this to give everyone a slight taste of a Qwen3 finetune. It was also a way for us to test out some optimizations to actually get this model to train. Thanks to Intervitens <3

We used torchtune & an experimental, hacky PyTorch build: https://github.com/pytorch/pytorch/pull/156203

We trained this model over 24 hours on 8xB200s, graciously provided by Deepinfra & Cognitive Computations. Speeds were similar to a 70B trained with roughly the same data.

## Prompting

The model has been tuned with ChatML formatting. A typical input would look like this:

```
<|im_start|>system
system-prompt<|im_end|>
<|im_start|>user
user-prompt<|im_end|>
<|im_start|>assistant
assistant-prompt<|im_end|>
```
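If you build prompts with the `transformers` tokenizer instead of by hand, the chat template should produce the same ChatML layout. A minimal sketch, assuming you load the tokenizer from the base repo (swap in this finetune's repo id when using it):

```python
# Minimal sketch: build a ChatML prompt via the tokenizer's chat template.
# The repo id here is the base model's; substitute this finetune's repo id.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

messages = [
    {"role": "system", "content": "system-prompt"},
    {"role": "user", "content": "user-prompt"},
]

# add_generation_prompt=True appends the opening <|im_start|>assistant tag
# so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```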
## Torchtune config

Thank you so much to Intervitens for helping train this model. See the Torchtune trainer config below:

```yaml
output_dir: ./qwen3_235B_A22B_austral/full
tokenizer:
  _component_: torchtune.models.qwen3.qwen3_tokenizer
  path: ./Qwen3-235B-A22B-tt/vocab.json
  merges_file: ./Qwen3-235B-A22B-tt/merges.txt
  max_seq_len: 32768
dataset:
  _component_: torchtune.datasets.pretokenized_dataset
  source: IntervitensInc/test_235B_2-pack
  split: train
  packed: true
seed: 42
shuffle: false
model:
  _component_: torchtune.models.qwen3.qwen3_moe_235b_a22b
checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: ./Qwen3-235B-A22B-tt
  checkpoint_files:
    - model-00001-of-00001.bin
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: QWEN3_MOE
resume_from_checkpoint: false
enable_async_checkpointing: false
batch_size: 1
epochs: 4
optimizer:
  _component_: torchao.optim.AdamW8bit
  lr: 3.0e-06
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_rex_scheduler
  num_warmup_steps: 100
loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1
clip_grad_norm: null
compile:
  model: true
  loss: true
  scale_grads: true
  optimizer_step: false
optimizer_in_bwd: true
device: cuda
enable_activation_checkpointing: true
enable_activation_offloading: true
custom_sharded_layers:
  - tok_embeddings
  - output
fsdp_cpu_offload: false
dtype: bf16
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: qwen3-235-a22b-austral
log_every_n_steps: 1
log_peak_memory_stats: true
log_level: INFO
```
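One memory-relevant setting above is `optimizer_in_bwd: true`, which fuses the optimizer step into the backward pass so each parameter's gradient can be applied and freed immediately instead of being held for a separate `optimizer.step()`. Below is a minimal, illustrative sketch of that trick in plain PyTorch; the tiny model and `AdamW` here are placeholders, not the actual Qwen3 MoE model or torchao's `AdamW8bit`:

```python
# Sketch of "optimizer step in backward": one optimizer per parameter,
# stepped from a post-accumulate-grad hook so full gradients never need
# to be kept in memory at once (requires PyTorch >= 2.1).
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# One small optimizer per parameter (the real run uses torchao AdamW8bit).
optims = {p: torch.optim.AdamW([p], lr=3e-6) for p in model.parameters()}

def step_and_clear(param: torch.Tensor) -> None:
    # Fires right after this parameter's grad is accumulated in backward.
    optims[param].step()
    optims[param].zero_grad(set_to_none=True)  # free the grad immediately

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_and_clear)

x = torch.randn(4, 512)
loss = model(x).pow(2).mean()
loss.backward()  # parameters are updated during backward; no optimizer.step()
```

This only works cleanly with `gradient_accumulation_steps: 1`, as in the config, since gradients are consumed as soon as they are produced.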

## Credits

Thank you to [Lucy Knada](https://huggingface.co/lucyknada), [Auri](https://huggingface.co/Auri), [Intervitens](https://huggingface.co/intervitens), [Deepinfra](https://deepinfra.com/), [Cognitive Computations](https://huggingface.co/cognitivecomputations) and the rest of [Anthracite](https://huggingface.co/anthracite-org).

## Training

Training was done for 4 epochs. We used 8 x [B200](https://www.nvidia.com/en-us/data-center/dgx-b200/) GPUs, graciously provided by [Deepinfra](https://deepinfra.com/), for a full-parameter fine-tune of the model. The tuning setup was all thanks to Intervitens.

## Safety

It's still aligned to the beliefs of the Chinese Communist Party:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66c26b6fb01b19d8c3c2467b/0zqE9Wo2DsQT6ucxWfcSd.png)