|
--- |
|
tags:
- chat
- roleplay
- storywriting
- qwen3
- finetune
language:
- en
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-235B-A22B
|
--- |
|
|
|
[English](./non-lore-README.md) | [Français](./French-README.md)
|
|
|
|
|
It's an SFT on top of the largest Qwen, which nobody seems to have done yet, trained on a collection of the usual Austral datasets (books, RP logs, LNs, etc.). I don't fully endorse the model yet, and I think there's still a lot of work to be done toward a decensored, well-writing finetune of this model, but I'm releasing it to give everyone a first taste of a Qwen3 finetune.
|
|
|
It was also a way for us to test out some optimizations needed to actually get this model to train. Thanks to Intervitens <3
|
We used torchtune & an experimental, hacky PyTorch build: https://github.com/pytorch/pytorch/pull/156203
|
We trained this model for 24 hours on 8x B200s, graciously provided by Deepinfra & Cognitive Computations.
|
Speeds were similar to a 70B trained with roughly the same data. |
|
|
|
|
|
## Prompting |
|
The model has been tuned with ChatML formatting. A typical input looks like this:
|
|
|
```py |
|
<|im_start|>system
system-prompt<|im_end|>
<|im_start|>user
user-prompt<|im_end|>
<|im_start|>assistant
<think>

</think>

assistant-prompt<|im_end|>
|
``` |
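
For reference, below is a minimal sketch of building that prompt with the `transformers` tokenizer. The repo id is a placeholder and the `enable_thinking` switch is carried over from stock Qwen3, so treat both as assumptions rather than something this card guarantees.

```py
# A minimal sketch of building the ChatML prompt above with transformers.
# Assumes this finetune keeps the stock Qwen3 chat template, where
# enable_thinking=False inserts the empty <think></think> block shown above.
from transformers import AutoTokenizer

# Placeholder repo id -- swap in this model's actual repository.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

messages = [
    {"role": "system", "content": "system-prompt"},
    {"role": "user", "content": "user-prompt"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # non-thinking mode, matching the template above
)
print(prompt)
```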
|
|
|
|
|
## Torchtune config |
|
|
|
Thank you so much to Intervitens for helping train this model:
|
|
|
<details><summary>See Torchtune Trainer config</summary> |
|
|
|
```yaml |
|
output_dir: ./qwen3_235B_A22B_austral/full

tokenizer:
  _component_: torchtune.models.qwen3.qwen3_tokenizer
  path: ./Qwen3-235B-A22B-tt/vocab.json
  merges_file: ./Qwen3-235B-A22B-tt/merges.txt
  max_seq_len: 32768

dataset:
  _component_: torchtune.datasets.pretokenized_dataset
  source: IntervitensInc/test_235B_2-pack
  split: train
  packed: true
seed: 42
shuffle: false

model:
  _component_: torchtune.models.qwen3.qwen3_moe_235b_a22b

checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: ./Qwen3-235B-A22B-tt
  checkpoint_files:
    - model-00001-of-00001.bin
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: QWEN3_MOE
resume_from_checkpoint: false
enable_async_checkpointing: false

batch_size: 1
epochs: 4

optimizer:
  _component_: torchao.optim.AdamW8bit
  lr: 3.0e-06
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_rex_scheduler
  num_warmup_steps: 100
loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1
clip_grad_norm: null
compile:
  model: true
  loss: true
  scale_grads: true
  optimizer_step: false
optimizer_in_bwd: true

device: cuda
enable_activation_checkpointing: true
enable_activation_offloading: true
custom_sharded_layers:
  - tok_embeddings
  - output
fsdp_cpu_offload: false

dtype: bf16

metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: qwen3-235-a22b-austral
log_every_n_steps: 1
log_peak_memory_stats: true
log_level: INFO
|
``` |
|
|
|
</details><br> |
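
A config like this would typically be launched through torchtune's distributed full-finetune recipe, e.g. `tune run --nproc_per_node 8 full_finetune_distributed --config <config>.yaml`. The exact recipe, launch flags, and any command-line overrides used for this run aren't recorded here, so treat that invocation as an assumption.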
|
|
|
## Credits |
|
|
|
Thank you to [Lucy Knada](https://huggingface.co/lucyknada), [Auri](https://huggingface.co/Auri), [Intervitens](https://huggingface.co/intervitens), [Deepinfra](https://deepinfra.com/), [Cognitive Computations](https://huggingface.co/cognitivecomputations) and the rest of [Anthracite](https://huggingface.co/anthracite-org).
|
|
|
|
|
## Training |
|
The training was done for 4 epochs. We used 8 x [B200](https://www.nvidia.com/en-us/data-center/dgx-b200/) GPUs, graciously provided by [Deepinfra](https://deepinfra.com/), for a full-parameter fine-tune of the model; the tuning itself was all thanks to Intervitens.
|
|
|
## Safety |
|
It's still aligned to the beliefs of the Chinese Communist Party: |
|
 |