|
--- |
|
tags:
- chat
- roleplay
- storywriting
- qwen3
- finetune
language:
- en
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-235B-A22B
|
--- |
|
|
|
[English](./non-lore-README.md) | [Français](./French-README.md)
|
|
|
|
|
It's an SFT on top of the largest Qwen, which nobody seems to have done yet, trained on a collection of the usual Austral datasets (books, RP logs, LNs, etc.). I don't fully endorse the model yet, and I think there's still a lot of work to be done toward a decensored, well-writing finetune of this model, but I'm releasing it to give everyone a first taste of a Qwen3 finetune.
|
|
|
It was also a way for us to test out some optimizations needed to actually get this model to train. Thanks to Intervitens <3
|
We used torchtune & an experimental, hacky PyTorch build: https://github.com/pytorch/pytorch/pull/156203
|
We trained this model for 24 hours on 8x B200s, graciously provided by Deepinfra & Cognitive Computations.
|
Speeds were similar to a 70B trained with roughly the same data. |
|
|
|
|
|
## Prompting |
|
The model has been tuned with ChatML formatting. A typical input looks like this:
|
|
|
```py |
|
<|im_start|>system
system-prompt<|im_end|>
<|im_start|>user
user-prompt<|im_end|>
<|im_start|>assistant
<think>

</think>

assistant-prompt<|im_end|>
|
``` |
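
For reference, below is a minimal sketch of building that prompt with the `transformers` tokenizer. The repo id is a placeholder and the `enable_thinking` switch is carried over from stock Qwen3, so treat both as assumptions rather than something this card guarantees.

```py
# A minimal sketch of building the ChatML prompt above with transformers.
# Assumes this finetune keeps the stock Qwen3 chat template, where
# enable_thinking=False inserts the empty <think></think> block shown above.
from transformers import AutoTokenizer

# Placeholder repo id -- swap in this model's actual repository.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

messages = [
    {"role": "system", "content": "system-prompt"},
    {"role": "user", "content": "user-prompt"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # non-thinking mode, matching the template above
)
print(prompt)
```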
|
|
|
|
|
## Torchtune config |
|
|
|
Thank you so much to Intervitens for helping train this model:
|
|
|
<details><summary>See Torchtune Trainer config</summary> |
|
|
|
```yaml |
|
output_dir: ./qwen3_235B_A22B_austral/full

tokenizer:
  _component_: torchtune.models.qwen3.qwen3_tokenizer
  path: ./Qwen3-235B-A22B-tt/vocab.json
  merges_file: ./Qwen3-235B-A22B-tt/merges.txt
  max_seq_len: 32768

dataset:
  _component_: torchtune.datasets.pretokenized_dataset
  source: IntervitensInc/test_235B_2-pack
  split: train
  packed: true
seed: 42
shuffle: false

model:
  _component_: torchtune.models.qwen3.qwen3_moe_235b_a22b

checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: ./Qwen3-235B-A22B-tt
  checkpoint_files:
    - model-00001-of-00001.bin
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: QWEN3_MOE
resume_from_checkpoint: false
enable_async_checkpointing: false

batch_size: 1
epochs: 4

optimizer:
  _component_: torchao.optim.AdamW8bit
  lr: 3.0e-06
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_rex_scheduler
  num_warmup_steps: 100
loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1
clip_grad_norm: null
compile:
  model: true
  loss: true
  scale_grads: true
  optimizer_step: false
optimizer_in_bwd: true

device: cuda
enable_activation_checkpointing: true
enable_activation_offloading: true
custom_sharded_layers:
  - tok_embeddings
  - output
fsdp_cpu_offload: false

dtype: bf16

metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: qwen3-235-a22b-austral
log_every_n_steps: 1
log_peak_memory_stats: true
log_level: INFO
|
``` |
|
|
|
</details><br> |
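
A config like this would typically be launched through torchtune's distributed full-finetune recipe, e.g. `tune run --nproc_per_node 8 full_finetune_distributed --config <config>.yaml`. The exact recipe, launch flags, and any command-line overrides used for this run aren't recorded here, so treat that invocation as an assumption.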
|
|
|
## Credits |
|
|
|
Thank you to [Lucy Knada](https://huggingface.co/lucyknada), [Auri](https://huggingface.co/Auri), [Intervitens](https://huggingface.co/intervitens), [Deepinfra](https://deepinfra.com/), [Cognitive Computations](https://huggingface.co/cognitivecomputations) and the rest of [Anthracite](https://huggingface.co/anthracite-org).
|
|
|
|
|
## Training |
|
The training was done for 4 epochs. We used 8 x [B200](https://www.nvidia.com/en-us/data-center/dgx-b200/) GPUs, graciously provided by [Deepinfra](https://deepinfra.com/), for a full-parameter fine-tune of the model; the tuning itself was all thanks to Intervitens.
|
|
|
## Safety |
|
It's still aligned to the beliefs of the Chinese Communist Party: |
|
 |