
	---
	base_model: Qwen/Qwen2-0.5B-Instruct
	library_name: distily
	license: apache-2.0
	tags:
	- generated_from_trainer
	model-index:
	- name: distily_experiments_loss_reverse_kl
	results: []
	---

	# distily_experiments_loss_reverse_kl

	This student model was distilled from the teacher model [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct). The training dataset is not specified.

	The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

	It achieves the following results on the evaluation set:
	- eval_enwikippl: 2760.3779
	- eval_frwikippl: 28158.2578
	- eval_zhwikippl: 441247.4688
	- eval_loss: 3.1654
	- eval_runtime: 90.8509
	- eval_samples_per_second: 11.007
	- eval_steps_per_second: 2.752
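
As a quick sanity check on the throughput figures above, multiplying the runtime by the two rates should recover the approximate number of evaluation samples and steps. The sketch below assumes `eval_runtime` is reported in seconds (the usual Trainer convention); the variable names are illustrative and not part of Distily.

```python
# Values copied from the evaluation results listed above.
eval_runtime = 90.8509            # assumed to be seconds
eval_samples_per_second = 11.007
eval_steps_per_second = 2.752

# Runtime multiplied by a rate gives the approximate size of one evaluation pass.
approx_samples = eval_runtime * eval_samples_per_second
approx_steps = eval_runtime * eval_steps_per_second
samples_per_step = approx_samples / approx_steps

print(f"samples per eval pass: about {approx_samples:.0f}")
print(f"steps per eval pass:   about {approx_steps:.0f}")
print(f"samples per step:      about {samples_per_step:.1f}")
```

The ratio of roughly 4 samples per step is consistent with the `eval_batch_size` of 4 listed under the training hyperparameters.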


	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- distillation_strategy: logits_activations
	- loss_fn: reverse_kl
	- train_embeddings: True
	- learning_rate: 4e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: constant
	- num_epochs: 1.0
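
The `loss_fn: reverse_kl` setting presumably refers to the reverse Kullback-Leibler divergence, KL(student || teacher), in which the expectation is taken under the student's distribution rather than the teacher's. The following is a minimal pure-Python sketch of that quantity for a single token position. It is not Distily's implementation, and the function names and example logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def reverse_kl(student_logits, teacher_logits):
    """Reverse KL: expectation under the student, KL(student || teacher)."""
    return kl_divergence(softmax(student_logits), softmax(teacher_logits))

def forward_kl(student_logits, teacher_logits):
    """Forward KL, for comparison: expectation under the teacher, KL(teacher || student)."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# Hypothetical logits over a 3-token vocabulary (illustration only).
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [0.5, 2.5, 0.3]

print("reverse KL:", reverse_kl(student_logits, teacher_logits))
print("forward KL:", forward_kl(student_logits, teacher_logits))
```

The two directions generally give different values because KL divergence is asymmetric. Reverse KL penalizes the student most for placing probability mass where the teacher assigns little.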

	### Resource Usage
	- Peak GPU memory: 19.8832 GB

	### Eval-Phase Metrics
	| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| teacher eval | | 13.0697 | 11.6518 | | | | | 21.6262 |
	| 0 | 0 | 180187.8438 | 182062.6875 | 131.8108 | 90.6539 | 11.031 | 2.758 | 181762.375 |
	| 500 | 0.0808 | 14699.2041 | 52797.9922 | 6.0418 | 90.8884 | 11.003 | 2.751 | 371252.0312 |
	| 1000 | 0.1616 | 8812.4561 | 47709.9297 | 4.9882 | 90.8533 | 11.007 | 2.752 | 384212.3438 |
	| 1500 | 0.2424 | 7321.3081 | 44922.375 | 4.6195 | 90.7179 | 11.023 | 2.756 | 400192.5625 |
	| 2000 | 0.3232 | 6277.4165 | 42254.6719 | 4.2012 | 90.8257 | 11.01 | 2.753 | 423631.0938 |
	| 2500 | 0.4040 | 5452.0264 | 39927.7812 | 3.9955 | 90.7803 | 11.016 | 2.754 | 445022.5938 |
	| 3000 | 0.4848 | 4708.5049 | 37660.8359 | 3.7784 | 90.8232 | 11.01 | 2.753 | 447453.4375 |
	| 3500 | 0.5657 | 4329.6147 | 35350.4805 | 3.6816 | 90.8654 | 11.005 | 2.751 | 455292.8125 |
	| 4000 | 0.6465 | 3840.0864 | 33493.6836 | 3.5800 | 90.7858 | 11.015 | 2.754 | 446474.3125 |
	| 4500 | 0.7273 | 3495.4482 | 31764.3340 | 3.4447 | 90.8083 | 11.012 | 2.753 | 447611.3438 |
	| 5000 | 0.8081 | 3245.5376 | 30812.8379 | 3.3323 | 90.7976 | 11.014 | 2.753 | 448982.8438 |
	| 5500 | 0.8889 | 3057.9595 | 29516.0742 | 3.2926 | 90.7385 | 11.021 | 2.755 | 459842.8125 |
	| 6000 | 0.9697 | 2831.3643 | 28517.0625 | 3.1956 | 90.7677 | 11.017 | 2.754 | 441979.4375 |
	| 6187 | 0.9999 | 2760.3779 | 28158.2578 | 3.1654 | 90.8509 | 11.007 | 2.752 | 441247.4688 |
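
To put the table in context, the sketch below compares the student's final perplexities (step 6187) with the teacher's and with the student's own starting point (step 0). All numbers are copied from the table; the dictionary layout and variable names are illustrative.

```python
# Perplexities copied from the Eval-Phase Metrics table.
teacher = {"enwikippl": 13.0697, "frwikippl": 11.6518, "zhwikippl": 21.6262}
step_0 = {"enwikippl": 180187.8438, "frwikippl": 182062.6875, "zhwikippl": 181762.375}
final = {"enwikippl": 2760.3779, "frwikippl": 28158.2578, "zhwikippl": 441247.4688}

# Ratio of the final student perplexity to the teacher's (1.0 would mean parity).
gap_to_teacher = {m: final[m] / teacher[m] for m in teacher}
# Ratio of the final student perplexity to its own value at step 0 (< 1.0 means it improved).
change_from_start = {m: final[m] / step_0[m] for m in teacher}

for metric in teacher:
    trend = "lower" if change_from_start[metric] < 1.0 else "higher"
    print(
        f"{metric}: {gap_to_teacher[metric]:,.0f}x the teacher's perplexity, "
        f"{trend} than at step 0 ({change_from_start[metric]:.3f}x)"
    )
```

English and French perplexity fall over the course of training, while Chinese perplexity ends higher than at step 0. All three remain far above the teacher's values.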

	### Framework versions
	- Distily 0.1.0
	- Transformers 4.43.3
	- PyTorch 2.3.0
	- Datasets 2.20.0