End of training

README.md CHANGED
---
base_model: gpt2
library_name: distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: gpt2_model_card_distily_test
  results: []
---

# gpt2_model_card_distily_test

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) on an unspecified dataset.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
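
Because the student keeps gpt2's architecture and tokenizer, it should load through the standard `transformers` API. A minimal sketch; the Hub repository id below is a placeholder, since the card does not state it:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id: substitute the actual Hub id of this model.
repo_id = "<namespace>/gpt2_model_card_distily_test"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation to sanity-check the distilled student.
inputs = tokenizer("Knowledge distillation compresses", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```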

It achieves the following results on the evaluation set:
- eval_enwikippl: 16455.1230
- eval_frwikippl: 38444.9648
- eval_zhwikippl: 56717.4922
- eval_loss: 0.0004
- eval_runtime: 0.0554
- eval_samples_per_second: 18.066
- eval_steps_per_second: 18.066
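
The `*ppl` metrics are presumably perplexities on English, French, and Chinese Wikipedia samples. The evaluation code is not shown in this card; a standard causal-LM perplexity computation looks like the sketch below. Note that `eval_loss` is presumably the distillation objective rather than a language-modeling loss: a cross-entropy of 0.0004 would imply a perplexity near 1, not ~16k.

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """Causal-LM perplexity: exp of the mean per-token negative log-likelihood."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the (shifted) cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()
```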

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
-->

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_strategy: logits_activations
- loss_fn: reverse_kl
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 2
- lr_scheduler_type: cosine
- num_epochs: 1.0
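
The `loss_fn: reverse_kl` entry in the list above refers to the reverse KL divergence KL(student ∥ teacher), which is zero-forcing: the student is pushed to put no mass where the teacher puts none. Distily's exact implementation (temperature, masking, and the activation-matching terms implied by `distillation_strategy: logits_activations`) is not reproduced here; a minimal PyTorch sketch of the logit term:

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Reverse KL, KL(student || teacher), over the vocabulary dimension.

    F.kl_div(input, target) computes KL(target || input) with `input` given
    as log-probabilities, so the student goes in the `target` slot.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # log_target=True: the target (student) is also given in log space.
    return F.kl_div(t_logp, s_logp, reduction="batchmean", log_target=True)
```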

### Resource Usage
Peak GPU Memory: 1.2452 GB
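
The card does not say how this peak was measured; on a CUDA device, PyTorch's allocator statistics give a comparable figure:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the distillation or evaluation step of interest here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.4f} GB")
```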

### Model Results
| epoch | step | eval_enwikippl | eval_frwikippl | eval_loss | eval_runtime | eval_samples_per_second | eval_steps_per_second | eval_zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 63012.375 | 58568.7617 | 0.0042 | 0.076 | 13.155 | 13.155 | 62696.3008 |
| 0.4040 | 40 | 20128.3281 | 41006.9219 | 0.0004 | 0.0553 | 18.079 | 18.079 | 58574.4609 |
| 0.8081 | 80 | 16455.1230 | 38444.9648 | 0.0004 | 0.0554 | 18.066 | 18.066 | 56717.4922 |

### Framework versions
- Distily 0.1.0
- Transformers 4.43.3
- Pytorch 2.3.0
- Datasets 2.20.0