End of training
README.md CHANGED
@@ -4,7 +4,7 @@ library_name: Distily
 tags:
 - generated_from_trainer
 model-index:
-- name: distily_TinyStories-
+- name: distily_TinyStories-33M_hs_attn
   results: []
 ---
 
@@ -15,13 +15,13 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-33M
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl:
-- eval_frwikippl:
-- eval_zhwikippl:
-- eval_loss:
-- eval_runtime: 51.
-- eval_samples_per_second: 48.
-- eval_steps_per_second: 6.
+- eval_enwikippl: 5505.2720
+- eval_frwikippl: 21773.6699
+- eval_zhwikippl: 149216.0938
+- eval_loss: 1.1383
+- eval_runtime: 51.1413
+- eval_samples_per_second: 48.884
+- eval_steps_per_second: 6.12
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -44,7 +44,7 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=
+- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=5000.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=500.0, loss_fn=jsd, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 4e-05
 - train_batch_size: 8
@@ -55,44 +55,44 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
-Peak GPU Memory: 8.
+Peak GPU Memory: 8.2949 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 20633.1680 | 131577.2812 | | | | | 7615.4468 |
-| 0 | 0 |
-| 1000 | 0.0323 |
-| 2000 | 0.0646 |
-| 3000 | 0.0970 |
-| 4000 | 0.1293 |
-| 5000 | 0.1616 |
-| 6000 | 0.1939 |
-| 7000 | 0.2263 |
-| 8000 | 0.2586 |
-| 9000 | 0.2909 |
-| 10000 | 0.3232 |
-| 11000 | 0.3555 |
-| 12000 | 0.3879 |
-| 13000 | 0.4202 |
-| 14000 | 0.4525 |
-| 15000 | 0.4848 |
-| 16000 | 0.5172 |
-| 17000 | 0.5495 |
-| 18000 | 0.5818 |
-| 19000 | 0.6141 |
-| 20000 | 0.6465 |
-| 21000 | 0.6788 |
-| 22000 | 0.7111 |
-| 23000 | 0.7434 |
-| 24000 | 0.7757 |
-| 25000 | 0.8081 |
-| 26000 | 0.8404 |
-| 27000 | 0.8727 |
-| 28000 | 0.9050 |
-| 29000 | 0.9374 |
-| 30000 | 0.9697 |
-| 30938 | 1.0 |
+| 0 | 0 | 57409.7656 | 57878.0820 | 11.7972 | 40.6672 | 61.475 | 7.697 | 56928.0781 |
+| 1000 | 0.0323 | 10372.9512 | 76930.4531 | 1.9053 | 41.7953 | 59.815 | 7.489 | 858113.625 |
+| 2000 | 0.0646 | 8020.6040 | 46711.9688 | 1.6472 | 41.0642 | 60.88 | 7.622 | 367518.3125 |
+| 3000 | 0.0970 | 8157.5376 | 45240.3945 | 1.5278 | 45.4508 | 55.005 | 6.887 | 515510.5625 |
+| 4000 | 0.1293 | 7411.5596 | 36822.6484 | 1.4337 | 51.1158 | 48.909 | 6.123 | 421034.4688 |
+| 5000 | 0.1616 | 6422.7583 | 28339.4023 | 1.3515 | 51.1748 | 48.852 | 6.116 | 267027.4375 |
+| 6000 | 0.1939 | 6131.3276 | 24695.6113 | 1.2750 | 50.9734 | 49.045 | 6.14 | 194273.2656 |
+| 7000 | 0.2263 | 5802.4341 | 23374.1562 | 1.2199 | 50.8571 | 49.157 | 6.155 | 168406.4688 |
+| 8000 | 0.2586 | 5621.9170 | 21168.1855 | 1.1773 | 51.0097 | 49.01 | 6.136 | 164012.0469 |
+| 9000 | 0.2909 | 5505.2720 | 21773.6699 | 1.1383 | 51.1413 | 48.884 | 6.12 | 149216.0938 |
+| 10000 | 0.3232 | 5617.5493 | 21623.7461 | 1.1134 | 51.0853 | 48.938 | 6.127 | 148977.0625 |
+| 11000 | 0.3555 | 5438.9810 | 21305.9277 | 1.0901 | 51.2289 | 48.801 | 6.11 | 148262.7188 |
+| 12000 | 0.3879 | 5601.4360 | 22292.5059 | 1.0718 | 51.1771 | 48.85 | 6.116 | 156941.4062 |
+| 13000 | 0.4202 | 5323.2368 | 21323.9785 | 1.0547 | 50.814 | 49.199 | 6.16 | 145089.7812 |
+| 14000 | 0.4525 | 5399.0068 | 21468.7930 | 1.0443 | 50.9066 | 49.11 | 6.149 | 147118.75 |
+| 15000 | 0.4848 | 5341.0449 | 20151.6465 | 1.0364 | 51.0013 | 49.018 | 6.137 | 134312.3438 |
+| 16000 | 0.5172 | 5234.6987 | 20021.3477 | 1.0292 | 51.7235 | 48.334 | 6.051 | 136299.75 |
+| 17000 | 0.5495 | 5317.8687 | 21308.9355 | 1.0156 | 54.7044 | 45.7 | 5.722 | 149495.2656 |
+| 18000 | 0.5818 | 5521.5405 | 20827.6855 | 1.0137 | 41.4159 | 60.363 | 7.557 | 141984.7344 |
+| 19000 | 0.6141 | 5249.7568 | 20254.2051 | 1.0055 | 42.1847 | 59.263 | 7.42 | 124202.625 |
+| 20000 | 0.6465 | 5582.7598 | 21764.4727 | 0.9982 | 46.3033 | 53.992 | 6.76 | 149495.2656 |
+| 21000 | 0.6788 | 5232.6621 | 20262.7637 | 0.9935 | 48.1287 | 51.944 | 6.503 | 145128.5312 |
+| 22000 | 0.7111 | 5320.3491 | 21332.9902 | 0.9854 | 50.6681 | 49.341 | 6.177 | 155605.7656 |
+| 23000 | 0.7434 | 5032.2212 | 19788.3945 | 0.9876 | 50.9899 | 49.029 | 6.138 | 141417.0312 |
+| 24000 | 0.7757 | 5318.2793 | 22064.2031 | 0.9832 | 50.912 | 49.104 | 6.148 | 152560.7188 |
+| 25000 | 0.8081 | 5365.5708 | 21906.0957 | 0.9779 | 51.1379 | 48.887 | 6.121 | 154034.5156 |
+| 26000 | 0.8404 | 5328.6157 | 22267.3691 | 0.9740 | 51.1115 | 48.913 | 6.124 | 154983.75 |
+| 27000 | 0.8727 | 5565.8813 | 22663.3496 | 0.9714 | 32.781 | 76.264 | 9.548 | 152397.8594 |
+| 28000 | 0.9050 | 5278.7847 | 20380.2637 | 0.9723 | 27.108 | 92.224 | 11.546 | 141190.6406 |
+| 29000 | 0.9374 | 5302.2002 | 20637.6562 | 0.9657 | 30.8728 | 80.977 | 10.138 | 139914.2969 |
+| 30000 | 0.9697 | 5366.4053 | 22920.4629 | 0.9633 | 27.0433 | 92.444 | 11.574 | 160202.3281 |
+| 30938 | 1.0 | 5286.9868 | 20498.4277 | 0.9628 | 27.0346 | 92.474 | 11.578 | 145051.0469 |
 
 ### Framework versions
 - Distily 0.2.0
runs/Aug15_11-10-45_77e473d64567/events.out.tfevents.1723731347.77e473d64567 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6bb4b6bd610761dd2cdd28be0d99c6dc3abb383683d8b94a8978002fe5798e5d
+size 253
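---

The updated `distillation_objective` above combines three loss components: KL divergence on the logits (weight 1), MSE on the hidden states (weight 5000.0), and Jensen-Shannon divergence on the attention maps (weight 500.0). As a rough illustration of how such a weighted combination can be computed, here is a minimal PyTorch sketch; it is not Distily's actual implementation, the helper names (`kl_loss`, `jsd_loss`, `combined_distillation_loss`) are hypothetical, and it assumes transformers-style model outputs with `logits`, `hidden_states`, and `attentions` fields.

```python
import torch
import torch.nn.functional as F

def kl_loss(student_logits, teacher_logits):
    # KL(teacher || student) over the vocabulary: the usual
    # logit-distillation direction.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        reduction="batchmean",
        log_target=True,
    )

def jsd_loss(p, q, eps=1e-9):
    # Jensen-Shannon divergence between two attention maps, each already
    # normalized over the last dimension; eps guards log(0).
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm).mean()

def combined_distillation_loss(student_out, teacher_out):
    # Weighted sum mirroring the card's objective:
    # logits: KL (weight 1); hidden states: MSE (weight 5000.0);
    # attentions: JSD (weight 500.0). layer_mapper=None in the card,
    # so layers are paired one-to-one here.
    loss = kl_loss(student_out.logits, teacher_out.logits)
    loss = loss + 5000.0 * torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    loss = loss + 500.0 * torch.stack([
        jsd_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return loss
```

The large hidden-state and attention weights plausibly compensate for those per-element losses being numerically much smaller than the KL term, though the exact scaling Distily applies may differ.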