lapp0 committed
Commit b67d23b · verified · 1 Parent(s): 1bd9938

Training in progress, step 199

Files changed (41)
  1. README.md +0 -68
  2. gpt2_model_card_distily_test/README.md +68 -0
  3. gpt2_model_card_distily_test/checkpoint-500/config.json +54 -0
  4. gpt2_model_card_distily_test/checkpoint-500/generation_config.json +6 -0
  5. gpt2_model_card_distily_test/checkpoint-500/merges.txt +0 -0
  6. gpt2_model_card_distily_test/checkpoint-500/model.safetensors +3 -0
  7. gpt2_model_card_distily_test/checkpoint-500/optimizer.pt +3 -0
  8. gpt2_model_card_distily_test/checkpoint-500/rng_state.pth +3 -0
  9. gpt2_model_card_distily_test/checkpoint-500/scheduler.pt +3 -0
  10. gpt2_model_card_distily_test/checkpoint-500/special_tokens_map.json +6 -0
  11. gpt2_model_card_distily_test/checkpoint-500/tokenizer.json +0 -0
  12. gpt2_model_card_distily_test/checkpoint-500/tokenizer_config.json +20 -0
  13. gpt2_model_card_distily_test/checkpoint-500/trainer_state.json +295 -0
  14. gpt2_model_card_distily_test/checkpoint-500/training_args.bin +3 -0
  15. gpt2_model_card_distily_test/checkpoint-500/vocab.json +0 -0
  16. gpt2_model_card_distily_test/checkpoint-999/config.json +54 -0
  17. gpt2_model_card_distily_test/checkpoint-999/generation_config.json +6 -0
  18. gpt2_model_card_distily_test/checkpoint-999/merges.txt +0 -0
  19. gpt2_model_card_distily_test/checkpoint-999/model.safetensors +3 -0
  20. gpt2_model_card_distily_test/checkpoint-999/optimizer.pt +3 -0
  21. gpt2_model_card_distily_test/checkpoint-999/rng_state.pth +3 -0
  22. gpt2_model_card_distily_test/checkpoint-999/scheduler.pt +3 -0
  23. gpt2_model_card_distily_test/checkpoint-999/special_tokens_map.json +6 -0
  24. gpt2_model_card_distily_test/checkpoint-999/tokenizer.json +0 -0
  25. gpt2_model_card_distily_test/checkpoint-999/tokenizer_config.json +20 -0
  26. gpt2_model_card_distily_test/checkpoint-999/trainer_state.json +542 -0
  27. gpt2_model_card_distily_test/checkpoint-999/training_args.bin +3 -0
  28. gpt2_model_card_distily_test/checkpoint-999/vocab.json +0 -0
  29. gpt2_model_card_distily_test/config.json +54 -0
  30. gpt2_model_card_distily_test/generation_config.json +6 -0
  31. gpt2_model_card_distily_test/merges.txt +0 -0
  32. gpt2_model_card_distily_test/model.safetensors +3 -0
  33. gpt2_model_card_distily_test/runs/Aug05_20-55-15_232a0f8c3879/events.out.tfevents.1722891394.232a0f8c3879 +3 -0
  34. gpt2_model_card_distily_test/special_tokens_map.json +6 -0
  35. gpt2_model_card_distily_test/tokenizer.json +0 -0
  36. gpt2_model_card_distily_test/tokenizer_config.json +20 -0
  37. gpt2_model_card_distily_test/training_args.bin +3 -0
  38. gpt2_model_card_distily_test/vocab.json +0 -0
  39. model.safetensors +1 -1
  40. runs/Aug05_21-11-07_232a0f8c3879/events.out.tfevents.1722892417.232a0f8c3879 +3 -0
  41. training_args.bin +1 -1
README.md CHANGED
@@ -1,68 +0,0 @@
- ---
- base_model: gpt2
- library_name: distily
- license: mit
- tags:
- - Distily
- - generated_from_trainer
- model-index:
- - name: gpt2_model_card_distily_test
-   results: []
- ---
-
- # gpt2_model_card_distily_test
-
- This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using the dataset (unspecified).
-
- The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
-
- It achieves the following results on the evaluation set:
- - train_loss: 2109.4855
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
- -->
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - distillation_strategy: logits_activations
- - loss_fn: reverse_kl
- - train_embeddings: True
- - learning_rate: 0.0001
- - train_batch_size: 1
- - eval_batch_size: 2
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - num_epochs: 1.0
-
- ### Model Results
- | epoch | eval_enwikippl | eval_frwikippl | eval_loss | eval_runtime | eval_samples_per_second | eval_steps_per_second | eval_zhwikippl | step | train_loss |
- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | 0 | 61518.3633 | 57357.1172 | 7104.0 | 0.1065 | 9.388 | 9.388 | 60678.2734 | 0 | |
- | 0.2002002002002002 | 1984.4683 | 9672.7939 | 2192.0 | 0.0547 | 18.295 | 18.295 | 121910.375 | 200 | |
- | 0.4004004004004004 | 1589.3818 | 7626.9956 | 2048.0 | 0.0545 | 18.334 | 18.334 | 74891.5859 | 400 | |
- | 0.6006006006006006 | 1461.5446 | 7612.6294 | 1968.0 | 0.0554 | 18.063 | 18.063 | 75592.3516 | 600 | |
- | 0.8008008008008008 | 1401.9131 | 7065.2969 | 1960.0 | 0.0547 | 18.283 | 18.283 | 59395.5664 | 800 | |
- | | | | | | | | | | 2109.4855 |
-
- ### Framework versions
- - Distily 0.1.0
- - Transformers 4.43.3
- - Pytorch 2.3.0
- - Datasets 2.20.0
gpt2_model_card_distily_test/README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ base_model: gpt2
+ library_name: distily
+ license: mit
+ tags:
+ - Distily
+ - generated_from_trainer
+ model-index:
+ - name: gpt2_model_card_distily_test
+   results: []
+ ---
+
+ # gpt2_model_card_distily_test
+
+ This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using the dataset (unspecified).
+
+ The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
+
+ It achieves the following results on the evaluation set:
+ - train_loss: 2109.4855
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment.
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+ -->
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - distillation_strategy: logits_activations
+ - loss_fn: reverse_kl
+ - train_embeddings: True
+ - learning_rate: 0.0001
+ - train_batch_size: 1
+ - eval_batch_size: 2
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - num_epochs: 1.0
+
+ ### Model Results
+ | epoch | eval_enwikippl | eval_frwikippl | eval_loss | eval_runtime | eval_samples_per_second | eval_steps_per_second | eval_zhwikippl | step | train_loss |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | 0 | 61518.3633 | 57357.1172 | 7104.0 | 0.1065 | 9.388 | 9.388 | 60678.2734 | 0 | |
+ | 0.2002002002002002 | 1984.4683 | 9672.7939 | 2192.0 | 0.0547 | 18.295 | 18.295 | 121910.375 | 200 | |
+ | 0.4004004004004004 | 1589.3818 | 7626.9956 | 2048.0 | 0.0545 | 18.334 | 18.334 | 74891.5859 | 400 | |
+ | 0.6006006006006006 | 1461.5446 | 7612.6294 | 1968.0 | 0.0554 | 18.063 | 18.063 | 75592.3516 | 600 | |
+ | 0.8008008008008008 | 1401.9131 | 7065.2969 | 1960.0 | 0.0547 | 18.283 | 18.283 | 59395.5664 | 800 | |
+ | | | | | | | | | | 2109.4855 |
+
+ ### Framework versions
+ - Distily 0.1.0
+ - Transformers 4.43.3
+ - Pytorch 2.3.0
+ - Datasets 2.20.0
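As an aside on the hyperparameters above: `lr_scheduler_type: cosine` with no listed warmup implies a half-cosine decay from `learning_rate: 0.0001` over the run's 999 steps. A minimal sketch of that schedule (the helper name `cosine_lr` is ours, not from Distily or transformers) reproduces the `learning_rate` values logged in the `trainer_state.json` files in this commit:

```python
import math

# Minimal sketch of a warmup-free cosine learning-rate schedule, assuming the
# standard formula lr(step) = base_lr * 0.5 * (1 + cos(pi * step / max_steps)).
# `cosine_lr` is an illustrative helper, not part of Distily's API.
def cosine_lr(step: int, base_lr: float = 1e-4, max_steps: int = 999) -> float:
    progress = step / max_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# At step 16 this matches the first logged learning_rate, 9.993672136294003e-05.
print(cosine_lr(16))
```

This appears consistent with the logged values (e.g. step 16 → ~9.9937e-05), though the exact scheduler implementation is determined by the training code, not this card.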
gpt2_model_card_distily_test/checkpoint-500/config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "_name_or_path": "gpt2",
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 50256,
+   "embd_pdrop": 0.1,
+   "eos_token_id": 50256,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_ctx": 1024,
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": null,
+   "n_layer": 12,
+   "n_positions": 1024,
+   "quantization_config": {
+     "_load_in_4bit": false,
+     "_load_in_8bit": true,
+     "bnb_4bit_compute_dtype": "float32",
+     "bnb_4bit_quant_storage": "uint8",
+     "bnb_4bit_quant_type": "fp4",
+     "bnb_4bit_use_double_quant": false,
+     "llm_int8_enable_fp32_cpu_offload": false,
+     "llm_int8_has_fp16_weight": false,
+     "llm_int8_skip_modules": null,
+     "llm_int8_threshold": 6.0,
+     "load_in_4bit": false,
+     "load_in_8bit": true,
+     "quant_method": "bitsandbytes"
+   },
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 50
+     }
+   },
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.43.3",
+   "use_cache": true,
+   "vocab_size": 50257
+ }
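The `quantization_config` block above records that the checkpoint was saved from a model loaded in 8-bit via bitsandbytes (`load_in_8bit: true`, `load_in_4bit: false`). A minimal sketch of reading those fields back from a `config.json` (pure JSON handling, no transformers dependency; the inline sample is abridged from the config above):

```python
import json

# Minimal sketch: inspect the bitsandbytes quantization settings recorded in a
# config.json. The sample below is an abridged copy of the config shown above.
config = json.loads("""
{"quantization_config": {"load_in_4bit": false, "load_in_8bit": true,
 "llm_int8_threshold": 6.0, "quant_method": "bitsandbytes"}}
""")

q = config["quantization_config"]
mode = "8-bit" if q["load_in_8bit"] else ("4-bit" if q["load_in_4bit"] else "none")
print(q["quant_method"], mode)  # bitsandbytes 8-bit
```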
gpt2_model_card_distily_test/checkpoint-500/generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "transformers_version": "4.43.3"
+ }
gpt2_model_card_distily_test/checkpoint-500/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/checkpoint-500/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6cccbaeb4487bff2af185b2683f0ede2f67c000706ae171f4b693527c0be218c
+ size 248894656
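The three-line files above (and the `optimizer.pt`, `rng_state.pth`, `scheduler.pt`, and `training_args.bin` entries below) are Git LFS pointer files: the repo stores only the object's sha256 oid and byte size, while the actual weights live in LFS storage. A minimal sketch of parsing one (the helper `parse_lfs_pointer` is ours, not part of any library):

```python
# Minimal sketch: parse a Git LFS pointer file of the form shown above
# (version / oid / size, one "key value" pair per line).
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return {
        "version": fields["version"],
        "oid": fields["oid"].partition(":")[2],  # strip the "sha256:" prefix
        "size": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:6cccbaeb4487bff2af185b2683f0ede2f67c000706ae171f4b693527c0be218c
size 248894656
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 248894656
```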
gpt2_model_card_distily_test/checkpoint-500/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd730ec566a05a71aefd1f7eeb8966510fba8311af13e54f8a9e5fd6798d3ad1
+ size 995606906
gpt2_model_card_distily_test/checkpoint-500/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b782f3f9b47529063fd6bac5e25a0d7000e9d78436d9804f16ae2026fdabcddb
+ size 14244
gpt2_model_card_distily_test/checkpoint-500/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f68840a1aaba999aecaf9807369438e206b80288ec6a19259f834e337fed2d5b
+ size 1064
gpt2_model_card_distily_test/checkpoint-500/special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
gpt2_model_card_distily_test/checkpoint-500/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/checkpoint-500/tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "50256": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "model_max_length": 1024,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>"
+ }
gpt2_model_card_distily_test/checkpoint-500/trainer_state.json ADDED
@@ -0,0 +1,295 @@
+ {
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 0.5005005005005005,
+   "eval_steps": 200,
+   "global_step": 500,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0,
+       "eval_enwikippl": 61518.36328125,
+       "eval_frwikippl": 57357.1171875,
+       "eval_zhwikippl": 60678.2734375,
+       "step": 0
+     },
+     {
+       "epoch": 0,
+       "eval_loss": 7104.0,
+       "eval_runtime": 0.1065,
+       "eval_samples_per_second": 9.388,
+       "eval_steps_per_second": 9.388,
+       "step": 0
+     },
+     {
+       "epoch": 0.016016016016016016,
+       "grad_norm": 3328.0,
+       "learning_rate": 9.993672136294003e-05,
+       "loss": 4250.0,
+       "step": 16
+     },
+     {
+       "epoch": 0.03203203203203203,
+       "grad_norm": 1240.0,
+       "learning_rate": 9.974704561919644e-05,
+       "loss": 3010.0,
+       "step": 32
+     },
+     {
+       "epoch": 0.04804804804804805,
+       "grad_norm": 1776.0,
+       "learning_rate": 9.943145286567114e-05,
+       "loss": 2740.0,
+       "step": 48
+     },
+     {
+       "epoch": 0.06406406406406406,
+       "grad_norm": 1864.0,
+       "learning_rate": 9.899074191353648e-05,
+       "loss": 2546.5,
+       "step": 64
+     },
+     {
+       "epoch": 0.08008008008008008,
+       "grad_norm": 1504.0,
+       "learning_rate": 9.8426028266328e-05,
+       "loss": 2410.0,
+       "step": 80
+     },
+     {
+       "epoch": 0.0960960960960961,
+       "grad_norm": 2008.0,
+       "learning_rate": 9.773874129644268e-05,
+       "loss": 2328.0,
+       "step": 96
+     },
+     {
+       "epoch": 0.11211211211211211,
+       "grad_norm": 1256.0,
+       "learning_rate": 9.693062062718947e-05,
+       "loss": 2412.5,
+       "step": 112
+     },
+     {
+       "epoch": 0.12812812812812813,
+       "grad_norm": 1456.0,
+       "learning_rate": 9.600371172954957e-05,
+       "loss": 2346.0,
+       "step": 128
+     },
+     {
+       "epoch": 0.14414414414414414,
+       "grad_norm": 1112.0,
+       "learning_rate": 9.496036074479184e-05,
+       "loss": 2235.75,
+       "step": 144
+     },
+     {
+       "epoch": 0.16016016016016016,
+       "grad_norm": 1160.0,
+       "learning_rate": 9.380320854604791e-05,
+       "loss": 2376.0,
+       "step": 160
+     },
+     {
+       "epoch": 0.17617617617617617,
+       "grad_norm": 988.0,
+       "learning_rate": 9.253518405387808e-05,
+       "loss": 2175.0,
+       "step": 176
+     },
+     {
+       "epoch": 0.1921921921921922,
+       "grad_norm": 984.0,
+       "learning_rate": 9.115949682274728e-05,
+       "loss": 2339.5,
+       "step": 192
+     },
+     {
+       "epoch": 0.2002002002002002,
+       "eval_enwikippl": 1984.46826171875,
+       "eval_frwikippl": 9672.7939453125,
+       "eval_zhwikippl": 121910.375,
+       "step": 200
+     },
+     {
+       "epoch": 0.2002002002002002,
+       "eval_loss": 2192.0,
+       "eval_runtime": 0.0547,
+       "eval_samples_per_second": 18.295,
+       "eval_steps_per_second": 18.295,
+       "step": 200
+     },
+     {
+       "epoch": 0.2082082082082082,
+       "grad_norm": 852.0,
+       "learning_rate": 8.967962891717575e-05,
+       "loss": 2288.5,
+       "step": 208
+     },
+     {
+       "epoch": 0.22422422422422422,
+       "grad_norm": 1232.0,
+       "learning_rate": 8.809932609812726e-05,
+       "loss": 2241.25,
+       "step": 224
+     },
+     {
+       "epoch": 0.24024024024024024,
+       "grad_norm": 764.0,
+       "learning_rate": 8.642258834194306e-05,
+       "loss": 2053.75,
+       "step": 240
+     },
+     {
+       "epoch": 0.25625625625625625,
+       "grad_norm": 796.0,
+       "learning_rate": 8.465365971581986e-05,
+       "loss": 2200.0,
+       "step": 256
+     },
+     {
+       "epoch": 0.2722722722722723,
+       "grad_norm": 1272.0,
+       "learning_rate": 8.279701763545837e-05,
+       "loss": 1987.25,
+       "step": 272
+     },
+     {
+       "epoch": 0.2882882882882883,
+       "grad_norm": 744.0,
+       "learning_rate": 8.085736153207277e-05,
+       "loss": 2096.75,
+       "step": 288
+     },
+     {
+       "epoch": 0.30430430430430433,
+       "grad_norm": 696.0,
+       "learning_rate": 7.88396009574465e-05,
+       "loss": 2164.5,
+       "step": 304
+     },
+     {
+       "epoch": 0.3203203203203203,
+       "grad_norm": 1208.0,
+       "learning_rate": 7.674884315714259e-05,
+       "loss": 2115.25,
+       "step": 320
+     },
+     {
+       "epoch": 0.33633633633633636,
+       "grad_norm": 418.0,
+       "learning_rate": 7.45903801433221e-05,
+       "loss": 1905.5,
+       "step": 336
+     },
+     {
+       "epoch": 0.35235235235235235,
+       "grad_norm": 688.0,
+       "learning_rate": 7.236967529989135e-05,
+       "loss": 2075.5,
+       "step": 352
+     },
+     {
+       "epoch": 0.3683683683683684,
+       "grad_norm": 1064.0,
+       "learning_rate": 7.009234955388256e-05,
+       "loss": 2129.0,
+       "step": 368
+     },
+     {
+       "epoch": 0.3843843843843844,
+       "grad_norm": 450.0,
+       "learning_rate": 6.776416714806969e-05,
+       "loss": 1712.75,
+       "step": 384
+     },
+     {
+       "epoch": 0.4004004004004004,
+       "grad_norm": 792.0,
+       "learning_rate": 6.539102105083139e-05,
+       "loss": 2101.75,
+       "step": 400
+     },
+     {
+       "epoch": 0.4004004004004004,
+       "eval_enwikippl": 1589.3818359375,
+       "eval_frwikippl": 7626.99560546875,
+       "eval_zhwikippl": 74891.5859375,
+       "step": 400
+     },
+     {
+       "epoch": 0.4004004004004004,
+       "eval_loss": 2048.0,
+       "eval_runtime": 0.0545,
+       "eval_samples_per_second": 18.334,
+       "eval_steps_per_second": 18.334,
+       "step": 400
+     },
+     {
+       "epoch": 0.4164164164164164,
+       "grad_norm": 1272.0,
+       "learning_rate": 6.297891804019078e-05,
+       "loss": 1900.5,
+       "step": 416
+     },
+     {
+       "epoch": 0.43243243243243246,
+       "grad_norm": 608.0,
+       "learning_rate": 6.0533963499786314e-05,
+       "loss": 2183.5,
+       "step": 432
+     },
+     {
+       "epoch": 0.44844844844844844,
+       "grad_norm": 972.0,
+       "learning_rate": 5.806234596525762e-05,
+       "loss": 2035.25,
+       "step": 448
+     },
+     {
+       "epoch": 0.4644644644644645,
+       "grad_norm": 1376.0,
+       "learning_rate": 5.557032146016141e-05,
+       "loss": 1959.25,
+       "step": 464
+     },
+     {
+       "epoch": 0.4804804804804805,
+       "grad_norm": 772.0,
+       "learning_rate": 5.306419766106582e-05,
+       "loss": 1929.25,
+       "step": 480
+     },
+     {
+       "epoch": 0.4964964964964965,
+       "grad_norm": 880.0,
+       "learning_rate": 5.055031793190323e-05,
+       "loss": 1926.5,
+       "step": 496
+     }
+   ],
+   "logging_steps": 16,
+   "max_steps": 999,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 1,
+   "save_steps": 500,
+   "stateful_callbacks": {
+     "TrainerControl": {
+       "args": {
+         "should_epoch_stop": false,
+         "should_evaluate": false,
+         "should_log": false,
+         "should_save": true,
+         "should_training_stop": false
+       },
+       "attributes": {}
+     }
+   },
+   "total_flos": 261292032000000.0,
+   "train_batch_size": 1,
+   "trial_name": null,
+   "trial_params": null
+ }
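The `log_history` array above interleaves two kinds of records: training steps (which carry a `loss` key, logged every 16 steps) and evaluation passes (which carry `eval_*` keys, logged every 200 steps). A minimal sketch of splitting them apart, using a small inline sample copied from the entries above:

```python
import json

# Minimal sketch: separate training-step records from evaluation records in a
# trainer_state.json log_history. The sample is abridged from the file above.
sample = json.loads("""
{"log_history": [
  {"epoch": 0.016016016016016016, "grad_norm": 3328.0, "loss": 4250.0, "step": 16},
  {"epoch": 0.2002002002002002, "eval_enwikippl": 1984.46826171875, "step": 200},
  {"epoch": 0.2002002002002002, "eval_loss": 2192.0, "step": 200}
]}
""")

train = [e for e in sample["log_history"] if "loss" in e]
evals = [e for e in sample["log_history"] if any(k.startswith("eval_") for k in e)]
print(len(train), len(evals))  # 1 2
```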
gpt2_model_card_distily_test/checkpoint-500/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3de33db1e43c0c23c28487ed3633e712616641870c7cc2ce0241e293a6c76792
+ size 907106628
gpt2_model_card_distily_test/checkpoint-500/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/checkpoint-999/config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "_name_or_path": "gpt2",
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 50256,
+   "embd_pdrop": 0.1,
+   "eos_token_id": 50256,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_ctx": 1024,
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": null,
+   "n_layer": 12,
+   "n_positions": 1024,
+   "quantization_config": {
+     "_load_in_4bit": false,
+     "_load_in_8bit": true,
+     "bnb_4bit_compute_dtype": "float32",
+     "bnb_4bit_quant_storage": "uint8",
+     "bnb_4bit_quant_type": "fp4",
+     "bnb_4bit_use_double_quant": false,
+     "llm_int8_enable_fp32_cpu_offload": false,
+     "llm_int8_has_fp16_weight": false,
+     "llm_int8_skip_modules": null,
+     "llm_int8_threshold": 6.0,
+     "load_in_4bit": false,
+     "load_in_8bit": true,
+     "quant_method": "bitsandbytes"
+   },
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 50
+     }
+   },
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.43.3",
+   "use_cache": true,
+   "vocab_size": 50257
+ }
gpt2_model_card_distily_test/checkpoint-999/generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "transformers_version": "4.43.3"
+ }
gpt2_model_card_distily_test/checkpoint-999/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/checkpoint-999/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d03342bb9f778d0d92073e8be4e66a26e6e958a0d14aeabb0cb60e8916421b3f
+ size 248894656
gpt2_model_card_distily_test/checkpoint-999/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7004c27961d0274e546ba2fbbf04bf733ab3dc09cb104e49a999b5542668f29b
+ size 995606906
gpt2_model_card_distily_test/checkpoint-999/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a11d700b671c173ca256b19d0266052b0a389a76dbe427f294f0b31b2cb7f5d3
+ size 14244
gpt2_model_card_distily_test/checkpoint-999/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bdccb8c519e194556c3d0c4f56e9b8a95d6741ddbc10302afbb31bb28501d358
+ size 1064
gpt2_model_card_distily_test/checkpoint-999/special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
gpt2_model_card_distily_test/checkpoint-999/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/checkpoint-999/tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "50256": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "model_max_length": 1024,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>"
+ }
gpt2_model_card_distily_test/checkpoint-999/trainer_state.json ADDED
@@ -0,0 +1,542 @@
+ {
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 1.0,
+   "eval_steps": 200,
+   "global_step": 999,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0,
+       "eval_enwikippl": 61518.36328125,
+       "eval_frwikippl": 57357.1171875,
+       "eval_zhwikippl": 60678.2734375,
+       "step": 0
+     },
+     {
+       "epoch": 0,
+       "eval_loss": 7104.0,
+       "eval_runtime": 0.1065,
+       "eval_samples_per_second": 9.388,
+       "eval_steps_per_second": 9.388,
+       "step": 0
+     },
+     {
+       "epoch": 0.016016016016016016,
+       "grad_norm": 3328.0,
+       "learning_rate": 9.993672136294003e-05,
+       "loss": 4250.0,
+       "step": 16
+     },
+     {
+       "epoch": 0.03203203203203203,
+       "grad_norm": 1240.0,
+       "learning_rate": 9.974704561919644e-05,
+       "loss": 3010.0,
+       "step": 32
+     },
+     {
+       "epoch": 0.04804804804804805,
+       "grad_norm": 1776.0,
+       "learning_rate": 9.943145286567114e-05,
+       "loss": 2740.0,
+       "step": 48
+     },
+     {
+       "epoch": 0.06406406406406406,
+       "grad_norm": 1864.0,
+       "learning_rate": 9.899074191353648e-05,
+       "loss": 2546.5,
+       "step": 64
+     },
+     {
+       "epoch": 0.08008008008008008,
+       "grad_norm": 1504.0,
+       "learning_rate": 9.8426028266328e-05,
+       "loss": 2410.0,
+       "step": 80
+     },
+     {
+       "epoch": 0.0960960960960961,
+       "grad_norm": 2008.0,
+       "learning_rate": 9.773874129644268e-05,
+       "loss": 2328.0,
+       "step": 96
+     },
+     {
+       "epoch": 0.11211211211211211,
+       "grad_norm": 1256.0,
+       "learning_rate": 9.693062062718947e-05,
+       "loss": 2412.5,
+       "step": 112
+     },
+     {
+       "epoch": 0.12812812812812813,
+       "grad_norm": 1456.0,
+       "learning_rate": 9.600371172954957e-05,
+       "loss": 2346.0,
+       "step": 128
+     },
+     {
+       "epoch": 0.14414414414414414,
+       "grad_norm": 1112.0,
+       "learning_rate": 9.496036074479184e-05,
+       "loss": 2235.75,
+       "step": 144
+     },
+     {
+       "epoch": 0.16016016016016016,
+       "grad_norm": 1160.0,
+       "learning_rate": 9.380320854604791e-05,
+       "loss": 2376.0,
+       "step": 160
+     },
+     {
+       "epoch": 0.17617617617617617,
+       "grad_norm": 988.0,
+       "learning_rate": 9.253518405387808e-05,
+       "loss": 2175.0,
+       "step": 176
+     },
+     {
+       "epoch": 0.1921921921921922,
+       "grad_norm": 984.0,
+       "learning_rate": 9.115949682274728e-05,
+       "loss": 2339.5,
+       "step": 192
+     },
+     {
+       "epoch": 0.2002002002002002,
+       "eval_enwikippl": 1984.46826171875,
+       "eval_frwikippl": 9672.7939453125,
+       "eval_zhwikippl": 121910.375,
+       "step": 200
+     },
+     {
+       "epoch": 0.2002002002002002,
+       "eval_loss": 2192.0,
+       "eval_runtime": 0.0547,
+       "eval_samples_per_second": 18.295,
+       "eval_steps_per_second": 18.295,
+       "step": 200
+     },
+     {
+       "epoch": 0.2082082082082082,
+       "grad_norm": 852.0,
+       "learning_rate": 8.967962891717575e-05,
+       "loss": 2288.5,
+       "step": 208
+     },
+     {
+       "epoch": 0.22422422422422422,
+       "grad_norm": 1232.0,
+       "learning_rate": 8.809932609812726e-05,
+       "loss": 2241.25,
+       "step": 224
+     },
+     {
+       "epoch": 0.24024024024024024,
+       "grad_norm": 764.0,
+       "learning_rate": 8.642258834194306e-05,
+       "loss": 2053.75,
+       "step": 240
+     },
+     {
+       "epoch": 0.25625625625625625,
+       "grad_norm": 796.0,
+       "learning_rate": 8.465365971581986e-05,
+       "loss": 2200.0,
+       "step": 256
+     },
+     {
+       "epoch": 0.2722722722722723,
+       "grad_norm": 1272.0,
+       "learning_rate": 8.279701763545837e-05,
+       "loss": 1987.25,
+       "step": 272
+     },
+     {
+       "epoch": 0.2882882882882883,
+       "grad_norm": 744.0,
+       "learning_rate": 8.085736153207277e-05,
+       "loss": 2096.75,
+       "step": 288
+     },
+     {
+       "epoch": 0.30430430430430433,
+       "grad_norm": 696.0,
+       "learning_rate": 7.88396009574465e-05,
+       "loss": 2164.5,
+       "step": 304
+     },
+     {
+       "epoch": 0.3203203203203203,
+       "grad_norm": 1208.0,
+       "learning_rate": 7.674884315714259e-05,
+       "loss": 2115.25,
+       "step": 320
+     },
+     {
+       "epoch": 0.33633633633633636,
+       "grad_norm": 418.0,
+       "learning_rate": 7.45903801433221e-05,
+       "loss": 1905.5,
+       "step": 336
+     },
+     {
+       "epoch": 0.35235235235235235,
+       "grad_norm": 688.0,
+       "learning_rate": 7.236967529989135e-05,
+       "loss": 2075.5,
+       "step": 352
+     },
+     {
+       "epoch": 0.3683683683683684,
+       "grad_norm": 1064.0,
+       "learning_rate": 7.009234955388256e-05,
+       "loss": 2129.0,
+       "step": 368
+     },
+     {
+       "epoch": 0.3843843843843844,
+       "grad_norm": 450.0,
+       "learning_rate": 6.776416714806969e-05,
+       "loss": 1712.75,
+       "step": 384
+     },
+     {
+       "epoch": 0.4004004004004004,
+       "grad_norm": 792.0,
+       "learning_rate": 6.539102105083139e-05,
+       "loss": 2101.75,
+       "step": 400
+     },
+     {
+       "epoch": 0.4004004004004004,
+       "eval_enwikippl": 1589.3818359375,
+       "eval_frwikippl": 7626.99560546875,
+       "eval_zhwikippl": 74891.5859375,
+       "step": 400
+     },
+     {
+       "epoch": 0.4004004004004004,
+       "eval_loss": 2048.0,
+       "eval_runtime": 0.0545,
+       "eval_samples_per_second": 18.334,
+       "eval_steps_per_second": 18.334,
+       "step": 400
+     },
+     {
+       "epoch": 0.4164164164164164,
+       "grad_norm": 1272.0,
+       "learning_rate": 6.297891804019078e-05,
+       "loss": 1900.5,
+       "step": 416
+     },
+     {
+       "epoch": 0.43243243243243246,
+       "grad_norm": 608.0,
+       "learning_rate": 6.0533963499786314e-05,
+       "loss": 2183.5,
+       "step": 432
+     },
+     {
+       "epoch": 0.44844844844844844,
+       "grad_norm": 972.0,
+       "learning_rate": 5.806234596525762e-05,
+       "loss": 2035.25,
+       "step": 448
+     },
+     {
+       "epoch": 0.4644644644644645,
+       "grad_norm": 1376.0,
+       "learning_rate": 5.557032146016141e-05,
+       "loss": 1959.25,
+       "step": 464
+     },
+     {
+       "epoch": 0.4804804804804805,
+       "grad_norm": 772.0,
+       "learning_rate": 5.306419766106582e-05,
+       "loss": 1929.25,
+       "step": 480
+     },
+     {
+       "epoch": 0.4964964964964965,
+       "grad_norm": 880.0,
+       "learning_rate": 5.055031793190323e-05,
+       "loss": 1926.5,
+       "step": 496
+     },
+     {
+       "epoch": 0.5125125125125125,
+       "grad_norm": 368.0,
+       "learning_rate": 4.8035045267993445e-05,
+       "loss": 2041.25,
+       "step": 512
+     },
+     {
+       "epoch": 0.5285285285285285,
+       "grad_norm": 792.0,
+       "learning_rate": 4.552474619037668e-05,
+       "loss": 2122.0,
+       "step": 528
+     },
+     {
+       "epoch": 0.5445445445445446,
+       "grad_norm": 676.0,
+       "learning_rate": 4.3025774631222714e-05,
+       "loss": 2100.5,
+       "step": 544
+     },
+     {
+       "epoch": 0.5605605605605606,
+       "grad_norm": 992.0,
+       "learning_rate": 4.054445585110418e-05,
+       "loss": 2100.5,
+       "step": 560
+     },
+     {
+       "epoch": 0.5765765765765766,
+       "grad_norm": 992.0,
+       "learning_rate": 3.808707042884176e-05,
+       "loss": 1875.5,
+       "step": 576
+     },
+     {
+       "epoch": 0.5925925925925926,
+       "grad_norm": 740.0,
+       "learning_rate": 3.5659838364445505e-05,
+       "loss": 1870.125,
+       "step": 592
+     },
+     {
+       "epoch": 0.6006006006006006,
+       "eval_enwikippl": 1461.5445556640625,
+       "eval_frwikippl": 7612.62939453125,
+       "eval_zhwikippl": 75592.3515625,
+       "step": 600
+     },
+     {
+       "epoch": 0.6006006006006006,
+       "eval_loss": 1968.0,
+       "eval_runtime": 0.0554,
+       "eval_samples_per_second": 18.063,
+       "eval_steps_per_second": 18.063,
+       "step": 600
+     },
+     {
+       "epoch": 0.6086086086086087,
+       "grad_norm": 780.0,
+       "learning_rate": 3.326890333538992e-05,
+       "loss": 1943.5,
+       "step": 608
+     },
+     {
+       "epoch": 0.6246246246246246,
+       "grad_norm": 860.0,
+       "learning_rate": 3.0920317146072576e-05,
+       "loss": 2130.375,
+       "step": 624
+     },
+     {
+       "epoch": 0.6406406406406406,
+       "grad_norm": 532.0,
+       "learning_rate": 2.8620024409816555e-05,
+       "loss": 1940.5,
+       "step": 640
+     },
+     {
+       "epoch": 0.6566566566566566,
+       "grad_norm": 848.0,
+       "learning_rate": 2.637384750218941e-05,
+       "loss": 2033.5,
+       "step": 656
+     },
+     {
+       "epoch": 0.6726726726726727,
+       "grad_norm": 816.0,
+       "learning_rate": 2.4187471823723555e-05,
+       "loss": 1725.0,
+       "step": 672
+     },
+     {
+       "epoch": 0.6886886886886887,
+       "grad_norm": 716.0,
368
+ "learning_rate": 2.2066431409340406e-05,
369
+ "loss": 2027.5,
370
+ "step": 688
371
+ },
372
+ {
373
+ "epoch": 0.7047047047047047,
374
+ "grad_norm": 708.0,
375
+ "learning_rate": 2.001609492090276e-05,
376
+ "loss": 1886.5,
377
+ "step": 704
378
+ },
379
+ {
380
+ "epoch": 0.7207207207207207,
381
+ "grad_norm": 804.0,
382
+ "learning_rate": 1.8041652058350767e-05,
383
+ "loss": 1867.875,
384
+ "step": 720
385
+ },
386
+ {
387
+ "epoch": 0.7367367367367368,
388
+ "grad_norm": 704.0,
389
+ "learning_rate": 1.6148100423816187e-05,
390
+ "loss": 2029.25,
391
+ "step": 736
392
+ },
393
+ {
394
+ "epoch": 0.7527527527527528,
395
+ "grad_norm": 736.0,
396
+ "learning_rate": 1.4340232871964493e-05,
397
+ "loss": 1824.0,
398
+ "step": 752
399
+ },
400
+ {
401
+ "epoch": 0.7687687687687688,
402
+ "grad_norm": 724.0,
403
+ "learning_rate": 1.2622625378582331e-05,
404
+ "loss": 1832.5,
405
+ "step": 768
406
+ },
407
+ {
408
+ "epoch": 0.7847847847847848,
409
+ "grad_norm": 948.0,
410
+ "learning_rate": 1.099962545811709e-05,
411
+ "loss": 2030.75,
412
+ "step": 784
413
+ },
414
+ {
415
+ "epoch": 0.8008008008008008,
416
+ "grad_norm": 832.0,
417
+ "learning_rate": 9.475341159485395e-06,
418
+ "loss": 1892.125,
419
+ "step": 800
420
+ },
421
+ {
422
+ "epoch": 0.8008008008008008,
423
+ "eval_enwikippl": 1401.9130859375,
424
+ "eval_frwikippl": 7065.296875,
425
+ "eval_zhwikippl": 59395.56640625,
426
+ "step": 800
427
+ },
428
+ {
429
+ "epoch": 0.8008008008008008,
430
+ "eval_loss": 1960.0,
431
+ "eval_runtime": 0.0547,
432
+ "eval_samples_per_second": 18.283,
433
+ "eval_steps_per_second": 18.283,
434
+ "step": 800
435
+ },
436
+ {
437
+ "epoch": 0.8168168168168168,
438
+ "grad_norm": 1096.0,
439
+ "learning_rate": 8.053630668003642e-06,
440
+ "loss": 1996.5,
441
+ "step": 816
442
+ },
443
+ {
444
+ "epoch": 0.8328328328328328,
445
+ "grad_norm": 784.0,
446
+ "learning_rate": 6.738092539759589e-06,
447
+ "loss": 1876.75,
448
+ "step": 832
449
+ },
450
+ {
451
+ "epoch": 0.8488488488488488,
452
+ "grad_norm": 676.0,
453
+ "learning_rate": 5.532056593143492e-06,
454
+ "loss": 2137.0,
455
+ "step": 848
456
+ },
457
+ {
458
+ "epoch": 0.8648648648648649,
459
+ "grad_norm": 1192.0,
460
+ "learning_rate": 4.43857548059321e-06,
461
+ "loss": 1997.0,
462
+ "step": 864
463
+ },
464
+ {
465
+ "epoch": 0.8808808808808809,
466
+ "grad_norm": 1096.0,
467
+ "learning_rate": 3.4604169618868977e-06,
468
+ "loss": 1888.5,
469
+ "step": 880
470
+ },
471
+ {
472
+ "epoch": 0.8968968968968969,
473
+ "grad_norm": 676.0,
474
+ "learning_rate": 2.6000568985402317e-06,
475
+ "loss": 2067.5,
476
+ "step": 896
477
+ },
478
+ {
479
+ "epoch": 0.9129129129129129,
480
+ "grad_norm": 756.0,
481
+ "learning_rate": 1.8596729870407837e-06,
482
+ "loss": 1946.5,
483
+ "step": 912
484
+ },
485
+ {
486
+ "epoch": 0.928928928928929,
487
+ "grad_norm": 836.0,
488
+ "learning_rate": 1.241139246781392e-06,
489
+ "loss": 1975.75,
490
+ "step": 928
491
+ },
492
+ {
493
+ "epoch": 0.944944944944945,
494
+ "grad_norm": 1328.0,
495
+ "learning_rate": 7.460212766444263e-07,
496
+ "loss": 1908.5,
497
+ "step": 944
498
+ },
499
+ {
500
+ "epoch": 0.960960960960961,
501
+ "grad_norm": 612.0,
502
+ "learning_rate": 3.755722922432481e-07,
503
+ "loss": 1877.25,
504
+ "step": 960
505
+ },
506
+ {
507
+ "epoch": 0.9769769769769769,
508
+ "grad_norm": 796.0,
509
+ "learning_rate": 1.3072995385119412e-07,
510
+ "loss": 1873.25,
511
+ "step": 976
512
+ },
513
+ {
514
+ "epoch": 0.992992992992993,
515
+ "grad_norm": 972.0,
516
+ "learning_rate": 1.2113993046969363e-08,
517
+ "loss": 1890.5,
518
+ "step": 992
519
+ }
520
+ ],
521
+ "logging_steps": 16,
522
+ "max_steps": 999,
523
+ "num_input_tokens_seen": 0,
524
+ "num_train_epochs": 1,
525
+ "save_steps": 500,
526
+ "stateful_callbacks": {
527
+ "TrainerControl": {
528
+ "args": {
529
+ "should_epoch_stop": false,
530
+ "should_evaluate": false,
531
+ "should_log": false,
532
+ "should_save": true,
533
+ "should_training_stop": true
534
+ },
535
+ "attributes": {}
536
+ }
537
+ },
538
+ "total_flos": 522061479936000.0,
539
+ "train_batch_size": 1,
540
+ "trial_name": null,
541
+ "trial_params": null
542
+ }
gpt2_model_card_distily_test/checkpoint-999/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3de33db1e43c0c23c28487ed3633e712616641870c7cc2ce0241e293a6c76792
+ size 907106628
gpt2_model_card_distily_test/checkpoint-999/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/config.json ADDED
@@ -0,0 +1,54 @@
+ {
+ "_name_or_path": "gpt2",
+ "activation_function": "gelu_new",
+ "architectures": [
+ "GPT2LMHeadModel"
+ ],
+ "attn_pdrop": 0.1,
+ "bos_token_id": 50256,
+ "embd_pdrop": 0.1,
+ "eos_token_id": 50256,
+ "initializer_range": 0.02,
+ "layer_norm_epsilon": 1e-05,
+ "model_type": "gpt2",
+ "n_ctx": 1024,
+ "n_embd": 768,
+ "n_head": 12,
+ "n_inner": null,
+ "n_layer": 12,
+ "n_positions": 1024,
+ "quantization_config": {
+ "_load_in_4bit": false,
+ "_load_in_8bit": true,
+ "bnb_4bit_compute_dtype": "float32",
+ "bnb_4bit_quant_storage": "uint8",
+ "bnb_4bit_quant_type": "fp4",
+ "bnb_4bit_use_double_quant": false,
+ "llm_int8_enable_fp32_cpu_offload": false,
+ "llm_int8_has_fp16_weight": false,
+ "llm_int8_skip_modules": null,
+ "llm_int8_threshold": 6.0,
+ "load_in_4bit": false,
+ "load_in_8bit": true,
+ "quant_method": "bitsandbytes"
+ },
+ "reorder_and_upcast_attn": false,
+ "resid_pdrop": 0.1,
+ "scale_attn_by_inverse_layer_idx": false,
+ "scale_attn_weights": true,
+ "summary_activation": null,
+ "summary_first_dropout": 0.1,
+ "summary_proj_to_labels": true,
+ "summary_type": "cls_index",
+ "summary_use_proj": true,
+ "task_specific_params": {
+ "text-generation": {
+ "do_sample": true,
+ "max_length": 50
+ }
+ },
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.43.3",
+ "use_cache": true,
+ "vocab_size": 50257
+ }
gpt2_model_card_distily_test/generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 50256,
+ "eos_token_id": 50256,
+ "transformers_version": "4.43.3"
+ }
gpt2_model_card_distily_test/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d03342bb9f778d0d92073e8be4e66a26e6e958a0d14aeabb0cb60e8916421b3f
+ size 248894656
gpt2_model_card_distily_test/runs/Aug05_20-55-15_232a0f8c3879/events.out.tfevents.1722891394.232a0f8c3879 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d947cef8023e6dbf1bca80bceb362ae24a3f9289744bc37da5fba08a35e3ecbe
+ size 21597
gpt2_model_card_distily_test/special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "bos_token": "<|endoftext|>",
+ "eos_token": "<|endoftext|>",
+ "pad_token": "<|endoftext|>",
+ "unk_token": "<|endoftext|>"
+ }
gpt2_model_card_distily_test/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
gpt2_model_card_distily_test/tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "50256": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<|endoftext|>",
+ "clean_up_tokenization_spaces": true,
+ "eos_token": "<|endoftext|>",
+ "model_max_length": 1024,
+ "pad_token": "<|endoftext|>",
+ "tokenizer_class": "GPT2Tokenizer",
+ "unk_token": "<|endoftext|>"
+ }
gpt2_model_card_distily_test/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3de33db1e43c0c23c28487ed3633e712616641870c7cc2ce0241e293a6c76792
+ size 907106628
gpt2_model_card_distily_test/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d03342bb9f778d0d92073e8be4e66a26e6e958a0d14aeabb0cb60e8916421b3f
+ oid sha256:a3d983e8d8b5d2b611125d478054e1ce1dcd475a8074b03891b83e88ceca4d3f
  size 248894656
runs/Aug05_21-11-07_232a0f8c3879/events.out.tfevents.1722892417.232a0f8c3879 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:864cf6e31704307b70585d73f3f4ab4e2e36faf678a59cae488e15bcb0bf5056
+ size 10744
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3de33db1e43c0c23c28487ed3633e712616641870c7cc2ce0241e293a6c76792
+ oid sha256:a96522edf3af8b738ca8c29550c8a6d85da79075ee5a027c3dc39a63ecc8940a
  size 907106628