## Training

OCRonos-Vintage was pre-trained from scratch on a dataset of cultural heritage archives from the Library of Congress, the Internet Archive and HathiTrust, totalling 18 billion tokens.

Pre-training ran for 2 epochs with llm.c (9,060 steps in total) on 4 H100s in two and a half hours. It is one of the first models trained on the new Jean Zay H100 cluster (compute grant n°GC011015451). We used the following command for training, with mostly default hyperparameters, including a short context window of 1,024 tokens.
```bash
srun --ntasks-per-node=4 --gres=gpu:4 ./train_gpt2cu \
    -i "dev/data/english_ocr/us_ocr_instruct_*.bin" \
    -j "dev/data/english_ocr/us_ocr_instruct_*.bin" \
    -o ocr_model_2 \
    -e "d12" \
    -b 128 \
    -t 1024 \
    -d 2097152 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 1000 \
    -v 250 \
    -s 250 \
    -h 1 \
    -x 9060
```
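As a rough sanity check on these flags (assuming llm.c semantics, where `-d` is the total batch size in tokens per optimizer step and `-x` is the step count), the implied token budget can be computed directly:

```python
# Back-of-the-envelope token budget implied by the llm.c flags above.
# Assumption: -d 2097152 is the total batch size in tokens per optimizer
# step, and -x 9060 is the total number of steps.
tokens_per_step = 2_097_152
steps = 9_060

total_tokens = tokens_per_step * steps
print(f"{total_tokens:,} tokens processed")  # 19,000,197,120 tokens processed
```

That is on the order of 19 billion tokens processed in total, matching the scale of the corpus described above.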
Tokenization is currently done with the GPT-2 tokenizer. It will eventually be replaced by a custom tokenizer that provides better performance and compression for cultural heritage archives and noisy OCR sources.

OCRonos-Vintage is an *historical* language model with a hard cut-off date of December 29th, 1955, and the vast majority of its training data predates 1940. Roughly 65% of the content was published between 1880 and 1920.