## Training

OCRonos-Vintage was pre-trained from scratch on a dataset of cultural heritage archives from the Library of Congress, the Internet Archive and HathiTrust, totalling 18 billion tokens.

Pre-training ran for 2 epochs with llm.c (9,060 steps in total) on 4 H100s in two and a half hours. It is one of the first models trained on the new Jean Zay H100 cluster (compute grant n°GC011015451). We used the following command for training, with mostly default hyperparameters, including a short context window of 1,024 tokens.
```bash
srun --ntasks-per-node=4 --gres=gpu:4 ./train_gpt2cu \
    -i "dev/data/english_ocr/us_ocr_instruct_*.bin" \
    -j "dev/data/english_ocr/us_ocr_instruct_*.bin" \
    -o ocr_model_2 \
    -e "d12" \
    -b 128 \
    -t 1024 \
    -d 2097152 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 1000 \
    -v 250 \
    -s 250 \
    -h 1 \
    -x 9060
```
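As a rough sanity check on these flags (assuming llm.c semantics, where `-d` is the total batch size in tokens per optimizer step and `-x` is the step count), the implied token budget can be computed directly:

```python
# Back-of-the-envelope token budget implied by the llm.c flags above.
# Assumption: -d 2097152 is the total batch size in tokens per optimizer
# step, and -x 9060 is the total number of steps.
tokens_per_step = 2_097_152
steps = 9_060

total_tokens = tokens_per_step * steps
print(f"{total_tokens:,} tokens processed")  # 19,000,197,120 tokens processed
```

That is on the order of 19 billion tokens processed in total, matching the scale of the corpus described above.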
Tokenization is currently done with the GPT-2 tokenizer. It will eventually be replaced by a custom tokenizer that provides better performance and compression for cultural heritage archives and noisy OCR sources.

OCRonos-Vintage is an *historical* language model with a hard cut-off date of December 29th, 1955, and the vast majority of its training data predates 1940. Roughly 65% of the content was published between 1880 and 1920.