Pclanglais committed on
Commit aca113b · verified · 1 Parent(s): 62752ce

Update README.md

Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -11,7 +11,31 @@ OCRonos-Vintage is only 124 million parameters. It can run easily on CPU or prov
  ## Training
  OCRonos-Vintage was pre-trained from scratch on a dataset of cultural heritage archives from the Library of Congress, Internet Archive and Hathi Trust totalling 18 billion tokens.
 
- Pre-training ran for 2 epochs with llm.c (9060 steps total) on 4 H100s for two and a half hours. It is one of the first models trained on the new Jean Zay H100 cluster (compute grant n°GC011015451).
+ Pre-training ran for 2 epochs with llm.c (9060 steps total) on 4 H100s for two and a half hours. It is one of the first models trained on the new Jean Zay H100 cluster (compute grant n°GC011015451). We trained with the following command, which keeps mostly default hyperparameters, including a short context window of 1,024 tokens.
+
+ ```bash
+ srun --ntasks-per-node=4 --gres=gpu:4 ./train_gpt2cu \
+ -i "dev/data/english_ocr/us_ocr_instruct_*.bin" \
+ -j "dev/data/english_ocr/us_ocr_instruct_*.bin" \
+ -o ocr_model_2 \
+ -e "d12" \
+ -b 128 \
+ -t 1024 \
+ -d 2097152 \
+ -r 1 \
+ -z 1 \
+ -c 0.1 \
+ -l 0.0006 \
+ -q 0.0 \
+ -u 700 \
+ -n 1000 \
+ -v 250 \
+ -s 250 \
+ -h 1 \
+ -x 9060
+ ```
+
+ Tokenization is currently done with the GPT-2 tokenizer. It will eventually be replaced by a custom tokenizer that should provide better performance and compression for cultural heritage archives and noisy OCR sources.
 
  OCRonos-Vintage is an *historical* language model with a hard cut-off date of December 29th, 1955, with the vast majority of the content dating from before 1940. Roughly 65% of the content was published between 1880 and 1920.
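
The batch-size flags in the training command can be sanity-checked with a little shell arithmetic. The sketch below assumes llm.c's usual reading of the flags (`-b` as the per-GPU micro-batch in sequences, `-t` as the sequence length, `-d` as the total batch size in tokens per optimizer step) and one process per H100; the variable names are illustrative and not part of llm.c.

```bash
#!/usr/bin/env bash
# Values copied from the training command; variable names are illustrative.
micro_batch=128        # -b: sequences per GPU per forward/backward pass (assumed meaning)
seq_len=1024           # -t: tokens per sequence
total_batch=2097152    # -d: tokens per optimizer step (assumed meaning)
num_gpus=4             # --gres=gpu:4, one process per GPU
max_steps=9060         # -x: total optimizer steps

# Tokens processed by all GPUs in one forward/backward pass.
tokens_per_pass=$(( micro_batch * seq_len * num_gpus ))

# Gradient-accumulation passes needed to reach the total batch size.
grad_accum=$(( total_batch / tokens_per_pass ))

# Total tokens seen over the whole run.
total_tokens=$(( total_batch * max_steps ))

echo "tokens per pass:   ${tokens_per_pass}"
echo "grad accum passes: ${grad_accum}"
echo "total tokens seen: ${total_tokens}"
```

Under these assumptions, each optimizer step accumulates gradients over 4 passes of 524,288 tokens, and the full run processes about 19 billion tokens.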