catherinearnett committed
Commit 6fcbe26 · verified · 1 Parent(s): ba253f8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +9 -16
README.md CHANGED
@@ -14,6 +14,8 @@ library_name: transformers
 
 This is a bilingual GPT-2 style model. For the first half of training, this model was trained only on Polish data. In the second half of training, the model was trained on only English data. At the end of training, 50% of training data seen by the model is Polish and 50% is English. The tokenizer was trained on the same overall proportions of data as the language model at the final step.
 
+This model was released alongside the paper [On the Acquisition of Shared Grammatical Representations in Bilingual Language Models](https://arxiv.org/abs/2503.03962), which contains more details about the models. Additionally, the [OSF page](https://osf.io/5cw2e/) provides all code and data related to the project.
+
 ## Model details:
 
 All models are trained with a [CLS] (same as [BOS]) token prepended, and a [SEP] (same as [EOS]) token separating sequences.
@@ -39,23 +41,20 @@ Load the model:
 Note: if you do not specify a revision, it will load the final checkpoint of the model. See above for the list of checkpoints. The checkpoint step is the name of the revision.
 
 ```
-from transformers import AutoTokenizer, AutoModel
-
-tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_pl_en_sequential")
-model = AutoModel.from_pretrained("catherinearnett/B-GPT_pl_en_sequential", revision = "128000")
-
+from transformers import AutoTokenizer, AutoModelForCausalLM
 
-````
+tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_nl_sequential")
+model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_en_nl_sequential", revision = "128000")
+```
 
 Text Generation:
 
 ```
 from transformers import pipeline
 
-pipe = pipeline("text-generation", model="catherinearnett/B-GPT_pl_en_sequential")
+pipe = pipeline("text-generation", model="catherinearnett/B-GPT_en_nl_sequential")
 
-pipe("I am a")
-
+print(pipe("I am a", max_length=20)[0]["generated_text"])
 ```
 
 ## Citation
@@ -63,11 +62,5 @@ pipe("I am a")
 If you use this model, please cite:
 
 ```
-@article{arnett2025acquisition,
-  author = {Catherine Arnett and Tyler A. Chang and James A. Michaelov and Benjamin K. Bergen},
-  title = {On the Acquisition of Shared Grammatical Representations in Bilingual Language Models},
-  journal = {arXiv preprint arXiv:2503.03962},
-  year = {2025},
-  url = {https://arxiv.org/abs/2503.03962}
-}
 ```
+
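
The README loads a specific checkpoint by passing its training step as the revision (e.g. `revision = "128000"`). A minimal supplementary sketch, not part of the commit, shows how the available checkpoint revisions could be listed with `huggingface_hub.list_repo_refs`; the repository id below is an assumption carried over from the earlier version of the card (`B-GPT_pl_en_sequential`) and should be replaced with whichever B-GPT repository is being loaded.

```python
# Sketch only (not from the model card): list the checkpoint revisions of a
# B-GPT repository, assuming each checkpoint is published as a branch whose
# name is the training step.
from huggingface_hub import list_repo_refs

repo_id = "catherinearnett/B-GPT_pl_en_sequential"  # assumed id; substitute as needed
refs = list_repo_refs(repo_id)
for branch in refs.branches:
    print(branch.name)  # e.g. "main", "128000", ...
```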
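The card also states that training prepends a [CLS] token (identical to [BOS]) and separates sequences with [SEP] (identical to [EOS]). The sketch below, again an assumption rather than part of the commit, mirrors that convention at inference time by prepending the tokenizer's CLS/BOS token before generating; the repository id and revision are placeholders to adjust as needed.

```python
# Sketch only (an assumption, not from the model card): generate text while
# explicitly prepending the [CLS]/[BOS] token that the card says was prepended
# during training.
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "catherinearnett/B-GPT_pl_en_sequential"  # assumed id; substitute as needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="128000")

# Fall back to bos_token (or nothing) if the tokenizer does not expose cls_token.
prefix = tokenizer.cls_token or tokenizer.bos_token or ""
inputs = tokenizer(prefix + "I am a", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```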