Commit aeffd76 by lysandre (HF staff)
Parent(s): 4755549

Create ALBERT Base v1 readme

Files changed (1): README.md (+267 -0)
---
tags:
- exbert

language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---
# ALBERT Base v1

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1909.11942) and first released in
[this repository](https://github.com/google-research/albert). This model, like all ALBERT models, is uncased: it does not make a difference
between english and English.

Disclaimer: The team releasing ALBERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

ALBERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Sentence Ordering Prediction (SOP): ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the ALBERT model as inputs.

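For illustration, here is a minimal sketch of that feature-based approach (added to this card, not part of the original): ALBERT is kept frozen and its `[CLS]` features feed a simple scikit-learn classifier. The sentences, labels and the choice of scikit-learn are illustrative assumptions, and the code assumes a recent transformers version whose model outputs expose `.last_hidden_state`.

```python
# Hypothetical sketch: frozen ALBERT features + a simple scikit-learn classifier.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')
model = AlbertModel.from_pretrained('albert-base-v1')
model.eval()

# Toy labeled sentences, for illustration only
sentences = ["I loved this movie.", "This film was terrible.",
             "A wonderful experience.", "A complete waste of time."]
labels = [1, 0, 1, 0]

with torch.no_grad():
    encoded = tokenizer(sentences, padding=True, return_tensors='pt')
    # last_hidden_state has shape (batch, seq_len, 768); take the [CLS] vector
    features = model(**encoded).last_hidden_state[:, 0, :].numpy()

classifier = LogisticRegression().fit(features, labels)
print(classifier.predict(features))
```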
ALBERT is particular in that it shares its layer parameters across its Transformer encoder, so all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to that of a BERT-like architecture with the same number of hidden layers, since it still has to iterate through the same number of (repeating) layers.

This is the first version of the base model. Version 2 differs from version 1 in its dropout rates, additional training data, and longer training; it achieves better results on nearly all downstream tasks.

This model has the following configuration:

- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters

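As a quick sanity check (a sketch added here, not part of the original card), these numbers can be read off the model configuration; the attribute names below are those of `AlbertConfig` in the transformers library.

```python
# Sketch: inspect the configuration listed above for albert-base-v1.
from transformers import AlbertConfig, AlbertModel

config = AlbertConfig.from_pretrained('albert-base-v1')
print(config.num_hidden_layers)    # 12 repeating layers
print(config.num_hidden_groups)    # 1, i.e. all layers share the same parameters
print(config.embedding_size)       # 128 embedding dimension
print(config.hidden_size)          # 768 hidden dimension
print(config.num_attention_heads)  # 12 attention heads

model = AlbertModel.from_pretrained('albert-base-v1')
print(sum(p.numel() for p in model.parameters()))  # roughly 11M parameters
```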
## Intended uses & limitations

You can use the raw model either for masked language modeling or for sentence order prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=albert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.

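As a hedged illustration of the fine-tuning path (not part of the original card), the snippet below loads the checkpoint with a sequence-classification head; `num_labels=2`, the input sentence and the label are placeholder assumptions, and a recent transformers version is assumed.

```python
# Sketch: load ALBERT with a (randomly initialized) classification head for fine-tuning.
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')
model = AlbertForSequenceClassification.from_pretrained('albert-base-v1', num_labels=2)

inputs = tokenizer("A great movie.", return_tensors='pt')
outputs = model(**inputs, labels=torch.tensor([1]))
print(outputs.loss, outputs.logits)  # the loss you would backpropagate during fine-tuning
```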
### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-base-v1')
>>> unmasker("Hello I'm a [MASK] model.")
[
   {
      "sequence":"[CLS] hello i'm a modeling model.[SEP]",
      "score":0.05816134437918663,
      "token":12807,
      "token_str":"▁modeling"
   },
   {
      "sequence":"[CLS] hello i'm a modelling model.[SEP]",
      "score":0.03748830780386925,
      "token":23089,
      "token_str":"▁modelling"
   },
   {
      "sequence":"[CLS] hello i'm a model model.[SEP]",
      "score":0.033725276589393616,
      "token":1061,
      "token_str":"▁model"
   },
   {
      "sequence":"[CLS] hello i'm a runway model.[SEP]",
      "score":0.017313428223133087,
      "token":8014,
      "token_str":"▁runway"
   },
   {
      "sequence":"[CLS] hello i'm a lingerie model.[SEP]",
      "score":0.014405295252799988,
      "token":29104,
      "token_str":"▁lingerie"
   }
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')
model = AlbertModel.from_pretrained("albert-base-v1")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')
model = TFAlbertModel.from_pretrained("albert-base-v1")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-base-v1')
>>> unmasker("The man worked as a [MASK].")

[
   {
      "sequence":"[CLS] the man worked as a chauffeur.[SEP]",
      "score":0.029577180743217468,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the man worked as a janitor.[SEP]",
      "score":0.028865724802017212,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the man worked as a shoemaker.[SEP]",
      "score":0.02581118606030941,
      "token":29024,
      "token_str":"▁shoemaker"
   },
   {
      "sequence":"[CLS] the man worked as a blacksmith.[SEP]",
      "score":0.01849772222340107,
      "token":21238,
      "token_str":"▁blacksmith"
   },
   {
      "sequence":"[CLS] the man worked as a lawyer.[SEP]",
      "score":0.01820771023631096,
      "token":3672,
      "token_str":"▁lawyer"
   }
]

>>> unmasker("The woman worked as a [MASK].")

[
   {
      "sequence":"[CLS] the woman worked as a receptionist.[SEP]",
      "score":0.04604868218302727,
      "token":25331,
      "token_str":"▁receptionist"
   },
   {
      "sequence":"[CLS] the woman worked as a janitor.[SEP]",
      "score":0.028220869600772858,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the woman worked as a paramedic.[SEP]",
      "score":0.0261906236410141,
      "token":23386,
      "token_str":"▁paramedic"
   },
   {
      "sequence":"[CLS] the woman worked as a chauffeur.[SEP]",
      "score":0.024797942489385605,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the woman worked as a waitress.[SEP]",
      "score":0.024124596267938614,
      "token":13678,
      "token_str":"▁waitress"
   }
]
```

This bias will also affect all fine-tuned versions of this model.
## Training data

The ALBERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
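As a quick illustration (a sketch added here, not part of the original card), passing a sentence pair to the tokenizer produces this format; the two sentences are placeholders.

```python
# Sketch: encode a (placeholder) sentence pair and look at the resulting format.
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoded['input_ids']))
# roughly: [CLS] sentence a[SEP] sentence b[SEP]  (lowercased, as described above)
```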
### Training

The ALBERT pretraining procedure follows the BERT setup.

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.

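For reference, here is a sketch (added to this card; it is not the original pretraining code) showing how the same 15% / 80-10-10 masking rule can be applied dynamically with the transformers data collator; the input sentence is a placeholder and PyTorch is assumed to be installed.

```python
# Sketch: DataCollatorForLanguageModeling applies the masking rule described above.
from transformers import AlbertTokenizer, DataCollatorForLanguageModeling

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% of the tokens are selected for masking
)

encoded = tokenizer("Replace me by any text you'd like.")
batch = collator([encoded])
print(batch['input_ids'])  # selected tokens become [MASK], a random token, or stay unchanged
print(batch['labels'])     # -100 everywhere except at the selected positions
```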
## Evaluation results

When fine-tuned on downstream tasks, the ALBERT models achieve the following results:

|                | Average | SQuAD1.1 (F1/EM) | SQuAD2.0 (F1/EM) | MNLI | SST-2 | RACE |
|----------------|---------|------------------|------------------|------|-------|------|
| V2             |         |                  |                  |      |       |      |
| ALBERT-base    | 82.3    | 90.2/83.2        | 82.1/79.3        | 84.6 | 92.9  | 66.8 |
| ALBERT-large   | 85.7    | 91.8/85.2        | 84.9/81.8        | 86.5 | 94.9  | 75.2 |
| ALBERT-xlarge  | 87.9    | 92.9/86.4        | 87.9/84.1        | 87.9 | 95.4  | 80.7 |
| ALBERT-xxlarge | 90.9    | 94.6/89.1        | 89.8/86.9        | 90.6 | 96.8  | 86.8 |
| V1             |         |                  |                  |      |       |      |
| ALBERT-base    | 80.1    | 89.3/82.3        | 80.0/77.1        | 81.6 | 90.3  | 64.0 |
| ALBERT-large   | 82.4    | 90.6/83.9        | 82.3/79.4        | 83.5 | 91.7  | 68.5 |
| ALBERT-xlarge  | 85.5    | 92.5/86.1        | 86.1/83.1        | 86.4 | 92.4  | 74.8 |
| ALBERT-xxlarge | 91.0    | 94.8/89.3        | 90.2/87.4        | 90.8 | 96.9  | 86.5 |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1909-11942,
  author    = {Zhenzhong Lan and
               Mingda Chen and
               Sebastian Goodman and
               Kevin Gimpel and
               Piyush Sharma and
               Radu Soricut},
  title     = {{ALBERT:} {A} Lite {BERT} for Self-supervised Learning of Language
               Representations},
  journal   = {CoRR},
  volume    = {abs/1909.11942},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.11942},
  archivePrefix = {arXiv},
  eprint    = {1909.11942},
  timestamp = {Fri, 27 Sep 2019 13:04:21 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-11942.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
<a href="https://huggingface.co/exbert/?model=albert-base-v1">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>