opensource committed on
Commit
63d283f
·
1 Parent(s): 8e67e07

Update README.md

Files changed (1)
  1. README.md +4 -597
README.md CHANGED
@@ -1,602 +1,9 @@
1
  ---
2
  language: multilingual
3
  ---
 
 
 
4
 
5
- # XLM-R + NER
6
-
7
- This model is [XLM-Roberta-base](https://arxiv.org/abs/1911.02116) fine-tuned on the [Wikiann](https://aclweb.org/anthology/P17-1178) data for the 40 languages proposed in [XTREME](https://github.com/google-research/xtreme). This is still ongoing work, and the results will be updated every time an improvement is reached.
8
-
9
- The covered labels are:
10
- ```
11
- LOC
12
- ORG
13
- PER
14
- O
15
- ```
16
-
17
- ## Metrics on evaluation set:
18
- ### Average over the 40 languages
19
- Number of documents: 262300
20
- ```
21
- precision recall f1-score support
22
-
23
- ORG 0.81 0.81 0.81 102452
24
- PER 0.90 0.91 0.91 108978
25
- LOC 0.86 0.89 0.87 121868
26
-
27
- micro avg 0.86 0.87 0.87 333298
28
- macro avg 0.86 0.87 0.87 333298
29
- ```
30
-
31
- ### Afrikaans
32
- Number of documents: 1000
33
- ```
34
- precision recall f1-score support
35
-
36
- ORG 0.89 0.88 0.88 582
37
- PER 0.89 0.97 0.93 369
38
- LOC 0.84 0.90 0.86 518
39
-
40
- micro avg 0.87 0.91 0.89 1469
41
- macro avg 0.87 0.91 0.89 1469
42
- ```
43
-
44
- ### Arabic
45
- Number of documents: 10000
46
- ```
47
- precision recall f1-score support
48
-
49
- ORG 0.83 0.84 0.84 3507
50
- PER 0.90 0.91 0.91 3643
51
- LOC 0.88 0.89 0.88 3604
52
-
53
- micro avg 0.87 0.88 0.88 10754
54
- macro avg 0.87 0.88 0.88 10754
55
- ```
56
-
57
- ### Basque
58
- Number of documents: 10000
59
- ```
60
- precision recall f1-score support
61
-
62
- LOC 0.88 0.93 0.91 5228
63
- ORG 0.86 0.81 0.83 3654
64
- PER 0.91 0.91 0.91 4072
65
-
66
- micro avg 0.89 0.89 0.89 12954
67
- macro avg 0.89 0.89 0.89 12954
68
- ```
69
-
70
- ### Bengali
71
- Number of documents: 1000
72
- ```
73
- precision recall f1-score support
74
-
75
- ORG 0.86 0.89 0.87 325
76
- LOC 0.91 0.91 0.91 406
77
- PER 0.96 0.95 0.95 364
78
-
79
- micro avg 0.91 0.92 0.91 1095
80
- macro avg 0.91 0.92 0.91 1095
81
- ```
82
-
83
- ### Bulgarian
84
- Number of documents: 1000
85
- ```
86
- precision recall f1-score support
87
-
88
- ORG 0.86 0.83 0.84 3661
89
- PER 0.92 0.95 0.94 4006
90
- LOC 0.92 0.95 0.94 6449
91
-
92
- micro avg 0.91 0.92 0.91 14116
93
- macro avg 0.91 0.92 0.91 14116
94
- ```
95
-
96
- ### Burmese
97
- Number of documents: 100
98
- ```
99
- precision recall f1-score support
100
-
101
- LOC 0.60 0.86 0.71 37
102
- ORG 0.68 0.63 0.66 30
103
- PER 0.44 0.44 0.44 36
104
-
105
- micro avg 0.57 0.65 0.61 103
106
- macro avg 0.57 0.65 0.60 103
107
- ```
108
-
109
- ### Chinese
110
- Number of documents: 10000
111
- ```
112
- precision recall f1-score support
113
-
114
- ORG 0.70 0.69 0.70 4022
115
- LOC 0.76 0.81 0.78 3830
116
- PER 0.84 0.84 0.84 3706
117
-
118
- micro avg 0.76 0.78 0.77 11558
119
- macro avg 0.76 0.78 0.77 11558
120
- ```
121
-
122
- ### Dutch
123
- Number of documents: 10000
124
- ```
125
- precision recall f1-score support
126
-
127
- ORG 0.87 0.87 0.87 3930
128
- PER 0.95 0.95 0.95 4377
129
- LOC 0.91 0.92 0.91 4813
130
-
131
- micro avg 0.91 0.92 0.91 13120
132
- macro avg 0.91 0.92 0.91 13120
133
- ```
134
-
135
- ### English
136
- Number of documents: 10000
137
- ```
138
- precision recall f1-score support
139
-
140
- LOC 0.83 0.84 0.84 4781
141
- PER 0.89 0.90 0.89 4559
142
- ORG 0.75 0.75 0.75 4633
143
-
144
- micro avg 0.82 0.83 0.83 13973
145
- macro avg 0.82 0.83 0.83 13973
146
- ```
147
-
148
- ### Estonian
149
- Number of documents: 10000
150
- ```
151
- precision recall f1-score support
152
-
153
- LOC 0.89 0.92 0.91 5654
154
- ORG 0.85 0.85 0.85 3878
155
- PER 0.94 0.94 0.94 4026
156
-
157
- micro avg 0.90 0.91 0.90 13558
158
- macro avg 0.90 0.91 0.90 13558
159
- ```
160
-
161
- ### Finnish
162
- Number of documents: 10000
163
- ```
164
- precision recall f1-score support
165
-
166
- ORG 0.84 0.83 0.84 4104
167
- LOC 0.88 0.90 0.89 5307
168
- PER 0.95 0.94 0.94 4519
169
-
170
- micro avg 0.89 0.89 0.89 13930
171
- macro avg 0.89 0.89 0.89 13930
172
- ```
173
-
174
- ### French
175
- Number of documents: 10000
176
- ```
177
- precision recall f1-score support
178
-
179
- LOC 0.90 0.89 0.89 4808
180
- ORG 0.84 0.87 0.85 3876
181
- PER 0.94 0.93 0.94 4249
182
-
183
- micro avg 0.89 0.90 0.90 12933
184
- macro avg 0.89 0.90 0.90 12933
185
- ```
186
-
187
- ### Georgian
188
- Number of documents: 10000
189
- ```
190
- precision recall f1-score support
191
-
192
- PER 0.90 0.91 0.90 3964
193
- ORG 0.83 0.77 0.80 3757
194
- LOC 0.82 0.88 0.85 4894
195
-
196
- micro avg 0.84 0.86 0.85 12615
197
- macro avg 0.84 0.86 0.85 12615
198
- ```
199
-
200
- ### German
201
- Number of documents: 10000
202
- ```
203
- precision recall f1-score support
204
-
205
- LOC 0.85 0.90 0.87 4939
206
- PER 0.94 0.91 0.92 4452
207
- ORG 0.79 0.78 0.79 4247
208
-
209
- micro avg 0.86 0.86 0.86 13638
210
- macro avg 0.86 0.86 0.86 13638
211
- ```
212
-
213
- ### Greek
214
- Number of documents: 10000
215
- ```
216
- precision recall f1-score support
217
-
218
- ORG 0.86 0.85 0.85 3771
219
- LOC 0.88 0.91 0.90 4436
220
- PER 0.91 0.93 0.92 3894
221
-
222
- micro avg 0.88 0.90 0.89 12101
223
- macro avg 0.88 0.90 0.89 12101
224
- ```
225
-
226
- ### Hebrew
227
- Number of documents: 10000
228
- ```
229
- precision recall f1-score support
230
-
231
- PER 0.87 0.88 0.87 4206
232
- ORG 0.76 0.75 0.76 4190
233
- LOC 0.85 0.85 0.85 4538
234
-
235
- micro avg 0.83 0.83 0.83 12934
236
- macro avg 0.82 0.83 0.83 12934
237
- ```
238
-
239
- ### Hindi
240
- Number of documents: 1000
241
- ```
242
- precision recall f1-score support
243
-
244
- ORG 0.78 0.81 0.79 362
245
- LOC 0.83 0.85 0.84 422
246
- PER 0.90 0.95 0.92 427
247
-
248
- micro avg 0.84 0.87 0.85 1211
249
- macro avg 0.84 0.87 0.85 1211
250
- ```
251
-
252
- ### Hungarian
253
- Number of documents: 10000
254
- ```
255
- precision recall f1-score support
256
-
257
- PER 0.95 0.95 0.95 4347
258
- ORG 0.87 0.88 0.87 3988
259
- LOC 0.90 0.92 0.91 5544
260
-
261
- micro avg 0.91 0.92 0.91 13879
262
- macro avg 0.91 0.92 0.91 13879
263
- ```
264
-
265
- ### Indonesian
266
- Number of documents: 10000
267
- ```
268
- precision recall f1-score support
269
-
270
- ORG 0.88 0.89 0.88 3735
271
- LOC 0.93 0.95 0.94 3694
272
- PER 0.93 0.93 0.93 3947
273
-
274
- micro avg 0.91 0.92 0.92 11376
275
- macro avg 0.91 0.92 0.92 11376
276
- ```
277
-
278
- ### Italian
279
- Number of documents: 10000
280
- ```
281
- precision recall f1-score support
282
-
283
- LOC 0.88 0.88 0.88 4592
284
- ORG 0.86 0.86 0.86 4088
285
- PER 0.96 0.96 0.96 4732
286
-
287
- micro avg 0.90 0.90 0.90 13412
288
- macro avg 0.90 0.90 0.90 13412
289
- ```
290
-
291
- ### Japanese
292
- Number of documents: 10000
293
- ```
294
- precision recall f1-score support
295
-
296
- ORG 0.62 0.61 0.62 4184
297
- PER 0.76 0.81 0.78 3812
298
- LOC 0.68 0.74 0.71 4281
299
-
300
- micro avg 0.69 0.72 0.70 12277
301
- macro avg 0.69 0.72 0.70 12277
302
- ```
303
-
304
- ### Javanese
305
- Number of documents: 100
306
- ```
307
- precision recall f1-score support
308
-
309
- ORG 0.79 0.80 0.80 46
310
- PER 0.81 0.96 0.88 26
311
- LOC 0.75 0.75 0.75 40
312
-
313
- micro avg 0.78 0.82 0.80 112
314
- macro avg 0.78 0.82 0.80 112
315
- ```
316
-
317
- ### Kazakh
318
- Number of documents: 1000
319
- ```
320
- precision recall f1-score support
321
-
322
- ORG 0.76 0.61 0.68 307
323
- LOC 0.78 0.90 0.84 461
324
- PER 0.87 0.91 0.89 367
325
-
326
- micro avg 0.81 0.83 0.82 1135
327
- macro avg 0.81 0.83 0.81 1135
328
- ```
329
-
330
- ### Korean
331
- Number of documents: 10000
332
- ```
333
- precision recall f1-score support
334
-
335
- LOC 0.86 0.89 0.88 5097
336
- ORG 0.79 0.74 0.77 4218
337
- PER 0.83 0.86 0.84 4014
338
-
339
- micro avg 0.83 0.83 0.83 13329
340
- macro avg 0.83 0.83 0.83 13329
341
- ```
342
-
343
- ### Malay
344
- Number of documents: 1000
345
- ```
346
- precision recall f1-score support
347
-
348
- ORG 0.87 0.89 0.88 368
349
- PER 0.92 0.91 0.91 366
350
- LOC 0.94 0.95 0.95 354
351
-
352
- micro avg 0.91 0.92 0.91 1088
353
- macro avg 0.91 0.92 0.91 1088
354
- ```
355
-
356
- ### Malayalam
357
- Number of documents: 1000
358
- ```
359
- precision recall f1-score support
360
-
361
- ORG 0.75 0.74 0.75 347
362
- PER 0.84 0.89 0.86 417
363
- LOC 0.74 0.75 0.75 391
364
-
365
- micro avg 0.78 0.80 0.79 1155
366
- macro avg 0.78 0.80 0.79 1155
367
- ```
368
-
369
- ### Marathi
370
- Number of documents: 1000
371
- ```
372
- precision recall f1-score support
373
-
374
- PER 0.89 0.94 0.92 394
375
- LOC 0.82 0.84 0.83 457
376
- ORG 0.84 0.78 0.81 339
377
-
378
- micro avg 0.85 0.86 0.85 1190
379
- macro avg 0.85 0.86 0.85 1190
380
- ```
381
-
382
- ### Persian
383
- Number of documents: 10000
384
- ```
385
- precision recall f1-score support
386
-
387
- PER 0.93 0.92 0.93 3540
388
- LOC 0.93 0.93 0.93 3584
389
- ORG 0.89 0.92 0.90 3370
390
-
391
- micro avg 0.92 0.92 0.92 10494
392
- macro avg 0.92 0.92 0.92 10494
393
- ```
394
-
395
- ### Portuguese
396
- Number of documents: 10000
397
- ```
398
- precision recall f1-score support
399
-
400
- LOC 0.90 0.91 0.91 4819
401
- PER 0.94 0.92 0.93 4184
402
- ORG 0.84 0.88 0.86 3670
403
-
404
- micro avg 0.89 0.91 0.90 12673
405
- macro avg 0.90 0.91 0.90 12673
406
- ```
407
-
408
- ### Russian
409
- Number of documents: 10000
410
- ```
411
- precision recall f1-score support
412
-
413
- PER 0.93 0.96 0.95 3574
414
- LOC 0.87 0.89 0.88 4619
415
- ORG 0.82 0.80 0.81 3858
416
-
417
- micro avg 0.87 0.88 0.88 12051
418
- macro avg 0.87 0.88 0.88 12051
419
- ```
420
-
421
- ### Spanish
422
- Number of documents: 10000
423
- ```
424
- precision recall f1-score support
425
-
426
- PER 0.95 0.93 0.94 3891
427
- ORG 0.86 0.88 0.87 3709
428
- LOC 0.89 0.91 0.90 4553
429
-
430
- micro avg 0.90 0.91 0.90 12153
431
- macro avg 0.90 0.91 0.90 12153
432
- ```
433
-
434
- ### Swahili
435
- Number of documents: 1000
436
- ```
437
- precision recall f1-score support
438
-
439
- ORG 0.82 0.85 0.83 349
440
- PER 0.95 0.92 0.94 403
441
- LOC 0.86 0.89 0.88 450
442
-
443
- micro avg 0.88 0.89 0.88 1202
444
- macro avg 0.88 0.89 0.88 1202
445
- ```
446
-
447
- ### Tagalog
448
- Number of documents: 1000
449
- ```
450
- precision recall f1-score support
451
-
452
- LOC 0.90 0.91 0.90 338
453
- ORG 0.83 0.91 0.87 339
454
- PER 0.96 0.93 0.95 350
455
-
456
- micro avg 0.90 0.92 0.91 1027
457
- macro avg 0.90 0.92 0.91 1027
458
- ```
459
-
460
- ### Tamil
461
- Number of documents: 1000
462
- ```
463
- precision recall f1-score support
464
-
465
- PER 0.90 0.92 0.91 392
466
- ORG 0.77 0.76 0.76 370
467
- LOC 0.78 0.81 0.79 421
468
-
469
- micro avg 0.82 0.83 0.82 1183
470
- macro avg 0.82 0.83 0.82 1183
471
- ```
472
-
473
- ### Telugu
474
- Number of documents: 1000
475
- ```
476
- precision recall f1-score support
477
-
478
- ORG 0.67 0.55 0.61 347
479
- LOC 0.78 0.87 0.82 453
480
- PER 0.73 0.86 0.79 393
481
-
482
- micro avg 0.74 0.77 0.76 1193
483
- macro avg 0.73 0.77 0.75 1193
484
- ```
485
-
486
- ### Thai
487
- Number of documents: 10000
488
- ```
489
- precision recall f1-score support
490
-
491
- LOC 0.63 0.76 0.69 3928
492
- PER 0.78 0.83 0.80 6537
493
- ORG 0.59 0.59 0.59 4257
494
-
495
- micro avg 0.68 0.74 0.71 14722
496
- macro avg 0.68 0.74 0.71 14722
497
- ```
498
-
499
- ### Turkish
500
- Number of documents: 10000
501
- ```
502
- precision recall f1-score support
503
-
504
- PER 0.94 0.94 0.94 4337
505
- ORG 0.88 0.89 0.88 4094
506
- LOC 0.90 0.92 0.91 4929
507
-
508
- micro avg 0.90 0.92 0.91 13360
509
- macro avg 0.91 0.92 0.91 13360
510
- ```
511
-
512
- ### Urdu
513
- Number of documents: 1000
514
- ```
515
- precision recall f1-score support
516
-
517
- LOC 0.90 0.95 0.93 352
518
- PER 0.96 0.96 0.96 333
519
- ORG 0.91 0.90 0.90 326
520
-
521
- micro avg 0.92 0.94 0.93 1011
522
- macro avg 0.92 0.94 0.93 1011
523
- ```
524
-
525
- ### Vietnamese
526
- Number of documents: 10000
527
- ```
528
- precision recall f1-score support
529
-
530
- ORG 0.86 0.87 0.86 3579
531
- LOC 0.88 0.91 0.90 3811
532
- PER 0.92 0.93 0.93 3717
533
-
534
- micro avg 0.89 0.90 0.90 11107
535
- macro avg 0.89 0.90 0.90 11107
536
- ```
537
-
538
- ### Yoruba
539
- Number of documents: 100
540
- ```
541
- precision recall f1-score support
542
-
543
- LOC 0.54 0.72 0.62 36
544
- ORG 0.58 0.31 0.41 35
545
- PER 0.77 1.00 0.87 36
546
-
547
- micro avg 0.64 0.68 0.66 107
548
- macro avg 0.63 0.68 0.63 107
549
- ```
550
-
551
- ## Reproduce the results
552
- Download and prepare the dataset from the [XTREME repo](https://github.com/google-research/xtreme#download-the-data). Next, from the root of the transformers repo run:
553
- ```
554
- cd examples/ner
555
- python run_tf_ner.py \
556
- --data_dir . \
557
- --labels ./labels.txt \
558
- --model_name_or_path jplu/tf-xlm-roberta-base \
559
- --output_dir model \
560
- --max_seq_length 128 \
561
- --num_train_epochs 2 \
562
- --per_gpu_train_batch_size 16 \
563
- --per_gpu_eval_batch_size 32 \
564
- --do_train \
565
- --do_eval \
566
- --logging_dir logs \
567
- --mode token-classification \
568
- --evaluate_during_training \
569
- --optimizer_name adamw
570
- ```
571
-
572
- ## Usage with pipelines
573
- ```python
574
- from transformers import pipeline
575
-
576
- nlp_ner = pipeline(
577
- "ner",
578
- model="jplu/tf-xlm-r-ner-40-lang",
579
- tokenizer=(
580
- 'jplu/tf-xlm-r-ner-40-lang',
581
- {"use_fast": True}),
582
- framework="tf"
583
- )
584
-
585
- text_fr = "Barack Obama est né à Hawaï."
586
- text_en = "Barack Obama was born in Hawaii."
587
- text_es = "Barack Obama nació en Hawai."
588
- text_zh = "巴拉克·奧巴馬(Barack Obama)出生於夏威夷。"
589
- text_ar = "ولد باراك أوباما في هاواي."
590
-
591
- nlp_ner(text_fr)
592
- #Output: [{'word': '▁Barack', 'score': 0.9894659519195557, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9888848662376404, 'entity': 'PER'}, {'word': '▁Hawa', 'score': 0.998701810836792, 'entity': 'LOC'}, {'word': 'ï', 'score': 0.9987035989761353, 'entity': 'LOC'}]
593
- nlp_ner(text_en)
594
- #Output: [{'word': '▁Barack', 'score': 0.9929141998291016, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9930834174156189, 'entity': 'PER'}, {'word': '▁Hawaii', 'score': 0.9986202120780945, 'entity': 'LOC'}]
595
- nlp_ner(text_es)
596
- #Output: [{'word': '▁Barack', 'score': 0.9944776296615601, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9949177503585815, 'entity': 'PER'}, {'word': '▁Hawa', 'score': 0.9987911581993103, 'entity': 'LOC'}, {'word': 'i', 'score': 0.9984861612319946, 'entity': 'LOC'}]
597
- nlp_ner(text_zh)
598
- #Output: [{'word': '夏威夷', 'score': 0.9988449215888977, 'entity': 'LOC'}]
599
- nlp_ner(text_ar)
600
- #Output: [{'word': '▁با', 'score': 0.9903655648231506, 'entity': 'PER'}, {'word': 'راك', 'score': 0.9850614666938782, 'entity': 'PER'}, {'word': '▁أوباما', 'score': 0.9850308299064636, 'entity': 'PER'}, {'word': '▁ها', 'score': 0.9477543234825134, 'entity': 'LOC'}, {'word': 'وا', 'score': 0.9428229928016663, 'entity': 'LOC'}, {'word': 'ي', 'score': 0.9319471716880798, 'entity': 'LOC'}]
601
 
602
- ```
 
1
  ---
2
  language: multilingual
3
  ---
4
+ ---
5
+ license: MIT
6
+ ---
7
 
8
+ ## Extract names in any language.
9