metadata
language: fr
tags:
- NER
- camembert
- literary-texts
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- f1
- precision
- recall
base_model:
- almanach/camembert-large
pipeline_tag: token-classification
INTRODUCTION:
This model, developed as part of the BookNLP-fr project, is a NER model built on top of camembert-large embeddings and trained to predict nested entities in French literary texts.
The predicted entities are:
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avion, voitures, calèche, vélos, ...
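To make "nested entities" concrete, here is a minimal sketch of how one mention can sit inside another, using the BIOES scheme listed under the training parameters. The phrase, its annotation, and the helper `spans_to_bioes` are illustrative assumptions, not taken from the corpus or from the BookNLP-fr code; the sketch assumes one tag sequence per nesting layer.

```python
def spans_to_bioes(n_tokens, spans):
    """Encode non-overlapping (start, end_inclusive, label) spans as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"          # single-token mention
        else:
            tags[start] = f"B-{label}"          # first token
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"          # inner tokens
            tags[end] = f"E-{label}"            # last token
    return tags


# Hypothetical annotation: a facility mention containing a character mention.
tokens = ["la", "chambre", "de", "la", "princesse"]
outer = spans_to_bioes(len(tokens), [(0, 4, "FAC")])  # "la chambre de la princesse"
inner = spans_to_bioes(len(tokens), [(3, 4, "PER")])  # "la princesse"

for tok, o, i in zip(tokens, outer, inner):
    print(f"{tok:<10} {o:<6} {i}")
```

Each layer is a flat sequence on its own, so a standard sequence tagger can predict it; the nesting only appears when the layers are read together.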
MODEL PERFORMANCE (LOOCV):
NER_tag | precision | recall | f1_score | support | support % |
---|---|---|---|---|---|
PER | 92.46% | 93.71% | 93.08% | 32,204 | 84.13% |
FAC | 70.63% | 70.94% | 70.78% | 2,295 | 6.00% |
TIME | 58.66% | 57.75% | 58.20% | 1,671 | 4.37% |
GPE | 77.64% | 77.37% | 77.50% | 866 | 2.26% |
LOC | 62.96% | 45.71% | 52.97% | 781 | 2.04% |
VEH | 63.43% | 47.95% | 54.61% | 463 | 1.21% |
micro_avg | 88.39% | 88.87% | 88.58% | 38,280 | 100.00% |
macro_avg | 70.96% | 65.57% | 67.86% | 38,280 | 100.00% |
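As a reading aid for the table above, the sketch below recomputes each per-class F1 score as the harmonic mean of precision and recall, and the macro average as the unweighted mean over the six classes. The numbers are copied from the table; the variable names are illustrative only.

```python
# (precision, recall, reported f1) in percent, copied from the table above
scores = {
    "PER": (92.46, 93.71, 93.08),
    "FAC": (70.63, 70.94, 70.78),
    "TIME": (58.66, 57.75, 58.20),
    "GPE": (77.64, 77.37, 77.50),
    "LOC": (62.96, 45.71, 52.97),
    "VEH": (63.43, 47.95, 54.61),
}


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


for tag, (p, r, reported) in scores.items():
    print(f"{tag:<5} recomputed={f1(p, r):.2f} reported={reported:.2f}")

# Macro average: every class counts equally, whatever its support.
macro = [round(sum(col) / len(scores), 2) for col in zip(*scores.values())]
print(macro)  # [precision, recall, f1]
```

Because the macro average ignores support, the rarer classes (LOC, VEH, TIME) pull it well below the micro average, which is dominated by PER.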
TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
- Tagging scheme: BIOES
- Nested entity levels: [0, 1]
- Split strategy: Leave-one-out cross-validation (28 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16
- Initial learning rate: 0.00014
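The split strategy above can be sketched as follows: each of the 28 files is held out once for evaluation, and the remaining files are divided 0.85 / 0.15 into training and validation data. This is a minimal sketch, not the project's actual code; in particular, splitting at the sample level after shuffling, the file names, and the helper `loocv_folds` are assumptions.

```python
import random


def loocv_folds(files, samples_of, val_ratio=0.15, seed=0):
    """Yield (held_out_file, train_samples, val_samples), one fold per file."""
    rng = random.Random(seed)
    for held_out in files:
        # Pool the samples of every file except the held-out one.
        pool = [s for f in files if f != held_out for s in samples_of(f)]
        rng.shuffle(pool)
        n_val = round(len(pool) * val_ratio)
        yield held_out, pool[n_val:], pool[:n_val]


# Hypothetical corpus: 28 files with 20 samples each.
files = [f"doc_{i:02d}" for i in range(28)]


def samples_of(name):
    return [f"{name}/sample_{j}" for j in range(20)]


folds = list(loocv_folds(files, samples_of))
held_out, train, val = folds[0]
print(len(folds), held_out, len(train), len(val))
```

With this setup, the held-out file never contributes to training or validation, so the scores reported for it reflect text the model has not seen.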
MODEL ARCHITECTURE:
Model Input: Maximum context camembert-large embeddings (1024 dimensions)
Locked Dropout: 0.5
Projection layer:
- layer type: highway layer
- input: 1024 dimensions
- output: 2048 dimensions
BiLSTM layer:
- input: 2048 dimensions
- output: 256 dimensions (hidden state)
Linear layer:
- input: 256 dimensions
- output: 25 dimensions (predicted labels with BIOES tagging scheme)
CRF layer
Model Output: BIOES labels sequence
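The 25 output dimensions are consistent with six entity types times four positional prefixes (B, I, E, S) plus the O label. The sketch below builds such a label set and decodes a BIOES sequence back into spans; the label ordering and the helper `bioes_to_spans` are assumptions for illustration, not the model's actual label map or decoder.

```python
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in "BIES"]
print(len(LABELS))  # 25


def bioes_to_spans(tags):
    """Decode a BIOES tag sequence into (start, end_inclusive, label) spans."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, label))
            start, current = None, None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "I" and label == current:
            continue  # still inside the open span
        elif prefix == "E" and label == current and start is not None:
            spans.append((start, i, label))
            start, current = None, None
        else:
            # "O" or an ill-formed transition: drop any open span
            start, current = None, None
    return spans


print(bioes_to_spans(["S-PER", "O", "B-FAC", "E-FAC"]))
```

A CRF layer on top of the linear scores typically helps avoid ill-formed transitions such as `B-PER` followed by `E-FAC`, which this decoder simply discards.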
HOW TO USE:
*** UNDER CONSTRUCTION ***
TRAINING CORPUS:
# | Document | Tokens Count | Is included in model eval |
---|---|---|---|
0 | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True |
1 | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True |
2 | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True |
3 | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,034 tokens | True |
4 | 1841_Sand-George_Pauline | 12,398 tokens | True |
5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
6 | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | True |
7 | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | True |
8 | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | True |
9 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | True |
10 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | True |
11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | True |
14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | True |
15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | True |
16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | True |
18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | True |
19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | True |
20 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | True |
21 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | True |
22 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
23 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | True |
24 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | True |
25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | True |
26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
27 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | True |
28 | TOTAL | 275,489 tokens | 28 files used for cross-validation |
PREDICTIONS CONFUSION MATRIX:
Gold Labels | PER | FAC | TIME | GPE | LOC | VEH | O | support |
---|---|---|---|---|---|---|---|---|
PER | 30,177 | 28 | 14 | 7 | 7 | 31 | 1,940 | 32,204 |
FAC | 42 | 1,628 | 1 | 22 | 17 | 1 | 584 | 2,295 |
TIME | 8 | 1 | 965 | 1 | 1 | 0 | 695 | 1,671 |
GPE | 13 | 31 | 2 | 670 | 31 | 0 | 119 | 866 |
LOC | 8 | 64 | 1 | 56 | 357 | 0 | 295 | 781 |
VEH | 54 | 8 | 0 | 0 | 0 | 222 | 179 | 463 |
O | 2,285 | 524 | 661 | 100 | 150 | 96 | 0 | 3,816 |
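To show how the matrix above relates to the recall figures in the performance table, the sketch below divides each diagonal cell by its row total (the gold support). The counts are copied from the matrix; the variable names are illustrative.

```python
labels = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH"]

# Rows are gold labels; columns are predicted PER, FAC, TIME, GPE, LOC, VEH, O.
matrix = {
    "PER": [30177, 28, 14, 7, 7, 31, 1940],
    "FAC": [42, 1628, 1, 22, 17, 1, 584],
    "TIME": [8, 1, 965, 1, 1, 0, 695],
    "GPE": [13, 31, 2, 670, 31, 0, 119],
    "LOC": [8, 64, 1, 56, 357, 0, 295],
    "VEH": [54, 8, 0, 0, 0, 222, 179],
}

# Per-class recall: correctly predicted mentions / gold mentions of that class.
recall = {
    gold: round(100 * row[labels.index(gold)] / sum(row), 2)
    for gold, row in matrix.items()
}
print(recall)

# Micro recall: all correct predictions / all gold mentions.
correct = sum(matrix[g][labels.index(g)] for g in labels)
total = sum(sum(row) for row in matrix.values())
micro_recall = round(100 * correct / total, 2)
print(correct, total, micro_recall)
```

Read this way, the O column shows that most recall errors are missed mentions rather than confusions between entity types, with FAC and GPE absorbing most of the misclassified LOC mentions.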
CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com