metadata
language: fr
tags:
  - NER
  - camembert
  - literary-texts
  - nested-entities
  - BookNLP-fr
license: apache-2.0
metrics:
  - f1
  - precision
  - recall
base_model:
  - almanach/camembert-large
pipeline_tag: token-classification

INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a NER model built on top of camembert-large embeddings and trained to predict nested entities in French literary texts.

The predicted entities are:

  • mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  • facilities (FAC): château, sentier, chambre, couloir, ...
  • time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  • geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  • locations (LOC): le sud, Mars, l'océan, le bois, ...
  • vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

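| NER_tag   | precision | recall | f1_score | support | support % |
|-----------|-----------|--------|----------|---------|-----------|
| PER       | 92.46%    | 93.71% | 93.08%   | 32,204  | 84.13%    |
| FAC       | 70.63%    | 70.94% | 70.78%   | 2,295   | 6.00%     |
| TIME      | 58.66%    | 57.75% | 58.20%   | 1,671   | 4.37%     |
| GPE       | 77.64%    | 77.37% | 77.50%   | 866     | 2.26%     |
| LOC       | 62.96%    | 45.71% | 52.97%   | 781     | 2.04%     |
| VEH       | 63.43%    | 47.95% | 54.61%   | 463     | 1.21%     |
| micro_avg | 88.39%    | 88.87% | 88.58%   | 38,280  | 100.00%   |
| macro_avg | 70.96%    | 65.57% | 67.86%   | 38,280  | 100.00%   |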
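
As a sanity check on how the averages relate to the per-class rows, the short sketch below recomputes the macro_avg row as the unweighted mean of the six per-class scores, and the PER share of the total support. The numbers are copied from the table above; the variable names are illustrative only and are not part of the model's code.

```python
from statistics import mean

# Per-class (precision %, recall %, f1 %, support), copied from the table above.
scores = {
    "PER": (92.46, 93.71, 93.08, 32204),
    "FAC": (70.63, 70.94, 70.78, 2295),
    "TIME": (58.66, 57.75, 58.20, 1671),
    "GPE": (77.64, 77.37, 77.50, 866),
    "LOC": (62.96, 45.71, 52.97, 781),
    "VEH": (63.43, 47.95, 54.61, 463),
}

# Macro average: every class counts equally, whatever its support.
macro_precision = round(mean(row[0] for row in scores.values()), 2)
macro_recall = round(mean(row[1] for row in scores.values()), 2)
macro_f1 = round(mean(row[2] for row in scores.values()), 2)

# Share of all gold mentions that belong to the PER class.
total_support = sum(row[3] for row in scores.values())
per_share = round(100 * scores["PER"][3] / total_support, 2)

print(macro_precision, macro_recall, macro_f1)  # 70.96 65.57 67.86
print(total_support, per_share)                 # 38280 84.13
```

The macro F1 reproduced here is the mean of the per-class F1 scores. Because PER accounts for roughly 84% of the mentions, the micro average sits close to the PER scores, while the macro average is pulled down by the rarer classes.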

TRAINING PARAMETERS:

  • Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
  • Tagging scheme: BIOES
  • Nested entity levels: [0, 1]
  • Split strategy: Leave-one-out cross-validation (28 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16
  • Initial learning rate: 0.00014
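
To illustrate how six entity types and the BIOES scheme can lead to the 25 output labels listed in the architecture section, here is a minimal sketch. The label strings (for example "B-PER") and their ordering are assumptions made for illustration; the card does not state the exact label vocabulary used by the model.

```python
# Entity types as listed in the training parameters.
entity_types = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]

# BIOES: Begin, Inside, End, Single, plus one shared Outside label.
prefixes = ["B", "I", "E", "S"]

# Assumed naming convention: "<prefix>-<type>", with "O" for non-entity tokens.
labels = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in prefixes]

print(len(labels))  # 25 = 6 types x 4 prefixes + "O"
```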

MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

  • Locked Dropout: 0.5

  • Projection layer:

    • layer type: highway layer
    • input: 1024 dimensions
    • output: 2048 dimensions
  • BiLSTM layer:

    • input: 2048 dimensions
    • output: 256 dimensions (hidden state)
  • Linear layer:

    • input: 256 dimensions
    • output: 25 dimensions (predicted labels with BIOES tagging scheme)
  • CRF layer

Model Output: BIOES labels sequence
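
Since the model outputs a BIOES label sequence, downstream code typically needs to turn it into entity spans. The following is a minimal sketch of such a decoder for a single flat sequence. The function name `bioes_to_spans`, the "B-PER" style label strings, and the choice to silently drop malformed spans are all assumptions; this is not the project's own decoding code, and it does not address how the two nesting levels are combined.

```python
def bioes_to_spans(labels):
    """Convert a BIOES label sequence into (type, start, end) spans, end inclusive."""
    spans = []
    start, current = None, None  # state of the span currently being built
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, current = i, etype
        elif prefix == "I" and current == etype:
            # Continue the open entity.
            continue
        elif prefix == "E" and current == etype:
            # Close the open entity.
            spans.append((etype, start, i))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: discard any open span.
            start, current = None, None
    return spans


tokens = ["Il", "quitta", "la", "chambre", "ce", "matin"]
labels = ["S-PER", "O", "B-FAC", "E-FAC", "B-TIME", "E-TIME"]
spans = bioes_to_spans(labels)
print(spans)  # [('PER', 0, 0), ('FAC', 2, 3), ('TIME', 4, 5)]
```

Each span can then be mapped back to text with `tokens[start:end + 1]`.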

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

|    | Document | Tokens Count | Is included in model eval |
|----|----------|--------------|---------------------------|
| 0  | 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote | 24,776 tokens | True |
| 1  | 1830_Balzac-Honoré-de_Sarrasine | 15,408 tokens | True |
| 2  | 1836_Gautier-Théophile_La-morte-amoureuse | 14,293 tokens | True |
| 3  | 1837_Balzac-Honoré-de_La-maison-Nucingen | 30,034 tokens | True |
| 4  | 1841_Sand-George_Pauline | 12,398 tokens | True |
| 5  | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
| 6  | 1863_Gautier-Théophile_Le-capitaine-Fracasse | 11,848 tokens | True |
| 7  | 1873_Zola-Émile_Le-ventre-de-Paris | 12,613 tokens | True |
| 8  | 1881_Flaubert-Gustave_Bouvard-et-Pécuchet | 12,308 tokens | True |
| 9  | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche | 2,267 tokens | True |
| 10 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique | 2,041 tokens | True |
| 11 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille | 2,949 tokens | True |
| 12 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste | 2,578 tokens | True |
| 13 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca | 4,078 tokens | True |
| 14 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval | 2,878 tokens | True |
| 15 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou | 1,905 tokens | True |
| 16 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi | 5,439 tokens | True |
| 17 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil | 2,159 tokens | True |
| 18 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon | 2,364 tokens | True |
| 19 | 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse | 2,469 tokens | True |
| 20 | 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,775 tokens | True |
| 21 | 1903_Conan-Laure_Élisabeth-Seton | 13,046 tokens | True |
| 22 | 1904-1912_Rolland-Romain_Jean-Christophe(1) | 10,982 tokens | True |
| 23 | 1904-1912_Rolland-Romain_Jean-Christophe(2) | 10,305 tokens | True |
| 24 | 1917_Bourgeois-Adèle_Némoville | 12,468 tokens | True |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,850 tokens | True |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 12,144 tokens | True |
| 27 | 1937_Audoux-Marguerite_Douce-Lumière | 12,346 tokens | True |
| 28 | TOTAL | 275,489 tokens | 28 files used for cross-validation |
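
The leave-one-out strategy over these 28 files can be sketched as follows: each file is held out once for evaluation while the model is trained on the rest. How the 0.85 / 0.15 train/validation split is applied is not specified in this card; the sketch assumes it is drawn at document level among the 27 remaining files, which may differ from the actual setup. The function name `loocv_folds`, the seed, and the placeholder document names are hypothetical.

```python
import random


def loocv_folds(documents, train_ratio=0.85, seed=0):
    """Yield one fold per document: that document is the test set,
    the others are split into train/validation (assumed document-level split)."""
    rng = random.Random(seed)
    for held_out in documents:
        rest = [doc for doc in documents if doc != held_out]
        rng.shuffle(rest)
        cut = round(len(rest) * train_ratio)
        yield {"test": held_out, "train": rest[:cut], "validation": rest[cut:]}


# Placeholder names standing in for the 28 corpus files listed above.
documents = [f"doc_{i:02d}" for i in range(28)]
folds = list(loocv_folds(documents))

print(len(folds))                                            # 28
print(len(folds[0]["train"]), len(folds[0]["validation"]))   # 23 4
```

Under this scheme every file contributes to the evaluation exactly once, which is consistent with all 28 files being marked as included in the model evaluation.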

PREDICTIONS CONFUSION MATRIX:

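| Gold Labels | PER    | FAC   | TIME | GPE | LOC | VEH | O     | support |
|-------------|--------|-------|------|-----|-----|-----|-------|---------|
| PER         | 30,177 | 28    | 14   | 7   | 7   | 31  | 1,940 | 32,204  |
| FAC         | 42     | 1,628 | 1    | 22  | 17  | 1   | 584   | 2,295   |
| TIME        | 8      | 1     | 965  | 1   | 1   | 0   | 695   | 1,671   |
| GPE         | 13     | 31    | 2    | 670 | 31  | 0   | 119   | 866     |
| LOC         | 8      | 64    | 1    | 56  | 357 | 0   | 295   | 781     |
| VEH         | 54     | 8     | 0    | 0   | 0   | 222 | 179   | 463     |
| O           | 2,285  | 524   | 661  | 100 | 150 | 96  | 0     | 3,816   |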
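
Each row of the matrix sums to the gold support of its class, so per-class recall can be read off as the diagonal cell divided by the row total. The sketch below does this for the six entity rows and also derives the micro-averaged recall. It assumes rows are gold labels and columns are predicted labels, in the order of the header; it covers recall only and the names used are illustrative.

```python
# Columns: predicted PER, FAC, TIME, GPE, LOC, VEH, O (same order as the header).
classes = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH"]
confusion = {
    "PER": [30177, 28, 14, 7, 7, 31, 1940],
    "FAC": [42, 1628, 1, 22, 17, 1, 584],
    "TIME": [8, 1, 965, 1, 1, 0, 695],
    "GPE": [13, 31, 2, 670, 31, 0, 119],
    "LOC": [8, 64, 1, 56, 357, 0, 295],
    "VEH": [54, 8, 0, 0, 0, 222, 179],
}

# Recall = correctly predicted mentions / all gold mentions of that class.
recall = {
    gold: round(100 * confusion[gold][i] / sum(confusion[gold]), 2)
    for i, gold in enumerate(classes)
}

# Micro recall pools all classes before dividing.
correct = sum(confusion[gold][i] for i, gold in enumerate(classes))
total_gold = sum(sum(row) for row in confusion.values())
micro_recall = round(100 * correct / total_gold, 2)

print(recall)        # {'PER': 93.71, 'FAC': 70.94, 'TIME': 57.75, 'GPE': 77.37, 'LOC': 45.71, 'VEH': 47.95}
print(micro_recall)  # 88.87
```

The O column shows where most recall is lost: for TIME, LOC and VEH, missed mentions (predicted as O) far outnumber confusions with other entity types.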

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com