emanuelaboros committed on
Commit
eeffb50
·
1 Parent(s): 4262db7

update readme

  1. README.md +65 -8
README.md CHANGED
@@ -1,19 +1,37 @@
  ---
+ library_name: transformers
+ language:
+ - fr
+ - de
+ - en
+ - it
+ - lb
  license: agpl-3.0
+ tags:
+ - language-identification
+ - multilingual
+ - historical
+ - impresso
  ---

- ## impresso-langident
+ # Model Card for impresso-project/language-identifier

- Detects the language for impresso-like historical newspaper data in the languages:
- German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb).
+ ## Overview

- ## How to install
+ `impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.

- ```bash
- pip install transformers floret
- ```
+ This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.

- ## How to run:
+ ## Model Details
+
+ - **Model type:** Language identification
+ - **Interface:** Hugging Face `transformers` pipeline
+ - **Languages supported:** fr, de, en, it, lb
+ - **License:** AGPL-3.0
+ - **Developed by:** UZH, Switzerland
+ - **Training data:** Historical newspapers from the impresso corpus and related sources
+
+ ## How to Use

  ```python
  from transformers import pipeline
@@ -34,3 +52,42 @@ face à une opportunité."""
  langs = lang_pipeline(text)
  print(langs)
  ```
+
+ ## Output Format
+
+ The output is a single dictionary with the predicted language and confidence score:
+
+ ```python
+ {
+ "language": "fr",
+ "score": 1.0
+ }
+ ```
+
+
+ ## Use Cases
+
+ - Preprocessing for OCR and NLP tasks on historical corpora
+ - Document and segment-level language tagging
+ - Filtering and sorting multilingual newspaper archives
+
+ ## Limitations
+
+ - Works best on **sentence- or paragraph-length** texts
+ - May struggle with code-switching or OCR-degraded text that mixes languages
+ - Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
+
+ ## Installation
+
+ ```bash
+ pip install transformers floret
+ ```
+
+ ## Contact
+
+ - Website: [https://impresso-project.ch](https://impresso-project.ch)
+
+ <p align="center">
+ <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+ </p>
+
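
The Output Format section added in this commit documents a dictionary with `language` and `score` keys, and the Use Cases section mentions filtering and sorting multilingual archives. The following is a minimal sketch of how such dictionaries might be consumed downstream, assuming predictions arrive in exactly that shape. The helper names (`accept`, `group_by_language`), the `MIN_SCORE` threshold, the `"unknown"` bucket, and the sample texts and scores are all hypothetical and are not part of the README or the model's API; the predictions are written by hand rather than produced by the model.

```python
from collections import defaultdict

# Hypothetical confidence threshold; it is not taken from the README
# and would need tuning on real data.
MIN_SCORE = 0.8


def accept(prediction, min_score=MIN_SCORE):
    """Return the predicted language code if the score is high enough, else None."""
    if prediction["score"] >= min_score:
        return prediction["language"]
    return None


def group_by_language(texts, predictions, min_score=MIN_SCORE):
    """Bucket texts by predicted language; low-confidence items go to 'unknown'."""
    groups = defaultdict(list)
    for text, prediction in zip(texts, predictions):
        language = accept(prediction, min_score)
        key = language if language is not None else "unknown"
        groups[key].append(text)
    return dict(groups)


# Illustrative inputs only: made-up snippets paired with hand-written
# dictionaries in the documented {"language", "score"} shape.
texts = [
    "Le gouvernement a annoncé de nouvelles mesures.",
    "Dcr Bundcsrath hat bcschlosscn",
    "D'Regierung huet eng nei Moossnam ugekënnegt.",
]
predictions = [
    {"language": "fr", "score": 1.0},
    {"language": "de", "score": 0.55},
    {"language": "lb", "score": 0.93},
]

grouped = group_by_language(texts, predictions)
for language, items in grouped.items():
    print(language, len(items))
```

Routing low-confidence items to a separate bucket, rather than trusting every label, is one way to account for the limitation the README states for very short or OCR-degraded inputs.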