impresso-project/language-identifier

Token Classification

language-identification

emanuelaboros committed on Apr 16

Commit eeffb50 · 1 Parent(s): 4262db7

update readme

Files changed (1)

README.md +65 -8

README.md CHANGED

@@ -1,19 +1,37 @@
 ---
 license: agpl-3.0
 ---
-## impresso-langident
-Detects the language for impresso-like historical newspaper data in the languages:
-German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb).
-## How to install
-```bash
-pip install transformers floret
-```
-## How to run:
 ```python
 from transformers import pipeline
@@ -34,3 +52,42 @@ face à une opportunité."""
 langs = lang_pipeline(text)
 print(langs)
 ```

 ---
+library_name: transformers
+language:
+- fr
+- de
+- en
+- it
+- lb
 license: agpl-3.0
+tags:
+- language-identification
+- multilingual
+- historical
+- impresso
 ---
+# Model Card for impresso-project/language-identifier
+## Overview
+`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)** — the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.
+This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.
+## Model Details
+- **Model type:** Language identification
+- **Interface:** Hugging Face `transformers` pipeline
+- **Languages supported:** fr, de, en, it, lb
+- **License:** AGPL-3.0
+- **Developed by:** UZH, Switzerland
+- **Training data:** Historical newspapers from the impresso corpus and related sources
+## How to Use
 ```python
 from transformers import pipeline
 langs = lang_pipeline(text)
 print(langs)
 ```
+## Output Format
+The output is a single dictionary with the predicted language and confidence score:
+```python
+{
+  "language": "fr",
+  "score": 1.0
+}
+```
+## Use Cases
+- Preprocessing for OCR and NLP tasks on historical corpora
+- Document and segment-level language tagging
+- Filtering and sorting multilingual newspaper archives
+## Limitations
+- Works best on **sentence- or paragraph-length** texts
+- May struggle with code-switching or OCR-degraded text that mixes languages
+- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
+## Installation
+```bash
+pip install transformers floret
+```
+## Contact
+- Website: [https://impresso-project.ch](https://impresso-project.ch)
+<p align="center">
+  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+</p>
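
Reviewer note (not part of the commit): the new "Output Format" and "Use Cases" sections suggest a simple way to sort segments of a multilingual archive. The sketch below is one possible approach, assuming each prediction is a dictionary with `language` and `score` keys as the updated README states. The names `group_by_language`, `stub_identify`, the `min_score` threshold, and the `und` bucket are hypothetical and do not come from the repository. The stub stands in for the real pipeline call, whose construction is elided in this diff.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Language codes listed in the updated README front matter.
SUPPORTED_LANGUAGES = {"de", "fr", "it", "en", "lb"}
# Hypothetical bucket name for low-confidence or unsupported predictions.
UNDETERMINED = "und"


def group_by_language(
    segments: Iterable[str],
    identify: Callable[[str], Dict[str, object]],
    min_score: float = 0.9,  # assumed threshold, tune per corpus
) -> Dict[str, List[str]]:
    """Group text segments by predicted language.

    `identify` is any callable returning {"language": str, "score": float},
    the shape described in the README's "Output Format" section.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for segment in segments:
        prediction = identify(segment)
        language = prediction["language"]
        confident = float(prediction["score"]) >= min_score
        if language in SUPPORTED_LANGUAGES and confident:
            groups[language].append(segment)
        else:
            # Keep uncertain segments aside for manual review.
            groups[UNDETERMINED].append(segment)
    return dict(groups)


# Stand-in for the real model: canned predictions, for illustration only.
CANNED_PREDICTIONS = {
    "Le journal paraît chaque matin.": {"language": "fr", "score": 0.99},
    "Die Zeitung erscheint jeden Morgen.": {"language": "de", "score": 0.97},
    "D1e Ze1tung ersch. le matin": {"language": "de", "score": 0.41},
}


def stub_identify(text: str) -> Dict[str, object]:
    return CANNED_PREDICTIONS[text]


grouped = group_by_language(CANNED_PREDICTIONS.keys(), stub_identify)
print(grouped)
```

In practice `stub_identify` would be replaced by a thin wrapper around the pipeline call from the README's "How to Use" section. Routing low-confidence segments to a separate bucket reflects the "Limitations" section, which warns about code-switching and OCR-degraded text.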