Update README.md
README.md
CHANGED
@@ -1,3 +1,54 @@
|
|
1 |
-
---
|
2 |
-
license: cc-by-sa-4.0
|
3 |
-
---
|
1 |
+
---
|
2 |
+
license: cc-by-sa-4.0
|
3 |
+
---
|
4 |
+
|
5 |
+
# Greek-Lesbian Morphosyntactic Model (Stanza + Greek BERT)
|
6 |
+
|
7 |
+
This repository hosts a morphosyntactic model for the **Lesbian dialect of Greek** (spoken on the island of Lesbos), trained with [Stanza](https://stanfordnlp.github.io/stanza/) using [Greek BERT](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1) as the pretrained language model. The model was trained and evaluated on a small, curated treebank of 540 sentences (500 for training, 10 for development, 30 for testing).
|
8 |
+
|
9 |
+
The model supports part-of-speech tagging, morphological analysis, and dependency parsing for dialectal Greek. It is part of a broader effort to document and process regional language varieties.
|
10 |
+
|
11 |
+
## 📚 Dataset
|
12 |
+
|
13 |
+
The treebank is a manually annotated resource compiled from both **oral** and **written** sources. Oral data were collected between 2023 and 2024 from speakers in various villages of Lesbos:
|
14 |
+
|
15 |
+
* **Agra** (Male speaker)
|
16 |
+
* **Chidira** (Female speaker)
|
17 |
+
* **Eressos** (Male speaker)
|
18 |
+
* **Pterounta** (Female speaker)
|
19 |
+
* **Mesotopos** (Male speaker)
|
20 |
+
* **Parakoila** (Female speaker)
|
21 |
+
|
22 |
+
Written sources include:
|
23 |
+
|
24 |
+
* Papanis, D. & Papanis, G. D. (2004). *Lexiko tou Agiasotikou Glosikou Idiomatos*
|
25 |
+
* Tsokarou-Mitsioni, E. (1998). *Palies Istories ap' tn Agiasiou*
|
26 |
+
* Tsokarou-Mitsioni, E. (2019). *Prosfygiá*
|
27 |
+
* Anagnostopoulou, M. A. (2021). *Thematiko Lexiko tis Lesviakis Dialektou*
|
28 |
+
* Anagnostou, V. T. (2014). *Tsi sta th'ka mas: Komodia sta k'stariot'ka*
|
29 |
+
|
30 |
+
The full treebank is openly available here:
|
31 |
+
🔗 [UD\_Greek-Lesbian on GitHub](https://github.com/UniversalDependencies/UD_Greek-Lesbian)
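
Universal Dependencies treebanks are distributed in the CoNLL-U format (ten tab-separated columns per token, blank lines between sentences). The sketch below shows one way to read such a file with plain Python. The function name `read_conllu` and the in-memory sample are illustrative assumptions, not part of this repository, and the sample tokens are placeholders rather than real treebank data.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry metadata such as sent_id and text.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # skip malformed lines in this sketch
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


if __name__ == "__main__":
    # Placeholder sample; replace with open("<path to a .conllu file>", encoding="utf-8").
    sample = (
        "# sent_id = placeholder-1\n"
        "1\tw1\tl1\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
        "2\tw2\tl2\tVERB\t_\t_\t0\troot\t_\t_\n"
        "\n"
    )
    for sent in read_conllu(sample.splitlines(keepends=True)):
        for tok in sent:
            print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

To read the actual treebank, pass an open file handle for one of the `.conllu` files from the GitHub repository instead of the placeholder sample.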
|
32 |
+
|
33 |
+
## 🧠 Model Architecture
|
34 |
+
|
35 |
+
* **Base pipeline**: Stanza (v1.7.0+)
|
36 |
+
* **Pretrained LM**: [Greek BERT](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)
|
37 |
+
* **Tasks**: Tokenization, Lemmatization, POS tagging, Morphological features, Dependency parsing
|
38 |
+
* **Fine-tuning**: Conducted on the UD\_Greek-Lesbian treebank
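
As a rough illustration of how the components listed above could be loaded, the sketch below builds a Stanza pipeline from locally stored model files. This is a sketch under stated assumptions, not a documented interface of this repository: the directory layout and the file names (`tokenizer.pt`, `tagger.pt`, `lemmatizer.pt`, `parser.pt`) are hypothetical, so check the repository's file list for the real names. Depending on how the models were saved, extra arguments such as `pos_pretrain_path` and `depparse_pretrain_path` may also be required, and the `transformers` package is needed for BERT-based models.

```python
from pathlib import Path


def pipeline_kwargs(model_dir: str) -> dict:
    """Build keyword arguments for stanza.Pipeline.

    The file names below are hypothetical placeholders.
    """
    d = Path(model_dir)
    return {
        "lang": "el",
        "processors": "tokenize,pos,lemma,depparse",
        "tokenize_model_path": str(d / "tokenizer.pt"),
        "pos_model_path": str(d / "tagger.pt"),
        "lemma_model_path": str(d / "lemmatizer.pt"),
        "depparse_model_path": str(d / "parser.pt"),
        # Use only the local files; do not fetch the default Greek models.
        "download_method": None,
    }


def load_pipeline(model_dir: str):
    """Create the pipeline; stanza is imported lazily so the helper above
    can be used without it installed."""
    import stanza

    return stanza.Pipeline(**pipeline_kwargs(model_dir))


def print_analysis(nlp, text: str) -> None:
    """Print one line per word: form, lemma, UPOS, features, head, relation."""
    doc = nlp(text)
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos, word.feats,
                  word.head, word.deprel, sep="\t")
```

A typical call would be `print_analysis(load_pipeline("path/to/models"), text)`, where `path/to/models` is a local copy of this repository's model files.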
|
39 |
+
|
40 |
+
## 📈 Performance
|
41 |
+
|
42 |
+
Because the training set is small (500 sentences), the model should be considered **experimental**. It is intended for research use and performs best on dialectal content similar to the training sources. Production use would require larger datasets and further fine-tuning.
|
43 |
+
|
44 |
+
## 📄 Citation
|
45 |
+
|
46 |
+
If you use this model or the accompanying treebank, please cite:
|
47 |
+
|
48 |
+
> Bompolas, S., Markantonatou, S., Ralli, A., & Anastasopoulos, A. (2025). *Crossing Dialectal Boundaries: Building a Treebank for the Dialect of Lesbos through Knowledge Transfer from Standard Modern Greek*. In Proceedings of the 8th Universal Dependencies Workshop (UDW, SyntaxFest 2025). Association for Computational Linguistics.
|
49 |
+
|
50 |
+
## 🔗 Related Resources
|
51 |
+
|
52 |
+
* [Universal Dependencies (UD)](https://universaldependencies.org/)
|
53 |
+
* [Stanza Documentation](https://stanfordnlp.github.io/stanza/)
|
54 |
+
* [Greek BERT on Hugging Face](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)
|