---
license: cc-by-sa-4.0
language:
- cro
tags:
- word spelling error annotator
---

# BERTic-Incorrect-Spelling-Annotator

This BERTic model annotates incorrectly spelled words in text. It uses the following labels:

- 0: the word is spelled correctly,
- 1: the word is spelled incorrectly.

## Model Output Example

Consider the following Croatian text:

_Model u tekstu prepoznije riječi u kojima se nalazaju pogreške ._

Converted to the input format the BERTic model accepts, with the subword pieces of each word followed by a `[MASK]` token, the text becomes:

_[CLS] model [MASK] u [MASK] tekstu [MASK] prepo ##znije [MASK] riječi [MASK] u [MASK] kojima [MASK] se [MASK] nalaza ##ju [MASK] pogreške [MASK] . [MASK] [SEP]_
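
The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: `build_input` and `toy_tokenize` are hypothetical names, the subword splits are hardcoded to mimic the example, and the lowercasing is an assumption based on the lowercased tokens shown. A real pipeline would use the BERTic tokenizer instead.

```python
from typing import Callable, List

# Hypothetical subword splits, hardcoded to mimic the example above.
# A real pipeline would call the BERTic tokenizer here instead.
TOY_SPLITS = {
    "prepoznije": ["prepo", "##znije"],
    "nalazaju": ["nalaza", "##ju"],
}


def toy_tokenize(word: str) -> List[str]:
    """Lowercase a word and split it into (toy) subword pieces."""
    lowered = word.lower()
    return TOY_SPLITS.get(lowered, [lowered])


def build_input(words: List[str], tokenize: Callable[[str], List[str]]) -> List[str]:
    """Put a [MASK] after the pieces of every word, wrapped in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(tokenize(word))
        tokens.append("[MASK]")
    tokens.append("[SEP]")
    return tokens


text = "Model u tekstu prepoznije riječi u kojima se nalazaju pogreške ."
formatted = " ".join(build_input(text.split(), toy_tokenize))
print(formatted)
```

The sequence contains one `[MASK]` per whitespace-separated word, so a word split into several pieces still receives a single label slot.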

The model might return the following predictions (note: these predictions were chosen for demonstration and explanation, not for reproducibility):

_Model 0 u 0 tekstu 0 prepoznije 1 riječi 0 u 0 kojima 0 se 0 nalazaju 1 pogreške 0 . 0_

In the input sentence, the words `prepoznije` and `nalazaju` are spelled incorrectly, so the model marks them with the label 1.
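
Reading the predictions back can be sketched as follows. The label list is copied from the illustrative output above rather than produced by the model, and `annotate` and `misspelled` are hypothetical helper names used only for this example.

```python
from typing import List, Tuple


def annotate(words: List[str], labels: List[int]) -> List[Tuple[str, int]]:
    """Pair each word with its predicted label (0 = correct, 1 = incorrect)."""
    if len(words) != len(labels):
        raise ValueError("expected exactly one label per word")
    return list(zip(words, labels))


def misspelled(pairs: List[Tuple[str, int]]) -> List[str]:
    """Return the words flagged with label 1."""
    return [word for word, label in pairs if label == 1]


words = "Model u tekstu prepoznije riječi u kojima se nalazaju pogreške .".split()
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]  # illustrative, copied from the example

pairs = annotate(words, labels)
annotated = " ".join(f"{word} {label}" for word, label in pairs)
print(annotated)
print(misspelled(pairs))
```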

## More details

Testing the model on **generated** test sets gives the following results:

- Precision: 0.9954
- Recall: 0.8764
- F1 Score: 0.9321
- F0.5 Score: 0.9691

Testing the model on test sets constructed from the **Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0** dataset gives the following results:

- Precision: 0.8213
- Recall: 0.3921
- F1 Score: 0.5308
- F0.5 Score: 0.6738
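
The F1 and F0.5 scores above are consistent with the standard F-beta definition, `F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)`, applied to the reported precision `P` and recall `R`. With `beta = 0.5`, precision is weighted more heavily than recall. The sketch below recomputes the scores as a sanity check. It assumes the standard definition was used, and `f_beta` is a hypothetical helper, not part of the model's code.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall as reported above (rounded to four decimals).
reported = {
    "generated": (0.9954, 0.8764),
    "RAPUT 1.0": (0.8213, 0.3921),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1={f_beta(p, r):.4f}, F0.5={f_beta(p, r, beta=0.5):.4f}")
```

Because the inputs are themselves rounded, the recomputed values should be expected to match the reported scores only up to rounding in the last decimal place.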

## Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

## Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing this model.