---
language: ru
license: apache-2.0
library_name: transformers
tags:
- russian
- morpheme-segmentation
- token-classification
- morphbert
- bert
- ru
- russ
pipeline_tag: token-classification
new_version: CrabInHoney/morphbert-tiny-v2-morpheme-segmentation-ru
---

# MorphBERT-Large: Russian Morpheme Segmentation

This repository contains the `CrabInHoney/morphbert-large-morpheme-segmentation-ru` model, a large transformer-based system trained for morpheme segmentation of Russian words. The model classifies each character of a given word into one of 25 morpheme categories: ['END', 'END1', 'HYPH', 'HYPH1', 'LINK', 'LINK1', 'LINK2', 'LINK3', 'POSTFIX', 'PREF', 'PREF1', 'PREF2', 'ROOT', 'ROOT1', 'ROOT2', 'ROOT3', 'ROOT4', 'ROOT5', 'SUFF', 'SUFF1', 'SUFF2', 'SUFF3', 'SUFF4', 'SUFF5', 'SUFF6'].

## Model Description

`morphbert-large-morpheme-segmentation-ru` is a transformer model for character-level morphological analysis. Thanks to its larger size, it identifies the constituent morphemes of Russian words more accurately than the tiny version (CrabInHoney/morphbert-tiny-morpheme-segmentation-ru).

The model was trained from scratch; its architecture is comparable in complexity to BERT-base.

**Key Features:**

*   **Task:** Morpheme Segmentation (Token Classification at Character Level)
*   **Language:** Russian (ru)
*   **Architecture:** Transformer (BERT-base-like)
*   **Labels:** ['END', 'END1', 'HYPH', 'HYPH1', 'LINK', 'LINK1', 'LINK2', 'LINK3', 'POSTFIX', 'PREF', 'PREF1', 'PREF2', 'ROOT', 'ROOT1', 'ROOT2', 'ROOT3', 'ROOT4', 'ROOT5', 'SUFF', 'SUFF1', 'SUFF2', 'SUFF3', 'SUFF4', 'SUFF5', 'SUFF6']
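
The numbered label variants (`ROOT1`, `SUFF1`, and so on) distinguish repeated morphemes of the same type within a single word, as in the example predictions below. A small illustration of the per-character labeling, using the `сине-белый` prediction from this card:

```python
# Per-character labels for "сине-белый", matching the example
# predictions below; ROOT1 marks the second root of the compound.
word = "сине-белый"
labels = (["ROOT"] * 3 +    # син
          ["LINK"] +        # е
          ["HYPH"] +        # -
          ["ROOT1"] * 3 +   # бел
          ["END"] * 2)      # ый
assert len(labels) == len(word)
```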

**Model Size & Specifications:**

*   **Parameters:** ~85.5 Million
*   **Tensor Type:** F32
*   **Disk Footprint:** ~342 MB
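
A quick sanity check of the parameter count is to sum the element counts of the loaded model's tensors; a minimal sketch with `transformers`:

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "CrabInHoney/morphbert-large-morpheme-segmentation-ru"
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected ≈ 85.5M
```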

## Usage

The model can be easily used with the Hugging Face `transformers` library. It processes words character by character.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "CrabInHoney/morphbert-large-morpheme-segmentation-ru"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

def analyze(word):
    # Treat each character as a separate "word" so the model assigns
    # one morpheme label per character.
    tokens = list(word)
    encoded = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=34)
    with torch.no_grad():
        logits = model(**encoded).logits
        predictions = logits.argmax(dim=-1)[0]

    # Map token positions back to characters and merge consecutive
    # characters that share a label into morpheme chunks.
    word_ids = encoded.word_ids()
    output = []
    current_label = None
    current_chunk = []

    for i, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx < len(tokens):
            label_id = predictions[i].item()
            label = model.config.id2label[label_id]
            token = tokens[word_idx]

            if label == current_label:
                current_chunk.append(token)
            else:
                if current_chunk:
                    chunk_str = "".join(current_chunk)
                    output.append(f"{chunk_str}:{current_label}")
                current_chunk = [token]
                current_label = label

    # Flush the final chunk.
    if current_chunk:
        chunk_str = "".join(current_chunk)
        output.append(f"{chunk_str}:{current_label}")

    return " / ".join(output)

# Example words
for word in ["масляный", "предчувствий", "тарковский", "кот", "подгон", "сине-белый", "шторы", "абажур", "дедлайн", "веб-сайт", "адаптированная", "формообразующий"]:
    print(f"{word} → {analyze(word)}")

```

## Example Predictions

```
масляный → масл:ROOT / ян:SUFF / ый:END
предчувствий → пред:PREF / чу:ROOT / в:SUFF / ств:SUFF1 / ий:END
тарковский → тарк:ROOT / ов:SUFF / ск:SUFF1 / ий:END
кот → кот:ROOT
подгон → под:PREF / гон:ROOT
сине-белый → син:ROOT / е:LINK / -:HYPH / бел:ROOT1 / ый:END
шторы → штор:ROOT / ы:END
абажур → абажур:ROOT
дедлайн → дедлайн:ROOT
веб-сайт → веб:ROOT / -:HYPH / сайт:ROOT1
адаптированная → адапт:ROOT / ир:SUFF / ова:SUFF1 / нн:SUFF2 / ая:END
формообразующий → форм:ROOT / о:LINK / образу:ROOT1 / ющ:SUFF / ий:END
```

## Performance

The model achieves a character-level accuracy of approximately **0.99** on its evaluation dataset.
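
Here, character-level accuracy is the fraction of characters whose predicted morpheme label matches the gold label. A minimal sketch of the metric (the `char_accuracy` helper is illustrative and not part of this repository):

```python
def char_accuracy(pred_labels, gold_labels):
    """Fraction of aligned characters whose predicted label matches the gold label."""
    assert len(pred_labels) == len(gold_labels)
    return sum(p == g for p, g in zip(pred_labels, gold_labels)) / len(gold_labels)

# Illustrative check against the gold segmentation под:PREF / гон:ROOT
pred = ["PREF", "PREF", "PREF", "ROOT", "ROOT", "ROOT"]  # per-character output for "подгон"
gold = ["PREF", "PREF", "PREF", "ROOT", "ROOT", "ROOT"]
print(char_accuracy(pred, gold))  # 1.0
```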

## Limitations

*   Performance may vary on out-of-vocabulary words, neologisms, or highly complex morphological structures not sufficiently represented in the training data.
*   The model operates strictly at the character level; it does not incorporate broader lexical or syntactic context.
*   Ambiguous cases in morpheme boundaries might be resolved based on patterns learned during training, which may not always align with linguistic conventions in edge cases.