---
license: cc-by-4.0
language:
- cs
- sk
tags:
- electra
- small
- bilingual
---

# Bilingual ELECTRA (Czech-Slovak)

Bilingual ELECTRA (Czech-Slovak) is an [ELECTRA](https://arxiv.org/abs/2003.10555)-small model pretrained on a mixed Czech and Slovak corpus. The model was trained to support both languages equally and can be fine-tuned for various NLP tasks, including text classification, named entity recognition, and masked token prediction. The model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), which allows commercial use.
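
For sequence-level tasks such as text classification, the checkpoint can be loaded into a task-specific head from the Transformers library. The snippet below is only a minimal, illustrative sketch (not part of this release): `num_labels` and the example sentences are placeholders, and the classification head is randomly initialized until it is fine-tuned on your own data.

```python
from transformers import AutoTokenizer, ElectraForSequenceClassification

# Hypothetical fine-tuning setup: binary text classification.
# num_labels is a placeholder; set it to match your downstream dataset.
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")
model = ElectraForSequenceClassification.from_pretrained(
    "AILabTUL/BiELECTRA-czech-slovak", num_labels=2
)

# Encode a small Czech/Slovak batch and run a forward pass.
# The classification head is newly initialized, so the logits are
# meaningful only after fine-tuning on labelled data.
batch = tokenizer(
    ["Toto je první věta.", "Toto je druhá veta."],
    padding=True,
    return_tensors="pt",
)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```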

### Tokenization

The model uses a **SentencePiece tokenizer** and requires a SentencePiece model file (`m.model`) for proper tokenization. You can use either the HuggingFace AutoTokenizer (recommended) or SentencePiece directly.

#### Using HuggingFace AutoTokenizer (Recommended)

```python
from transformers import AutoTokenizer, ElectraForPreTraining

# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./CZSK")

# Load the pretrained model
model = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Tokenize input text
sentence = "Toto je testovací věta v češtině a slovenčine."
inputs = tokenizer(sentence, return_tensors="pt")

# Run inference
outputs = model(**inputs)
```
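
The `outputs` above contain raw discriminator logits. As an illustrative follow-up (not part of the original example), continuing from the snippet above, you can convert them into per-token replaced-token probabilities and pair them with the tokens:

```python
import torch

# outputs.logits holds one score per input token; sigmoid converts each
# score into the probability that the discriminator flags the token as
# "replaced" (ELECTRA's pretraining objective).
probabilities = torch.sigmoid(outputs.logits)[0]

# Pair each probability with its token for inspection.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, prob in zip(tokens, probabilities.tolist()):
    print(f"{token}\t{prob:.3f}")
```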

#### Using SentencePiece directly

```python
from transformers import ElectraForPreTraining
import sentencepiece as spm
import torch

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Load the pretrained model
discriminator = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Tokenize input text (note: input should be lowercase)
sentence = "toto je testovací věta v češtině a slovenčine."
tokens = sp.encode(sentence, out_type=str)
token_ids = sp.encode(sentence)

# Convert to tensor
input_tensor = torch.tensor([token_ids])

# Run inference
outputs = discriminator(input_tensor)
predictions = torch.nn.Sigmoid()(outputs[0]).cpu().detach().numpy()
```
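
The `predictions` array from the snippet above holds one probability per SentencePiece token. A brief, illustrative way to inspect them (not part of the original example):

```python
# predictions has shape (1, sequence_length): one replaced-token probability
# per SentencePiece token from the snippet above.
for token, score in zip(tokens, predictions[0]):
    print(f"{token}\t{score:.3f}")
```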

---

## Citation

This model was published as part of the research paper:

**"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"**
*Martin Poláček, Petr Červa*
*RANLP Student Workshop 2025*

Citation information will be provided after the conference publication.

---

## Related Models

- **Multilingual**: [AILabTUL/mELECTRA](https://huggingface.co/AILabTUL/mELECTRA)
- **Norwegian-Swedish**: [AILabTUL/BiELECTRA-norwegian-swedish](https://huggingface.co/AILabTUL/BiELECTRA-norwegian-swedish)