---
license: cc-by-4.0
language:
- cs
- sk
tags:
- electra
- small
- bilingual
---
# Bilingual ELECTRA (Czech-Slovak)
Bilingual ELECTRA (Czech-Slovak) is an [ELECTRA](https://arxiv.org/abs/2003.10555)-small model pretrained on a mixed Czech and Slovak corpus. It was trained to support both languages equally and can be fine-tuned for various NLP tasks, including text classification, named entity recognition, and masked token prediction. The model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), which permits commercial use.
### Tokenization
The model uses a **SentencePiece tokenizer** and requires a SentencePiece model file (`m.model`) for proper tokenization. You can tokenize either with the Hugging Face `AutoTokenizer` (recommended) or with SentencePiece directly.
#### Using HuggingFace AutoTokenizer (Recommended)
```python
from transformers import AutoTokenizer, ElectraForPreTraining
# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")
# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./CZSK")
# Load the pretrained model
model = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")
# Tokenize input text
sentence = "Toto je testovací věta v češtině a slovenčine."
inputs = tokenizer(sentence, return_tensors="pt")
# Run inference; outputs.logits holds one discriminator score per input token
outputs = model(**inputs)
```
#### Using SentencePiece directly
```python
from transformers import ElectraForPreTraining
import sentencepiece as spm
import torch
# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")
# Load the pretrained model
discriminator = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")
# Tokenize input text (note: input should be lowercase)
sentence = "toto je testovací věta v češtině a slovenčine."
tokens = sp.encode(sentence, out_type=str)
token_ids = sp.encode(sentence)
# Convert to tensor
input_tensor = torch.tensor([token_ids])
# Run inference
outputs = discriminator(input_tensor)
# Sigmoid maps each logit to the probability that the token was replaced
predictions = torch.sigmoid(outputs.logits).detach().cpu().numpy()
```
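The `predictions` array above contains one value per token: the discriminator's estimated probability that the token was replaced rather than original. The following is a minimal sketch of how such scores could be paired with the token strings and thresholded. It uses plain Python so it runs without downloading the model; the helper names `sigmoid` and `flag_replaced`, the 0.5 threshold, and the example tokens and logits are illustrative assumptions, not real model output.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def flag_replaced(tokens, logits, threshold=0.5):
    """Pair each token with its replacement probability and a boolean flag.

    `tokens` and `logits` must be aligned one-to-one (hypothetical helper).
    """
    if len(tokens) != len(logits):
        raise ValueError("tokens and logits must have the same length")
    results = []
    for token, logit in zip(tokens, logits):
        prob = sigmoid(logit)
        results.append((token, prob, prob > threshold))
    return results


# Hypothetical values for illustration only, not real model output.
example_tokens = ["▁toto", "▁je", "▁test"]
example_logits = [-4.0, -3.2, 1.5]

for token, prob, is_replaced in flag_replaced(example_tokens, example_logits):
    label = "replaced" if is_replaced else "original"
    print(f"{token}\t{prob:.3f}\t{label}")
```

With the SentencePiece snippet above, `tokens` and the logits for the single input sentence have the same length, so they can be passed to such a helper directly. When tokenizing with `AutoTokenizer`, any special tokens it adds would also receive a score, so the alignment should be checked first.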
---
## Citation
This model was published as part of the research paper:
**"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"**
*Martin Poláček, Petr Červa*
*RANLP Student Workshop 2025*
Citation information will be provided after the conference publication.
---
## Related Models
- **Multilingual**: [AILabTUL/mELECTRA](https://huggingface.co/AILabTUL/mELECTRA)
- **Norwegian-Swedish**: [AILabTUL/BiELECTRA-norwegian-swedish](https://huggingface.co/AILabTUL/BiELECTRA-norwegian-swedish)