---
license: cc-by-4.0
language:
- cs
- sk
tags:
- electra
- small
- bilingual
---

# Bilingual ELECTRA (Czech-Slovak)

Bilingual ELECTRA (Czech-Slovak) is an [ELECTRA](https://arxiv.org/abs/2003.10555)-small model pretrained on a mixed Czech and Slovak corpus. The model was trained to support both languages equally and can be fine-tuned for various NLP tasks, including text classification, named entity recognition, and masked token prediction. The model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), which allows commercial use.
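
For sequence-level tasks such as text classification, the checkpoint can be loaded into a task-specific head from the Transformers library. The snippet below is only a minimal, illustrative sketch (not part of this release): `num_labels` and the example sentences are placeholders, and the classification head is randomly initialized until it is fine-tuned on your own data.

```python
from transformers import AutoTokenizer, ElectraForSequenceClassification

# Hypothetical fine-tuning setup: binary text classification.
# num_labels is a placeholder; set it to match your downstream dataset.
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")
model = ElectraForSequenceClassification.from_pretrained(
    "AILabTUL/BiELECTRA-czech-slovak", num_labels=2
)

# Encode a small Czech/Slovak batch and run a forward pass.
# The classification head is newly initialized, so the logits are
# meaningful only after fine-tuning on labelled data.
batch = tokenizer(
    ["Toto je první věta.", "Toto je druhá veta."],
    padding=True,
    return_tensors="pt",
)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```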

### Tokenization

The model uses a **SentencePiece tokenizer** and requires a SentencePiece model file (`m.model`) for proper tokenization. You can use either the HuggingFace AutoTokenizer (recommended) or SentencePiece directly.

#### Using HuggingFace AutoTokenizer (Recommended)

```python
from transformers import AutoTokenizer, ElectraForPreTraining

# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./CZSK")

# Load the pretrained model
model = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Tokenize input text
sentence = "Toto je testovací věta v češtině a slovenčine."
inputs = tokenizer(sentence, return_tensors="pt")

# Run inference
outputs = model(**inputs)
```
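
The `outputs` above contain raw discriminator logits. As an illustrative follow-up (not part of the original example), continuing from the snippet above, you can convert them into per-token replaced-token probabilities and pair them with the tokens:

```python
import torch

# outputs.logits holds one score per input token; sigmoid converts each
# score into the probability that the discriminator flags the token as
# "replaced" (ELECTRA's pretraining objective).
probabilities = torch.sigmoid(outputs.logits)[0]

# Pair each probability with its token for inspection.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, prob in zip(tokens, probabilities.tolist()):
    print(f"{token}\t{prob:.3f}")
```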

#### Using SentencePiece directly

```python
from transformers import ElectraForPreTraining
import sentencepiece as spm
import torch

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Load the pretrained model
discriminator = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Tokenize input text (note: input should be lowercase)
sentence = "toto je testovací věta v češtině a slovenčine."
tokens = sp.encode(sentence, out_type=str)
token_ids = sp.encode(sentence)

# Convert to tensor
input_tensor = torch.tensor([token_ids])

# Run inference
outputs = discriminator(input_tensor)
predictions = torch.nn.Sigmoid()(outputs[0]).cpu().detach().numpy()
```
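
The `predictions` array from the snippet above holds one probability per SentencePiece token. A brief, illustrative way to inspect them (not part of the original example):

```python
# predictions has shape (1, sequence_length): one replaced-token probability
# per SentencePiece token from the snippet above.
for token, score in zip(tokens, predictions[0]):
    print(f"{token}\t{score:.3f}")
```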

---

## Citation

This model was published as part of the research paper:

**"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"**
*Martin Poláček, Petr Červa*
*RANLP Student Workshop 2025*

Citation information will be provided after the conference publication.

---

## Related Models

- **Multilingual**: [AILabTUL/mELECTRA](https://huggingface.co/AILabTUL/mELECTRA)
- **Norwegian-Swedish**: [AILabTUL/BiELECTRA-norwegian-swedish](https://huggingface.co/AILabTUL/BiELECTRA-norwegian-swedish)