Bilingual ELECTRA (Norwegian-Swedish)

Bilingual ELECTRA (Norwegian-Swedish) is an ELECTRA-small model pretrained on a mixed Norwegian and Swedish corpus. It was trained to support both languages equally and can be fine-tuned for NLP tasks such as text classification, named entity recognition, and masked token prediction. The model is released under the CC BY 4.0 license, which permits commercial use.
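
As a rough illustration of a fine-tuning setup, the sketch below attaches a token-classification head (as one might for named entity recognition) to the pretrained encoder. The label set EXAMPLE_LABELS and the helper names build_label_maps and load_token_classifier are hypothetical and not part of this model card. The classification head is newly initialized, so the model needs to be trained on labeled data before its predictions mean anything.

```python
from typing import Dict, List, Tuple

MODEL_ID = "AILabTUL/BiELECTRA-norwegian-swedish"

# Hypothetical label set for illustration only; replace it with your task's labels.
EXAMPLE_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Return (id2label, label2id) mappings for a list of label names."""
    id2label = {i: label for i, label in enumerate(labels)}
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_token_classifier(labels: List[str]):
    """Load the tokenizer and the encoder with a new token-classification head."""
    # Imported here so the label helpers above work without transformers installed.
    from transformers import AutoTokenizer, ElectraForTokenClassification

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = ElectraForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling load_token_classifier(EXAMPLE_LABELS) downloads the checkpoint and returns a tokenizer and model that can be passed to a standard training loop. For sentence-level tasks, ElectraForSequenceClassification can be loaded the same way.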

Tokenization

The model uses a SentencePiece tokenizer and requires a SentencePiece model file (m.model) for proper tokenization. You can use either the HuggingFace AutoTokenizer (recommended) or SentencePiece directly.

Using HuggingFace AutoTokenizer (Recommended)

from transformers import AutoTokenizer, ElectraForPreTraining

# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-norwegian-swedish")

# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./NOSWE")

# Load the pretrained model
model = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-norwegian-swedish")

# Tokenize input text
sentence = "Dette er en testsetning på norsk og svenska."
inputs = tokenizer(sentence, return_tensors="pt")

# Run inference
outputs = model(**inputs)

Using SentencePiece directly

from transformers import ElectraForPreTraining
import sentencepiece as spm
import torch

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Load the pretrained model
discriminator = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-norwegian-swedish")

# Tokenize input text (note: input should be lowercase)
sentence = "dette er en testsetning på norsk og svenska."
tokens = sp.encode(sentence, out_type=str)
token_ids = sp.encode(sentence)

# Convert to tensor
input_tensor = torch.tensor([token_ids])

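# Run inference without tracking gradients
with torch.no_grad():
    outputs = discriminator(input_tensor)

# Per-token probability that the token was replaced (sigmoid of the discriminator logits)
predictions = torch.sigmoid(outputs.logits).cpu().numpy()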

Citation

This model was published as part of the research paper:

"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"
Martin Poláček, Petr Červa
RANLP Student Workshop 2025

Citation information will be provided after the conference publication.

Related Models

Multilingual: AILabTUL/mELECTRA
Czech-Slovak: AILabTUL/BiELECTRA-czech-slovak
