ByT5-Small Polish Text Normalization
This model is a fine-tuned version of Google's ByT5-small specifically designed for Polish text normalization tasks. It converts non-standard text (abbreviations, dates, numbers, addresses) into their fully written Polish forms, making it essential for applications that require proper pronunciation and readability.
Primary Applications:
- Text-to-Speech (TTS) Systems: Ensures dates like "15.09.2024" are pronounced as "piętnastego września dwa tysiące dwudziestego czwartego" rather than spelled out character by character
- Speech Synthesis: Prepares text for natural-sounding voice generation by expanding abbreviations and numbers
- Voice Assistants: Normalizes user queries and system responses for consistent audio output
- Audiobook Production: Converts written text into speech-ready format for automated narration
- Accessibility Tools: Helps screen readers properly pronounce Polish text with numbers, dates, and abbreviations
- Language Learning Apps: Provides correct pronunciation examples for Polish learners
- Broadcasting & Media: Automates script preparation for news reading and automated announcements
Model Description
Model Type: T5 (Text-to-Text Transfer Transformer)
Language: Polish
Base Model: google/byt5-small
Task: Text Normalization
License: MIT
Model Details
This model performs text normalization for Polish language, converting:
- Dates (15 września 1631 → piętnastego września tysiąc sześćset trzydziestego pierwszego)
- Numbers and prices (1234,56 złotych → tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy)
- Times (14:30 → czternasta trzydzieści)
- Addresses (ul. Marszałkowska 123/45 → ulica Marszałkowska sto dwadzieścia trzy przez czterdzieści pięć)
- Abbreviations and other non-standard text forms
Intended Use
- Primary Use: Polish text normalization for TTS (Text-to-Speech) systems
- Secondary Uses:
  - Text preprocessing for speech synthesis
  - Document standardization
  - Accessibility applications
  - NLP preprocessing pipelines
Usage
Quick Start
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Folx/byt5-small-pl-text-normalization"
tokenizer = AutoTokenizer.from_pretrained(model_name, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Normalize text
def normalize_text(text, model, tokenizer):
    input_text = f"normalize: {text}"
    # Move inputs to the same device as the model (CPU or GPU)
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=400,
            num_beams=2,
            early_stopping=True,
            do_sample=False
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Example usage
text = "Spotkanie odbędzie się 3 maja o godzinie 14:30."
normalized = normalize_text(text, model, tokenizer)
print(normalized)
# "Spotkanie odbędzie się trzeciego maja o godzinie czternastej trzydzieści."
Batch Processing
class PolishNormalizer:
    def __init__(self, model_path, device=None):
        # Default to GPU when available; inference is much faster there
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=False)
        self.model = T5ForConditionalGeneration.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        ).to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=2):
        input_text = f"normalize: {text}"
        inputs = self.tokenizer(input_text, return_tensors='pt', truncation=True).to(self.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=400,
                num_beams=num_beams,  # 1-2 recommended for a speed/quality balance
                early_stopping=True,
                do_sample=False
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Initialize once and reuse for multiple texts
normalizer = PolishNormalizer("Folx/byt5-small-pl-text-normalization")
texts = [
    "Dnia 15 września 1631 roku odbyła się ceremonia.",
    "Cena wynosi 1234,56 złotych.",
    "Na ul. Marszałkowskiej 123/45 mieści się sklep."
]
for text in texts:
    normalized = normalizer.normalize(text)
    print(f"Original: {text}")
    print(f"Normalized: {normalized}")
    print()
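The loop above still issues one generate call per text. When throughput matters, the tokenizer can pad several inputs to a common length and the model can decode them in a single call. A minimal sketch, reusing the model and tokenizer objects defined above (the batch size is illustrative, not tuned):

def normalize_batch(texts, model, tokenizer, batch_size=8):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = [f"normalize: {t}" for t in texts[i:i + batch_size]]
        # Pad the batch to a common length so it fits in one tensor
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True).to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=400,
                num_beams=2,
                early_stopping=True,
                do_sample=False
            )
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return [r.strip() for r in results]

print(normalize_batch(texts, normalizer.model, normalizer.tokenizer))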
Performance
Speed Benchmarks (NVIDIA A100-SXM4-40GB)
- Average inference time: ~0.76 seconds per text (±0.31s); see the timing sketch after this list
- Range: 0.31s - 1.09s per text
- Throughput: ~1.3 texts per second
- Model loading time: ~0.55 seconds
- Memory usage: ~2GB GPU memory (bfloat16)
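A rough way to reproduce these figures on your own hardware, reusing the PolishNormalizer instance from the Batch Processing section (the warm-up run and sample text are illustrative):

import time

text = "Spotkanie odbędzie się 3 maja o godzinie 14:30."
normalizer.normalize(text)  # warm-up run so one-off setup costs are excluded

start = time.perf_counter()
normalized = normalizer.normalize(text)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s -> {normalized}")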
Model Size
- Parameters: ~300M (ByT5-small)
- Model size: ~1.2GB
- Precision: bfloat16 for optimal performance
- Recommended beam size: 1-2 beams (good balance of speed vs quality)
Training Details
Training Data
- Polish text corpus with normalization pairs
- Domains: dates, numbers, currencies, addresses, abbreviations
- Training examples: Various Polish text normalization patterns
Training Procedure
- Base model: google/byt5-small
- Fine-tuning approach: Task-specific fine-tuning for Polish normalization
- Input format: "normalize: {text_to_normalize}"
- Output format: normalized Polish text (a formatting sketch follows this list)
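The training script itself is not published in this card; the snippet below is only an illustrative sketch of how a source/target pair could be formatted to match the formats above. The helper make_training_pair is hypothetical, and the example pair is copied from the Example Outputs section.

def make_training_pair(raw_text, normalized_text):
    # Source side carries the same "normalize:" task prefix used at inference time
    return {"input": f"normalize: {raw_text}", "target": normalized_text}

pair = make_training_pair(
    "Cena wynosi 1234,56 złotych.",
    "Cena wynosi tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy."
)
print(pair["input"])  # normalize: Cena wynosi 1234,56 złotych.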
Hyperparameters
- Max input length: 512 tokens (ByT5 operates on raw bytes, so tokens correspond to bytes)
- Max output length: 400 tokens
- Beam search: 2 beams (recommended)
- Precision: bfloat16
Limitations and Bias
Limitations
- Designed specifically for the Polish language
- May not handle very rare abbreviations or domain-specific terminology
- Performance may vary with very long texts (>512 tokens); a chunking sketch follows this list
- Inference time can vary significantly (0.3-1.1s) depending on text complexity
- Requires GPU for reasonable inference speed (~0.76s per text on A100)
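For texts longer than the 512-token input limit, one workaround is to split on sentence boundaries and normalize each chunk separately. A minimal sketch, assuming a naive punctuation-based splitter (a proper Polish sentence tokenizer would be more robust, e.g. around abbreviations like "ul."):

import re

def normalize_long_text(text, normalizer, max_bytes=512):
    # Naive split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        # ByT5 consumes raw bytes, so measure chunk size in UTF-8 bytes
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return " ".join(normalizer.normalize(chunk) for chunk in chunks)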
Bias Considerations
- Training data may reflect biases present in Polish text corpora
- Model may have regional variations in normalization preferences
- Users should validate outputs for critical applications; a minimal automated check is sketched below
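One cheap automated check for that validation step: normalized Polish output should spell all numbers out, so any remaining digit suggests a span was left unexpanded. A minimal sketch, reusing the normalizer instance from above:

import re

def looks_fully_normalized(text):
    # Any digit surviving normalization is a signal of an unexpanded span
    return re.search(r'\d', text) is None

normalized = normalizer.normalize("Cena wynosi 1234,56 złotych.")
if not looks_fully_normalized(normalized):
    print(f"Possible unnormalized span: {normalized}")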
Example Outputs
| Original Text | Normalized Text |
|---|---|
| "Dnia 15 września 1631 roku odbyła się ceremonia." | "Dnia piętnastego września tysiąc sześćset trzydziestego pierwszego roku odbyła się ceremonia." |
| "W roku 2024 nastąpi wielka zmiana." | "W roku dwa tysiące dwudziestym czwartym nastąpi wielka zmiana." |
| "Spotkanie odbędzie się 3 maja o godzinie 14:30." | "Spotkanie odbędzie się trzeciego maja o godzinie czternastej trzydzieści." |
| "Cena wynosi 1234,56 złotych." | "Cena wynosi tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy." |
| "Na ul. Marszałkowskiej 123/45 mieści się sklep." | "Na ulicy Marszałkowskiej sto dwadzieścia trzy na czterdzieści pięć mieści się sklep." |
Requirements
torch>=1.9.0
transformers>=4.20.0
numpy
Installation
pip install torch transformers numpy
Model Card Contact
For questions about this model, please open an issue on the model repository or contact [email protected].
Note: This model is optimized for Polish text normalization. For other languages or tasks, consider using language-specific models or the base ByT5 model with appropriate fine-tuning.