ByT5-Small Polish Text Normalization

This model is a fine-tuned version of Google's ByT5-small designed for Polish text normalization. It converts non-standard text (abbreviations, dates, numbers, addresses) into fully written Polish forms, a key preprocessing step for applications that require correct pronunciation and readability.

Primary Applications:

  • Text-to-Speech (TTS) Systems: Ensures dates like "15.09.2024" are pronounced as "piętnastego września dwa tysiące dwudziestego czwartego" rather than spelled out character by character
  • Speech Synthesis: Prepares text for natural-sounding voice generation by expanding abbreviations and numbers
  • Voice Assistants: Normalizes user queries and system responses for consistent audio output
  • Audiobook Production: Converts written text into speech-ready format for automated narration
  • Accessibility Tools: Helps screen readers properly pronounce Polish text with numbers, dates, and abbreviations
  • Language Learning Apps: Provides correct pronunciation examples for Polish learners
  • Broadcasting & Media: Automates script preparation for news reading and automated announcements

Model Description

Model Type: T5 (Text-to-Text Transfer Transformer)
Language: Polish
Base Model: google/byt5-small
Task: Text Normalization
License: MIT

Model Details

This model performs text normalization for Polish language, converting:

  • Dates (15 września 1631 → piętnastego września tysiąc sześćset trzydziestego pierwszego)
  • Numbers and prices (1234,56 złotych → tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy)
  • Times (14:30 → czternasta trzydzieści)
  • Addresses (ul. Marszałkowska 123/45 → ulica Marszałkowska sto dwadzieścia trzy przez czterdzieści pięć)
  • Abbreviations and other non-standard text forms

Intended Use

  • Primary Use: Polish text normalization for TTS (Text-to-Speech) systems
  • Secondary Uses:
    • Text preprocessing for speech synthesis
    • Document standardization
    • Accessibility applications
    • NLP preprocessing pipelines

Usage

Quick Start

from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Folx/byt5-small-pl-text-normalization"
tokenizer = AutoTokenizer.from_pretrained(model_name, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Move the model to GPU when available; bfloat16 generation on CPU can be slow
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Normalize text
def normalize_text(text, model, tokenizer):
    # The model expects the task prefix it was fine-tuned with
    input_text = f"normalize: {text}"
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=400,
            num_beams=2,
            early_stopping=True,
            do_sample=False
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Example usage
text = "Spotkanie odbędzie się 3 maja o godzinie 14:30."
normalized = normalize_text(text, model, tokenizer)
print(normalized)  # Spotkanie odbędzie się trzeciego maja o godzinie czternastej trzydzieści.

Batch Processing

class PolishNormalizer:
    def __init__(self, model_path):
        # Reuses the imports from the Quick Start example above
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=False)
        self.model = T5ForConditionalGeneration.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self.model.to(self.device).eval()

    def normalize(self, text, num_beams=2):
        input_text = f"normalize: {text}"
        inputs = self.tokenizer(input_text, return_tensors='pt', truncation=True).to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=400,
                num_beams=num_beams,  # 1-2 recommended for speed/quality balance
                early_stopping=True,
                do_sample=False
            )

        return self.tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Initialize once for multiple texts
normalizer = PolishNormalizer("Folx/byt5-small-pl-text-normalization")

texts = [
    "Dnia 15 września 1631 roku odbyła się ceremonia.",
    "Cena wynosi 1234,56 złotych.",
    "Na ul. Marszałkowskiej 123/45 mieści się sklep."
]

for text in texts:
    normalized = normalizer.normalize(text)
    print(f"Original: {text}")
    print(f"Normalized: {normalized}")
    print()
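The class above still runs one text per forward pass. For higher throughput, inputs can be padded and generated together in a single batch. The helper below is a sketch built on the same tokenizer and model; the name normalize_batch is illustrative, not part of the released code.

def normalize_batch(normalizer, texts, num_beams=2):
    # Prefix every input and pad the batch to a common length
    inputs = normalizer.tokenizer(
        [f"normalize: {t}" for t in texts],
        return_tensors='pt', padding=True, truncation=True
    ).to(normalizer.device)

    with torch.no_grad():
        outputs = normalizer.model.generate(
            **inputs,
            max_length=400,
            num_beams=num_beams,
            early_stopping=True,
            do_sample=False
        )

    return [normalizer.tokenizer.decode(o, skip_special_tokens=True).strip()
            for o in outputs]

print(normalize_batch(normalizer, texts))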

Performance

Speed Benchmarks (NVIDIA A100-SXM4-40GB)

  • Average inference time: ~0.76 seconds per text (±0.31s)
  • Range: 0.31s - 1.09s per text
  • Throughput: ~1.3 texts per second
  • Model loading time: ~0.55 seconds
  • Memory usage: ~2GB GPU memory (bfloat16)
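These figures depend on GPU, driver, and input length. A minimal way to measure a comparable per-text latency with the PolishNormalizer from the Usage section (the warm-up count and text set are arbitrary choices, not part of the original benchmark):

import time

def average_latency(normalizer, texts, warmup=2):
    # Warm-up runs exclude one-time CUDA and allocator initialization costs
    for t in texts[:warmup]:
        normalizer.normalize(t)
    start = time.perf_counter()
    for t in texts:
        normalizer.normalize(t)
    return (time.perf_counter() - start) / len(texts)

print(f"~{average_latency(normalizer, texts):.2f}s per text")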

Model Size

  • Parameters: ~300M (ByT5-small)
  • Model size: ~1.2GB
  • Precision: bfloat16 for optimal performance
  • Recommended beam size: 1-2 beams (good balance of speed vs quality)

Training Details

Training Data

  • Polish text corpus with normalization pairs
  • Domains: dates, numbers, currencies, addresses, abbreviations
  • Training examples: Various Polish text normalization patterns

Training Procedure

  • Base model: google/byt5-small
  • Fine-tuning approach: Task-specific fine-tuning for Polish normalization
  • Input format: "normalize: {text_to_normalize}"
  • Output format: Normalized Polish text
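The training script itself is not published; the sketch below only illustrates how input/target pairs matching this format could be tokenized for seq2seq fine-tuning. The example pair and field names are hypothetical.

def preprocess(pair, tokenizer, max_input=512, max_target=400):
    # Inputs carry the "normalize: " task prefix; targets are the written-out text
    model_inputs = tokenizer(
        f"normalize: {pair['source']}", max_length=max_input, truncation=True
    )
    labels = tokenizer(pair['target'], max_length=max_target, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

pair = {
    'source': 'Cena wynosi 1234,56 złotych.',
    'target': 'Cena wynosi tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy.',
}
print(preprocess(pair, tokenizer))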

Hyperparameters

  • Max input length: 512 tokens (ByT5 operates on UTF-8 bytes, so one token corresponds to one byte)
  • Max output length: 400 tokens
  • Beam search: 2 beams (recommended)
  • Precision: bfloat16

Limitations and Bias

Limitations

  • Designed specifically for Polish language
  • May not handle very rare abbreviations or domain-specific terminology
  • Performance may vary with very long texts (>512 tokens); see the chunking sketch after this list
  • Inference time can vary significantly (0.3-1.1s) depending on text complexity
  • Requires GPU for reasonable inference speed (~0.76s per text on A100)
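A simple workaround for inputs beyond the 512-token window is to split the text into sentences and normalize each piece independently. The naive regex splitter below is an assumption; a dedicated Polish sentence tokenizer would be more robust.

import re

def normalize_long_text(normalizer, text):
    # Naive split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return ' '.join(normalizer.normalize(s) for s in sentences if s.strip())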

Bias Considerations

  • Training data may reflect biases present in Polish text corpora
  • Model may have regional variations in normalization preferences
  • Users should validate outputs for critical applications

Example Outputs

  • "Dnia 15 września 1631 roku odbyła się ceremonia."
    → "Dnia piętnastego września tysiąc sześćset trzydziestego pierwszego roku odbyła się ceremonia."
  • "W roku 2024 nastąpi wielka zmiana."
    → "W roku dwa tysiące dwudziestym czwartym nastąpi wielka zmiana."
  • "Spotkanie odbędzie się 3 maja o godzinie 14:30."
    → "Spotkanie odbędzie się trzeciego maja o godzinie czternastej trzydzieści."
  • "Cena wynosi 1234,56 złotych."
    → "Cena wynosi tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy."
  • "Na ul. Marszałkowskiej 123/45 mieści się sklep."
    → "Na ulicy Marszałkowskiej sto dwadzieścia trzy na czterdzieści pięć mieści się sklep."

Requirements

torch>=1.9.0
transformers>=4.20.0
numpy

Installation

pip install torch transformers numpy

Model Card Contact

For questions about this model, please open an issue on the model repository or contact [email protected].


Note: This model is optimized for Polish text normalization. For other languages or tasks, consider using language-specific models or the base ByT5 model with appropriate fine-tuning.
