Model Card for m2m100-ukr-verbalization

Model Description

m2m100-ukr-verbalization is a fine-tuned version of the facebook/m2m100_418M model, specifically designed for the task of verbalizing Ukrainian text to prepare it for Text-to-Speech (TTS) systems. This model aims to transform structured data like numbers, dates, measurements, and other non-verbal elements into their fully expanded textual representations in Ukrainian.

Architecture

This model is based on the facebook/m2m100_418M architecture, a many-to-many multilingual translation model capable of translating directly between any pair of 100 languages.

Training Data

The model was fine-tuned on a subset of Ukrainian sentences from the Ubertext dataset, focusing on news content. Each sentence is paired with a verbalized equivalent that shows how numeric and symbolic expressions are written out in full in Ukrainian.

Dataset: skypro1111/ubertext-2-news-verbalized
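
The training script in the next section reads records with a `text` field (the raw sentence) and a `verbalized` field (its expanded form). The sketch below illustrates that pair format, assuming one JSON object per line; the actual layout of the author's file is not documented here, the file name `sample.jsonl` is hypothetical, and the sentence pair is copied from the benchmark in this card.

```python
import json
import os
import tempfile

# One training pair: raw text and its fully verbalized form.
# The pair is copied from the benchmark section of this card.
record = {
    "text": "Відстань між містами становить 450 км або 280 миль.",
    "verbalized": "Відстань між містами становить чотириста п'ятдесят кілометрів або двісті вісімдесят миль.",
}

def write_jsonl(path, records):
    """Write records as JSON Lines, keeping Cyrillic text human-readable."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read JSON Lines back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(path, [record])
pairs = read_jsonl(path)
print(pairs[0]["text"], "->", pairs[0]["verbalized"])
```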

Training Procedure

The model was fine-tuned using the script provided in the repository:

# training script for fine-tuning M2M100 model on Ukrainian verbalization dataset
import os
from datasets import load_dataset
from transformers import (
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
import torch

def main():
    model_name = "facebook/m2m100_418M"
    dataset_path = os.path.join(os.path.dirname(__file__), "dataset_ubertext_1kk_cleaned_m2m100.json")
    output_dir = os.path.join(os.path.dirname(__file__), "../m2m100-ukr-verbalization")

    # load dataset
    raw_datasets = load_dataset("json", data_files={"train": dataset_path})
    # load tokenizer and model
    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # set source language to Ukrainian
    tokenizer.src_lang = "uk"
    # set target language to Ukrainian as well (the verbalized output is also Ukrainian)
    tokenizer.tgt_lang = "uk"

    max_input_length = 128
    max_target_length = 128

    def preprocess_function(examples):
        inputs = examples["text"]
        targets = examples["verbalized"]
        model_inputs = tokenizer(
            inputs,
            max_length=max_input_length,
            truncation=True,
            padding="max_length"
        )
        # tokenize targets (text_target replaces the deprecated as_target_tokenizer context)
        labels = tokenizer(
            text_target=targets,
            max_length=max_target_length,
            truncation=True,
            padding="max_length"
        )
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    # tokenize and prepare dataset
    tokenized_datasets = raw_datasets["train"].map(
        preprocess_function,
        batched=True,
        remove_columns=["text", "verbalized"],
        num_proc=8,
        load_from_cache_file=True,
    )

    # data collator for seq2seq tasks
    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=model,
        label_pad_token_id=-100
    )

    # training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        per_device_train_batch_size=28,
        num_train_epochs=3,
        learning_rate=1e-5,
        save_strategy="epoch",
        logging_steps=50,
        fp16=torch.cuda.is_available(),
    )

    # initialize trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets,
        tokenizer=tokenizer,
        data_collator=data_collator
    )

    # start training
    trainer.train()
    trainer.save_model(output_dir)

if __name__ == "__main__":
    main()
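
One detail of the script above is worth noting. Because the targets are padded with `padding="max_length"` during preprocessing, the label sequences already contain pad token ids, and `label_pad_token_id=-100` in the collator only applies to padding that the collator itself adds. A common remedy, which is not part of the original script, is to replace pad ids in the labels with -100 so the loss ignores them. A minimal sketch, where `mask_pad_labels` is a hypothetical helper and pad id 1 is only an example value:

```python
from typing import List

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def mask_pad_labels(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy batch: two label sequences padded to length 5 with pad id 1 (example value)
batch = [[128, 17, 2, 1, 1], [128, 9, 31, 44, 2]]
masked = mask_pad_labels(batch, pad_token_id=1)
print(masked)  # [[128, 17, 2, -100, -100], [128, 9, 31, 44, 2]]
```

In `preprocess_function` this would amount to assigning `mask_pad_labels(labels["input_ids"], tokenizer.pad_token_id)` to `model_inputs["labels"]`.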

Usage

# inference script for Ukrainian verbalization model using M2M100
import os
import time
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def process_sentence(model, tokenizer, sentence: str, device: str = "cuda"):
    # Tokenize input
    inputs = tokenizer(
        sentence,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    ).to(device)
    
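    # Run inference (mixed precision is enabled only when running on a CUDA device)
    with torch.inference_mode(), torch.autocast(device_type="cuda", enabled=device.startswith("cuda")):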
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            forced_bos_token_id=tokenizer.get_lang_id("uk"),
            max_length=128,
            num_beams=1,
            do_sample=False,
            early_stopping=True,
            use_cache=True,
            repetition_penalty=1.0,
            length_penalty=1.0,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    # Decode output
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

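# Example sentences
sentences = [
    "Моя бабуся народилася 07.11.1919, у важкий післявоєнний час.",
    "Зустріч призначена на 15:30 12.05.2025 у конференц-залі №3.",
    "Телефонуйте нам за номером +380 (44) 123-45-67 або 0800 500 123.",
    "Температура повітря сьогодні становить +25°C, а тиск 750 мм.рт.ст.",
    "ТОВ «Мрія» було засновано 28/06/2022 з початковим капіталом 50 тис. грн."
]

if __name__ == "__main__":
    # Hub ID as given in this model card
    model_name = "skypro1111/m2m100-ukr-verbalization"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load tokenizer and model; the source language is Ukrainian
    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    tokenizer.src_lang = "uk"
    model = M2M100ForConditionalGeneration.from_pretrained(model_name).to(device)
    model.eval()

    # Verbalize each sentence and report the time taken
    for sentence in sentences:
        start = time.time()
        output = process_sentence(model, tokenizer, sentence, device)
        print(f"Input : {sentence}")
        print(f"Output: {output}")
        print(f"Time  : {time.time() - start:.3f} seconds\n")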

CTranslate2 Optimized Version

This model is also available in a CTranslate2-optimized version for faster inference:

skypro1111/m2m100-ukr-verbalization-ct2

The CTranslate2 version provides significant performance improvements:

Roughly 3-5x faster inference
Reduced memory usage with int8/float16 quantization
Optimized for both CPU and GPU deployment
Better batch processing capabilities
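
A minimal sketch of inference with the CTranslate2 version, following the general pattern the CTranslate2 documentation gives for M2M-100 models. It assumes the converted model has been downloaded to a local directory and that the tokenizer of the original skypro1111/m2m100-ukr-verbalization repository is compatible with it; `load_ct2` and `verbalize_ct2` are hypothetical helper names, not part of either library.

```python
from typing import List

def load_ct2(ct2_model_dir: str, tokenizer_name: str, device: str = "cpu"):
    """Load a CTranslate2 translator and a Hugging Face tokenizer.

    Imports are local so that verbalize_ct2 can be used without these packages.
    """
    import ctranslate2
    from transformers import AutoTokenizer

    translator = ctranslate2.Translator(ct2_model_dir, device=device)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    return translator, tokenizer

def verbalize_ct2(translator, tokenizer, sentences: List[str], lang: str = "uk") -> List[str]:
    """Verbalize a batch of sentences with a CTranslate2 M2M100 model."""
    tokenizer.src_lang = lang
    # CTranslate2 consumes token strings rather than token ids
    sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sentences]
    # The target language token is supplied as a forced decoding prefix
    lang_token = tokenizer.lang_code_to_token[lang]
    results = translator.translate_batch(
        sources,
        target_prefix=[[lang_token]] * len(sources),
        beam_size=1,
    )
    outputs = []
    for result in results:
        tokens = result.hypotheses[0][1:]  # drop the leading language token
        ids = tokenizer.convert_tokens_to_ids(tokens)
        outputs.append(tokenizer.decode(ids, skip_special_tokens=True))
    return outputs
```

With a local copy of the converted model, usage would look like `translator, tokenizer = load_ct2("path/to/m2m100-ukr-verbalization-ct2", "skypro1111/m2m100-ukr-verbalization")` followed by `verbalize_ct2(translator, tokenizer, sentences)`; the local path is a placeholder.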

Performance

The model demonstrates strong performance in verbalizing Ukrainian text, with particular strengths in handling:

Dates and times
Phone numbers
Measurements and units
Currency values
Numerical expressions

Benchmark (RTX 3090Ti, FP16)

1. Input : Моя бабуся народилася 07.11.1919, у важкий післявоєнний час.
   Output: Моя бабуся народилася сьомого листопада тисяча дев'ятсот дев'ятнадцятого року, у важкий післявоєнний час.
   Time  : 0.531 seconds

2. Input : Зустріч призначена на 15:30 12.05.2025 у конференц-залі №3.
   Output: Зустріч призначена на пʼятнадцяту тридцять 12.05.2025 у конференц-залі номер три.
   Time  : 0.434 seconds

3. Input : Телефонуйте нам за номером +380 (44) 123-45-67 або 0800 500 123.
   Output: Телефонуйте нам за номером плюс триста вісімдесят (сорок чотири) сто двадцять три, сорок пʼять, шістдесят сім або нуль вісімсот пʼятсот, сто двадцять три.
   Time  : 0.828 seconds

4. Input : Температура повітря сьогодні становить +25°C, а тиск 750 мм.рт.ст.
   Output: Температура повітря сьогодні становить плюс двадцять п'ять градусів Цельсія, а тиск сімсот п'ятдесят міліметрів р.т.ст.
   Time  : 0.586 seconds

5. Input : ТОВ «Мрія» було засновано 28/06/2022 з початковим капіталом 50 тис. грн.
   Output: Товариство з обмеженою відповідальністю «Мрія» було засновано двадцять восьмого червня дві тисячі двадцять другого року з початковим капіталом пʼятдесят тисяч гривень.
   Time  : 0.719 seconds

6. Input : Швидкість вітру 15 м/с, видимість 10 км, вологість 65%.
   Output: Швидкість вітру п'ятнадцять метрів за секунду, видимість десять кілометрів, вологість шістдесят п'ять відсотків.
   Time  : 0.612 seconds

7. Input : Потяг №743 Київ-Львів відправляється о 08:45 з платформи №2.
   Output: Потяг номер сімсот сорок три Київ-Львів відправляється о восьмій годині сорок пʼять хвилин з платформи номер два.
   Time  : 0.527 seconds

8. Input : Ціна на пальне зросла на 2,5 грн/л і становить 54,99 грн.
   Output: Ціна на пальне зросла на два з половиною гривні за літр і становить п'ятдесят чотири гривні дев'яносто дев'ять копійок.
   Time  : 0.634 seconds

9. Input : Площа квартири 75,5 м², висота стелі 2,75 м.
   Output: Площа квартири сімдесят п'ять цілих п'ять десятих квадратних метрів, висота стелі два цілих сімдесят п'ять сотих метрів.
   Time  : 0.582 seconds

10. Input : Відстань між містами становить 450 км або 280 миль.
    Output: Відстань між містами становить чотириста п'ятдесят кілометрів або двісті вісімдесят миль.
    Time  : 0.453 seconds

Total time: 5.90 seconds
Average time per sentence: 0.590 seconds
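
The script used to produce these timings is not shown in this card. As a rough sketch of how per-sentence timings, the total, and the average could be collected, the helper below times any callable; `benchmark` is a hypothetical name and `str.upper` stands in for the real verbalization function so the example runs without a model.

```python
import time
from typing import Callable, List, Tuple

def benchmark(verbalize: Callable[[str], str], sentences: List[str]) -> Tuple[List[float], float, float]:
    """Time `verbalize` on each sentence; return (per-sentence times, total, average)."""
    times = []
    for i, sentence in enumerate(sentences, start=1):
        start = time.perf_counter()
        output = verbalize(sentence)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"{i}. Input : {sentence}")
        print(f"   Output: {output}")
        print(f"   Time  : {elapsed:.3f} seconds\n")
    total = sum(times)
    average = total / len(times) if times else 0.0
    print(f"Total time: {total:.2f} seconds")
    print(f"Average time per sentence: {average:.3f} seconds")
    return times, total, average

# Stand-in callable so the sketch runs without a model
times, total, average = benchmark(str.upper, ["15 м/с", "10 км"])
```

On a GPU, the first call usually includes one-time initialization, so a warm-up call before timing is a common precaution.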

For faster inference, consider the CTranslate2-optimized version, skypro1111/m2m100-ukr-verbalization-ct2, which offers:

3.6x faster inference (0.164 seconds per sentence vs 0.590)
3.1x lower memory usage (~800MB vs ~2.5GB)
Same quality of verbalization
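
The ratios above follow directly from the quoted figures. A short check, assuming ~2.5GB is read as 2500MB:

```python
# Figures quoted in this card
base_time, ct2_time = 0.590, 0.164     # seconds per sentence
base_mem_mb, ct2_mem_mb = 2500, 800    # approximate memory; ~2.5GB taken as 2500MB (assumption)

speedup = base_time / ct2_time
mem_ratio = base_mem_mb / ct2_mem_mb
print(f"speedup: {speedup:.1f}x, memory ratio: {mem_ratio:.1f}x")
# speedup: 3.6x, memory ratio: 3.1x
```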

Examples

Input: "Моя бабуся народилася 07.11.1919, у важкий післявоєнний час."
Output: "Моя бабуся народилася сьомого листопада тисяча дев'ятсот дев'ятнадцятого року, у важкий післявоєнний час."

Input: "Телефонуйте нам за номером +380 (44) 123-45-67 або 0800 500 123."
Output: "Телефонуйте нам за номером плюс три вісім нуль, чотири чотири, один два три, сорок п'ять, шістдесят сім або нуль вісімсот п'ятсот сто двадцять три."

Limitations and Ethical Considerations

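The model may mishandle highly nuanced or domain-specific content, so its output is worth checking before it is passed to a TTS system. The benchmark above shows two such cases: in item 2 the date 12.05.2025 is left as digits, and in item 4 the abbreviation «мм.рт.ст.» is only partly expanded («міліметрів р.т.ст.»). Fairness and bias should also be considered when deploying the model in real-world applications.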
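
Since the model can leave some spans unexpanded (in benchmark item 2 the date 12.05.2025 remains as digits in the output), a simple post-check may help before sending text to a TTS engine. The sketch below flags any digit sequences left in the output; `find_unverbalized` is a hypothetical helper, and the regular expression is only a heuristic.

```python
import re
from typing import List

# Any digit run left in the output (optionally joined by . , : / -) suggests an unverbalized span
_DIGITS = re.compile(r"\d+(?:[.,:/-]\d+)*")

def find_unverbalized(text: str) -> List[str]:
    """Return digit-containing spans that remain in a verbalized sentence."""
    return _DIGITS.findall(text)

# Outputs copied from benchmark items 1 and 2 of this card
ok = "Моя бабуся народилася сьомого листопада тисяча дев'ятсот дев'ятнадцятого року, у важкий післявоєнний час."
partial = "Зустріч призначена на пʼятнадцяту тридцять 12.05.2025 у конференц-залі номер три."

print(find_unverbalized(ok))       # []
print(find_unverbalized(partial))  # ['12.05.2025']
```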

Citation

M2M100

@article{fan2020beyond,
  title={Beyond English-Centric Multilingual Machine Translation},
  author={Fan, Angela and Bhosale, Shruti and Schwenk, Holger and Ma, Zhiyi and El-Kishky, Ahmed and Goyal, Siddharth and Baines, Mandeep and Celebi, Onur and Wenzek, Guillaume and Chaudhary, Vishrav and Goyal, Naman and Birch, Tom and Liptchinsky, Vitaliy and Edunov, Sergey and Auli, Michael and Joulin, Armand},
  journal={arXiv preprint arXiv:2010.11125},
  year={2020}
}

Ubertext 2.0

@inproceedings{chaplynskyi-2023-introducing,
  title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
  author = "Chaplynskyi, Dmytro",
  booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.unlp-1.1",
  pages = "1--10",
}

License

This model is released under the MIT License, in line with the base M2M100 model.
