mbart-lt-summary-phase5

This model is a fine-tuned version of facebook/mbart-large-50 on a Lithuanian article dataset.

Model description

This model is an mBART-large-50 sequence-to-sequence transformer fine-tuned for abstractive summarization of Lithuanian texts. The base model (facebook/mbart-large-50) was adapted to the summarization task using a cascaded five-phase fine-tuning process. This multi-phase approach was designed to incrementally train the model despite limited GPU memory, by gradually unfreezing components and later applying low-rank adaptation. The resulting model can generate concise Lithuanian summaries of texts, retaining the main ideas in fluent Lithuanian.

Intended uses & limitations

Intended uses: This model is intended for generating abstractive summaries of Lithuanian articles and similar formal texts. Given a Lithuanian article (up to 1024 tokens), it produces a shorter text (up to 128 tokens) summarizing the content. It is useful for news aggregation, assisting readers in quickly understanding lengthy articles, or as a starting point for editors writing summaries. The model works best on input text that is formal, factual, and written in standard Lithuanian (e.g. news, reports, informational texts).

Limitations: If prompted with text from domains very different from news (e.g. fictional stories, casual conversation, or technical jargon) or text containing non-standard Lithuanian (slang, dialects, code-mixing), the quality of the summary may degrade. The model might omit important details or produce awkward phrasing when the input style deviates from what it saw in training. Additionally, as an abstractive summarizer, it may generate information not present in the source (a form of hallucination) if the input is unclear or the model overgeneralizes. Users should double-check critical facts in the summaries against the original text. The model is not designed for translation or for summarizing texts in languages other than Lithuanian.

Code usage example

  1. Summary generation from a single text
  from transformers import MBartForConditionalGeneration, MBartTokenizer

  model = MBartForConditionalGeneration.from_pretrained("Arnold001/mbart-lt-summary-phase5")
  tokenizer = MBartTokenizer.from_pretrained("Arnold001/mbart-lt-summary-phase5", src_lang="lt_LT", tgt_lang="lt_LT")

  text = "Čia yra pilnas straipsnio tekstas, iš kurio generuosime santrauką."

  # Increasing num_beams can improve summary quality, but it slows down generation.

  inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)
  summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

  print("Santrauka:", summary)
  2. Summary generation from a CSV file (one or more articles)
  import pandas as pd

  # Reuses the model and tokenizer loaded in the previous example.
  # The CSV file must contain a column named "text" with the article texts.
  df = pd.read_csv("straipsniai.csv")

  summaries = []
  for text in df["text"]:
      inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
      summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)
      summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
      summaries.append(summary)

  df["summary"] = summaries
  df.to_csv("santraukomis.csv", index=False)

Training and evaluation data

The model was fine-tuned on a custom dataset of Lithuanian articles paired with AI-generated summaries. We collected 2,927 Lithuanian articles covering a variety of topics. Each article was paired with two reference summaries created by a GPT-4 variant (GPT-4o-mini), providing high-quality abstractive summaries for training. This yielded 5,854 article-summary pairs in total.

For fine-tuning, the data was split into training and validation sets. 4,683 pairs were used for training and 585 pairs for validation. Each pair consists of a news article as input and one reference summary as the target. (Each article’s two summaries were treated as separate training examples.) The validation set was used to monitor performance (via ROUGE, METEOR, etc.) and select the best model. No human-written summaries were used – the references are entirely AI-generated, so the model learns to mimic the GPT-4o-mini summarization style. The content covers contemporary news; thus, the model has seen a wide range of names, places, and events in the fine-tuning data. It may not generalize to domains not represented in the news data (e.g. poetry).
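As a rough, hedged illustration of how such a train/validation split can be produced with the datasets library (the file name and column layout here are assumptions, not the project's actual files):

  from datasets import load_dataset

  # Hypothetical CSV of article-summary pairs; the actual preprocessing code is not shown here.
  pairs = load_dataset("csv", data_files="article_summary_pairs.csv")["train"]

  # Roughly a 9:1 train/validation split, approximating the 4,683 / 585 counts above.
  split = pairs.train_test_split(test_size=0.11, seed=42)
  train_ds, val_ds = split["train"], split["test"]

  print(len(train_ds), "training pairs,", len(val_ds), "validation pairs")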

Training procedure

Cascaded fine-tuning approach: The fine-tuning was done in five consecutive phases, to accommodate limited computational resources and gradually adapt the mBART model to the summarization task:

  1. Phase 1 – Decoder-only fine-tuning: We froze the encoder and trained only the decoder layers on the summarization task. This adapts the model’s generation head to produce fluent summaries in Lithuanian (since the decoder learns to map the internal representation to well-formed output). Training the decoder alone is less memory-intensive and ensures the model starts generating plausible Lithuanian text relevant to the input content (see the freezing sketch after this list).

  2. Phase 2 – Encoder-only fine-tuning: Next, we froze the decoder and trained only the encoder. This helps the model learn better representations of Lithuanian input texts. By updating the encoder, the model improves its understanding of the source article (e.g. which sentences are important), while keeping the already-adapted decoder fixed. This phase focused on encoding the content of news articles effectively in the model’s latent space.

  3. Phase 3 – Full encoder-decoder fine-tuning: We then unfroze the entire model and fine-tuned all weights end-to-end. This combined training further improved performance, allowing encoder and decoder to jointly optimize for the summarization objective. Phase 3 was run for a limited number of epochs to avoid overfitting, given the increased number of trainable parameters. By the end of this phase, the model had substantially learned to produce decent summaries. However, full fine-tuning of a 610M parameter model is VRAM-intensive.

  4. Phase 4 – LoRA fine-tuning with 8-bit quantization: To push performance further without exceeding memory limits, we employed Low-Rank Adaptation (LoRA) for Phase 4. We restored the best Phase 3 model weights, then injected trainable LoRA adapters into the model’s layers (keeping the original weights frozen). We also applied 8-bit quantization (via the bitsandbytes library) to the model weights to drastically reduce memory usage. This allowed us to continue fine-tuning using LoRA’s small additional weight matrices (rank decomposition) with a low memory footprint. In Phase 4, only the LoRA adapter weights were updated (with the base model in 8-bit mode), further adjusting the model to the data. This phase improves the model’s ability to capture finer nuances without modifying the full weight matrix.

  5. Phase 5 – Final LoRA refinement (longer input/output): In the final phase, we fine-tuned again with LoRA (on top of the Phase 4 model) but with extended sequence lengths. We increased the maximum input length to 1024 tokens and output length to 128 tokens (where earlier phases may have used shorter limits). This helps the model handle longer articles and generate longer, more detailed summaries. Phase 5 served as a refinement step with the model at full capacity (encoder and decoder unfrozen via LoRA) and ensured it learns to utilize the full 1024-token context. We fine-tuned for a few epochs until validation metrics stopped improving. The Phase 5 model was selected as the final model for deployment, as it yielded the best performance on the validation set.
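As a minimal sketch of the phase-wise freezing pattern used in Phases 1–3 (an illustration of the general approach, not the project's exact training code):

  from transformers import MBartForConditionalGeneration

  model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

  def set_trainable(module, trainable):
      # Enable or disable gradient updates for every parameter in a module.
      for param in module.parameters():
          param.requires_grad = trainable

  # Phase 1: decoder-only fine-tuning (encoder frozen).
  set_trainable(model.get_encoder(), False)
  set_trainable(model.get_decoder(), True)

  # Phase 2: encoder-only fine-tuning (decoder frozen).
  # set_trainable(model.get_encoder(), True)
  # set_trainable(model.get_decoder(), False)

  # Phase 3: unfreeze everything for full end-to-end fine-tuning.
  # set_trainable(model, True)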

Throughout these phases, techniques like learning rate scheduling and gradient accumulation were used to stabilize training. Early stopping on the validation ROUGE was used to prevent overfitting. By the end of Phase 5, the model had seen the data multiple times under different training regimes, each adding to its summarization capability.

Training hyperparameters

Key training hyperparameters and settings across the fine-tuning phases:

Batch size: The effective batch size was 16 sequences. Due to memory limits, we used a micro-batch of 1 per step and accumulated gradients over 16 steps before each optimizer update, which is equivalent to a batch size of 16.

Learning rate: A learning rate of 0.0002–0.0005 was used across Phases 1–5. (We found 0.0002 to work well for most phases; a slightly lower rate was considered when all weights were unfrozen, to avoid divergence.)

Optimizer: AdamW (Adam with weight decay) in 8-bit mode via bitsandbytes, which reduces the memory usage of the optimizer states. The beta parameters were (0.9, 0.999) and epsilon was 1e-8 (the default values).

Low-Rank Adaptation: For Phases 4–5, we used LoRA (via Hugging Face PEFT) with a low rank (r=128) for the adapters. Only the LoRA adapter weights were trainable in those phases, while the base model’s weights remained frozen (but kept in 8-bit precision).

Quantized training: In Phases 4–5, the model’s base weights were loaded in 8-bit precision (INT8) to save memory, using the bitsandbytes library. This allowed effective fine-tuning of a 610M parameter model on a single GPU with limited VRAM.
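A minimal sketch of the Phase 4–5 setup described in the two points above (8-bit base weights plus rank-128 LoRA adapters); the target modules, lora_alpha, and dropout values are assumptions, and the base checkpoint path stands in for the best Phase 3 weights:

  from transformers import MBartForConditionalGeneration, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  # Load the base weights in 8-bit to keep VRAM usage low (placeholder path;
  # in practice the best Phase 3 checkpoint would be loaded here).
  model = MBartForConditionalGeneration.from_pretrained(
      "facebook/mbart-large-50",
      quantization_config=BitsAndBytesConfig(load_in_8bit=True),
  )
  model = prepare_model_for_kbit_training(model)

  # Rank-128 LoRA adapters; only these weights are updated during training.
  lora_config = LoraConfig(
      r=128,
      lora_alpha=256,    # assumption
      lora_dropout=0.05, # assumption
      target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # assumption: attention projections
      task_type="SEQ_2_SEQ_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()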

Precision: We utilized bfloat16 (bf16) mixed precision for training to further reduce memory and speed up computation. This means model activations and gradients were in bf16 precision (except where maintaining higher precision was necessary for stability).

Epochs: Each phase was trained for a small number of epochs: e.g. Phase 1 ran ~2 epochs, Phase 2 ~2 epochs, Phase 3 ~3 epochs, Phase 4 ~3 epochs, Phase 5 ~3 epochs. We monitored validation loss/ROUGE and saved the best model checkpoint from each phase.
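Pulling these settings together, a hedged sketch of how they might map onto Seq2SeqTrainingArguments (the output path, evaluation cadence, and best-model metric are assumptions; only the numeric values come from the points above):

  from transformers import Seq2SeqTrainingArguments

  training_args = Seq2SeqTrainingArguments(
      output_dir="mbart-lt-summary",       # assumed path
      per_device_train_batch_size=1,       # micro-batch of 1
      gradient_accumulation_steps=16,      # effective batch size of 16
      learning_rate=2e-4,
      num_train_epochs=3,
      optim="adamw_bnb_8bit",              # 8-bit AdamW via bitsandbytes
      bf16=True,                           # bfloat16 mixed precision
      eval_strategy="epoch",
      save_strategy="epoch",
      load_best_model_at_end=True,
      metric_for_best_model="rougeL",      # assumes a compute_metrics that reports ROUGE-L
      predict_with_generate=True,
  )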

Training results

After Phase 5 fine-tuning, the model achieved its best validation performance. On the validation set (585 article-summary pairs), the model’s summaries were evaluated against the reference summaries using several metrics:

ROUGE-1 (F1): 36.55 – The model recovers about 36.55% of the unigrams from the reference summaries, indicating moderate overlap of key terms.

ROUGE-2 (F1): 16.8 – It recovers about 16.80% of the bigrams, showing modest bi-gram overlap with the reference texts.

ROUGE-L (F1): 27.52 – The model's generated summary covers approximately 27.52% of the longest common subsequences, reflecting moderate sequential overlap.

METEOR: 31.16 – METEOR, accounting for synonymy and recall, is around 31.16%, showing decent semantic alignment with the reference summaries.

BERTScore (F1): 90.27 – The BERTScore (F1) is ~0.9027, indicating a high semantic similarity between the model summary and the reference.

BLEU: 10.96 – The BLEU score, which is sensitive to exact word order, stands at about 10.96%, showing basic alignment with the wording of the reference summaries. (This metric is stricter, since wording variations are penalized.)

These evaluation results highlight that the model effectively captures the main semantic content of the articles, with especially high semantic similarity (BERTScore). The moderate scores in ROUGE metrics indicate the summaries often paraphrase or rephrase content rather than directly copying reference wording, characteristic of abstractive summarization. Despite modest BLEU and ROUGE values, the high BERTScore underscores the strength of the model in conveying the core meaning and intent of the original texts.

Overall, the Phase 5 training successfully enhanced the model’s summarization capabilities, allowing effective handling of longer inputs (1024 tokens) and outputs (128 tokens), achieving the optimal balance between fluency, conciseness, and semantic accuracy.
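For reference, a minimal sketch of how metrics of this kind can be computed with the evaluate library (the predictions and references lists here are placeholders for the model outputs and the GPT-4o-mini reference summaries):

  import evaluate

  predictions = ["Modelio sugeneruota santrauka..."]  # placeholder generated summaries
  references = ["Etaloninė santrauka..."]             # placeholder reference summaries

  rouge = evaluate.load("rouge")
  meteor = evaluate.load("meteor")
  bertscore = evaluate.load("bertscore")
  bleu = evaluate.load("bleu")

  print(rouge.compute(predictions=predictions, references=references))
  print(meteor.compute(predictions=predictions, references=references))
  print(bertscore.compute(predictions=predictions, references=references, lang="lt"))
  print(bleu.compute(predictions=predictions, references=[[r] for r in references]))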

Framework versions

The model was developed and fine-tuned with the following software libraries and versions:

Hugging Face Transformers: 4.51.3

Hugging Face PEFT (Parameter-Efficient Fine-Tuning): 0.15.2

bitsandbytes: 0.41.1 (used for 8-bit optimizers and model quantization)

Datasets: 3.5.0

Evaluate: 0.4.0

PyTorch: 2.6.0+cu124

These correspond to the environment used during training. The code leverages Hugging Face’s Trainer API with PEFT integration for LoRA, and bitsandbytes for the 8-bit AdamW optimizer. The model was developed and tested with these versions; newer releases should also work, but have not been extensively tested.

GitHub Repository

https://github.com/Asasai001/mBART-fine-tuning
