Model Description

This model is a continued-pretraining (CPT) adaptation of Mistral-7B v0.3, extended to the Malagasy language.

Since the original Mistral-7B does not support Malagasy, this model demonstrates how continued pretraining can extend large language models to low-resource languages.

The resulting model improves fluency and coherence in Malagasy and provides a strong foundation for downstream Malagasy NLP tasks.


Intended Uses & Limitations

Use cases:

  • Generating text in Malagasy
  • Research on low-resource language adaptation
  • Data augmentation for Malagasy NLP tasks (see the sketch after this list)
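
As an illustration of the data-augmentation use case, the sketch below generates synthetic Malagasy articles from seed titles, reusing the prompt template from the inference example further down. The seed titles and sampling parameters are illustrative assumptions, not a prescribed recipe.

code:

  from unsloth import FastLanguageModel

  # Load the adapter model as in the inference example below
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name="Lo-Renz-O/Mistral-7B-CPT-Malagasy-v0.1-LoRA",
      max_seq_length=1024,
      load_in_4bit=True,
  )
  FastLanguageModel.for_inference(model)

  prompt = "Lahatsoratra\n### Lohateny: {}\n\n### Lahatsoratra:\n"

  # Hypothetical seed titles; in practice, draw these from your task data
  titles = ["Antananarivo", "Fambolena", "Fahasalamana"]
  synthetic_texts = []
  for title in titles:
      inputs = tokenizer(prompt.format(title), return_tensors="pt").to("cuda")
      output = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                              temperature=0.7, top_p=0.95)
      synthetic_texts.append(tokenizer.decode(output[0], skip_special_tokens=True))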

Limitations:

  • Not instruction-tuned: responses may not always follow task instructions.
  • May hallucinate or generate factually inaccurate information.

Training Details

  • Base Model: Mistral-7B v0.3
  • Method: Continued Pretraining with LoRA adapters
  • Hardware: 1 × Tesla T4 (14.7 GB VRAM)
  • Number of Epochs: 1
  • Trainable parameters: ~604M (7.7% of 7.85B total)
  • Approximate Training Time: ~44 hours
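
The exact training script is not published; the sketch below shows what a LoRA continued-pretraining run with Unsloth and TRL could look like. The corpus path, hyperparameters, and LoRA targets are illustrative assumptions; rank 128 with trainable embed_tokens and lm_head is one configuration consistent with the ~604M trainable parameters stated above.

code:

  from unsloth import FastLanguageModel
  from trl import SFTTrainer
  from transformers import TrainingArguments
  from datasets import load_dataset

  # Load the base model in 4-bit so it fits on a single T4
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name="unsloth/mistral-7b-v0.3",
      max_seq_length=1024,
      dtype=None,
      load_in_4bit=True,
  )

  # Attach LoRA adapters; for continued pretraining, the embeddings and
  # LM head are often trained as well so the model can adapt to Malagasy
  model = FastLanguageModel.get_peft_model(
      model,
      r=128,  # illustrative rank
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
      lora_alpha=32,
      use_gradient_checkpointing="unsloth",
  )

  # Hypothetical raw-text Malagasy corpus; replace with the actual data
  dataset = load_dataset("text", data_files="malagasy_corpus.txt", split="train")

  trainer = SFTTrainer(
      model=model,
      tokenizer=tokenizer,
      train_dataset=dataset,
      dataset_text_field="text",  # recent TRL versions take this via SFTConfig
      max_seq_length=1024,
      args=TrainingArguments(
          per_device_train_batch_size=2,
          gradient_accumulation_steps=8,
          num_train_epochs=1,
          learning_rate=5e-5,
          fp16=True,
          output_dir="outputs",
      ),
  )
  trainer.train()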

Inference Example Usage

code:

  # Import required libraries for model loading and text generation
  from unsloth import FastLanguageModel
  from transformers import TextStreamer
  import torch

  # Load the pretrained Malagasy LoRA model and tokenizer
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name="Lo-Renz-O/Mistral-7B-CPT-Malagasy-v0.1-LoRA",
      max_seq_length=1024,
      dtype=None,
      load_in_4bit=True,
  )

  # Enable optimized inference
  FastLanguageModel.for_inference(model)

  # Define the prompt template ("Lahatsoratra" = "Article", "Lohateny" = "Title")
  prompt = """Lahatsoratra
  ### Lohateny: {}

  ### Lahatsoratra:
  {}"""

  # Tokenize the prompt and move tensors to GPU
  inputs = tokenizer(
      [prompt.format("Madagasikara", "")],
      return_tensors="pt",
  ).to("cuda")

  # Initialize a streamer to display generated tokens in real-time
  text_streamer = TextStreamer(tokenizer, skip_special_tokens=True)

  # Generate text using the model with specific generation parameters
  outputs = model.generate(
      **inputs,
      max_new_tokens=512,
      temperature=0.7,
      top_p=0.95,
      repetition_penalty=1.0,
      do_sample=True,
      streamer=text_streamer,
  )
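
Because a streamer is attached, the completion is printed token by token as it is produced; model.generate also returns the full token tensor, which can be decoded afterwards:

  # Decode the full sequence (prompt + completion) into a plain string
  generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(generated_text)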

output:

  Lahatsoratra
  ### Lohateny: Madagasikara
  ### Lahatsoratra: I Madagasikara na Repoblikan' i Madagasikara dia firenena any amin' ny faritra atsimon' i Afrika,
  voafaritr' i Maorisy ao avaratra-andrefana. Izy no lemaka fahaefatra indrindra eto an-tany (1 244 350 km²). Anisan’ ny
  nosy lehibe indrindra eran-tany izy sady malaza amin’ny fisian’ny biby sy zavamaniry mampiavaka azy manokana ary manambatran’ny
  ala trôpikaly. Firenen’ ny mponina maromaro isaky ny velaran-taniny ity firenena ity. Mizara roa lehibe ny vahoakan’ i Madagasikara
  ka ny iray Malagasy avokoa (Malaio-Pôlineziana), fa ny faharoa Banto avy any amin’ ny morontsiraka atsinanan' i Afrika.
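
If Unsloth is not available, the adapters can presumably also be loaded with standard PEFT tooling, since this repository is a LoRA adapter on top of Mistral-7B v0.3 (an untested sketch, assuming standard PEFT adapter files in the repo):

code:

  from peft import AutoPeftModelForCausalLM
  from transformers import AutoTokenizer
  import torch

  # Load the base model and apply the LoRA adapters in one call
  model = AutoPeftModelForCausalLM.from_pretrained(
      "Lo-Renz-O/Mistral-7B-CPT-Malagasy-v0.1-LoRA",
      torch_dtype=torch.float16,
      device_map="auto",
  )
  tokenizer = AutoTokenizer.from_pretrained(
      "Lo-Renz-O/Mistral-7B-CPT-Malagasy-v0.1-LoRA"
  )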

This Mistral model was trained 2× faster with Unsloth and Hugging Face's TRL library.
