Model Card for Gemma-2-2B-Aze

Gemma-2-2B-Aze is a causal language model fine-tuned specifically for Azerbaijani text generation. Built on the robust unsloth/gemma-2-2b base model, this variant has been adapted using parameter-efficient fine-tuning (PEFT) techniques (LoRA) to effectively capture the nuances of the Azerbaijani language.

For fine-tuning, a curated subset of 100K high-quality Azerbaijani text samples was extracted from the expansive LocalDoc/AzTC dataset, which contains 51 million rows. This selective approach ensures the model learns from diverse, relevant language examples while keeping training efficient.

Model Description

Note: This model is not instruction-tuned and cannot hold conversations yet.

  • Developed by: Rustam Shiriyev
  • Model type: Causal Transformer-based Language Model (Gemma 2-2B) fine-tuned for Azerbaijani text generation. This model is built on the unsloth/gemma-2-2b base model and uses PEFT (LoRA) techniques for efficient fine-tuning.
  • Language(s) (NLP): Azerbaijani (primary); inherits multilingual capabilities from the base model.
  • License: Google Gemma Terms of Use – please refer to the original license for permitted uses and restrictions.
  • Finetuned from model: unsloth/gemma-2-2b

Uses

Autocompletion for Azerbaijani text.

Out-of-Scope Use

  • Instruction-following tasks: The model is a plain completion model and is not fine-tuned for chat, Q&A, or structured outputs.
  • Multi-turn dialogues: Lacks conversational templates (e.g., <start_of_turn> used in Gemma-IT).
  • Non-Azerbaijani text: Performance degrades for languages outside its fine-tuning data.

Bias, Risks, and Limitations

Azerbaijani datasets may reflect cultural or societal biases; validate outputs before deployment.

How to Get Started with the Model

from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

login(token="") # Replace with your token

base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2-2b",
    device_map="auto",
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Rustamshry/AzeGemma2-2B")

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2-2b")

input_text = "Azərbaycan haqqında nə bilirsən?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
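
Optionally, the LoRA adapter can be merged into the base weights for slightly faster inference. This is a minimal sketch using the standard PEFT merge_and_unload API; the output directory name is only an example.

# Optional: merge the LoRA adapter into the base weights for faster inference.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-2-2b-aze-merged")  # example path
tokenizer.save_pretrained("gemma-2-2b-aze-merged")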

Training Details

Training Data

The model was fine-tuned on a subset of 100,000 Azerbaijani texts from the LocaDoc/AzTC dataset, which contains a total of 51 million entries. This subset was selected to represent a diverse range of topics and writing styles in the Azerbaijani language. Link: https://huggingface.co/datasets/LocalDoc/AzTC
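
The exact selection and curation procedure is not published here; the snippet below is only a sketch of how a 100K-example subset could be drawn from LocalDoc/AzTC with the datasets library. The text column name is an assumption – check the dataset card for the actual field name.

from datasets import load_dataset

# Stream the 51M-row corpus instead of downloading it fully, then keep 100K examples.
stream = load_dataset("LocalDoc/AzTC", split="train", streaming=True)
subset = stream.take(100_000)
texts = [example["text"] for example in subset]  # "text" is an assumed column name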

Training Procedure

  • Preprocessing: Standard text preprocessing was applied, including tokenization with the unsloth/gemma-2-2b tokenizer.
  • Epochs: 3
  • Batch size: 8
  • Learning rate: 3e-5
  • Optimizer: AdamW
  • Training duration: Approximately 6 hours and 23 minutes
  • Hardware: 2 x NVIDIA Tesla T4 GPUs via Kaggle Notebook
  • Training regime: fp16 mixed precision
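
The full training script is not included in this card. The sketch below shows one way to reproduce a comparable setup with transformers and peft using the hyperparameters listed above; the LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values recorded here, and tokenized_subset stands for the tokenized 100K-sample dataset described under Training Data.

from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-2-2b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2-2b")

# LoRA settings below are illustrative defaults, not the values used for this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Hyperparameters from the list above: 3 epochs, batch size 8, lr 3e-5, AdamW, fp16.
args = TrainingArguments(
    output_dir="gemma-2-2b-aze",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    optim="adamw_torch",
    fp16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_subset,  # tokenized 100K-sample subset (see Training Data)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()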

Speeds, Sizes, Times

  • Total training steps: 9429
  • Total training time: Approximately 6 hours and 23 minutes
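
This works out to roughly 2.4 seconds per training step on average (about 22,980 seconds / 9,429 steps).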

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 2 x NVIDIA Tesla T4 GPUs
  • Hours used: Approximately 7
  • Cloud Provider: Kaggle

Model Card Authors

Rustam Shiriyev

Model Card Contact

[More Information Needed]

Framework versions

  • PEFT 0.14.0