Model Card for Gemma-2-2B-Aze
Gemma-2-2B-Aze is a causal language model fine-tuned specifically for Azerbaijani text generation. Built on the robust unsloth/gemma-2-2b base model, this variant has been adapted using parameter-efficient fine-tuning (PEFT) techniques (LoRA) to effectively capture the nuances of the Azerbaijani language.
For fine-tuning, a curated subset of 100K high-quality Azerbaijani text samples was extracted from the LocalDoc/AzTC dataset, which contains 51 million rows. This selective approach ensured that the model learns from diverse and relevant language examples while keeping training efficient.
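The exact LoRA configuration (rank, alpha, dropout, target modules) is not reported in this card. The snippet below is a minimal, hypothetical sketch of how such an adapter is typically attached to the base model with the peft library; every specific value shown is an assumption, not the setting actually used for this model.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model named in this card.
base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-2-2b", device_map="auto")

# Hypothetical LoRA settings: rank, alpha, dropout, and target modules are NOT
# reported in this card and are shown only as common choices for Gemma-2.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the small adapter matrices are trained.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()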
Model Description
Note: This model does not yet support conversational (chat-style) use.
- Developed by: Rustam Shiriyev
- Model type: Causal Transformer-based Language Model (Gemma 2-2B) fine-tuned for Azerbaijani text generation. This model is built on the unsloth/gemma-2-2b base model and uses PEFT (LoRA) techniques for efficient fine-tuning.
- Language(s) (NLP): Azerbaijani (primary); inherits multilingual capabilities from the base model.
- License: Gemma Terms of Use (inherited from the base model; refer to the original license for permitted uses and restrictions).
- Finetuned from model: unsloth/gemma-2-2b
Uses
Autocompletion for Azerbaijani text.
Out-of-Scope Use
- Instruction-following tasks: This base model is not fine-tuned for chat, Q&A, or structured outputs.
- Multi-turn dialogues: Lacks conversational templates (e.g., <start_of_turn> used in Gemma-IT).
- Non-Azerbaijani text: Performance degrades for languages outside its fine-tuning data.
Bias, Risks, and Limitations
Azerbaijani datasets may reflect cultural or societal biases; validate outputs before deployment.
How to Get Started with the Model
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

login(token="")  # Replace with your Hugging Face access token

# Load the base model and attach the LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2-2b",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Rustamshry/AzeGemma2-2B")
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2-2b")

# "Azərbaycan haqqında nə bilirsən?" = "What do you know about Azerbaijan?"
input_text = "Azərbaycan haqqında nə bilirsən?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
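For deployment, the adapter can optionally be merged into the base weights with peft's standard merge_and_unload utility, so that inference no longer needs the PeftModel wrapper. The output path below is purely illustrative.

# Merge the LoRA adapter into the base model weights (standard peft feature).
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-2-2b-aze-merged")  # hypothetical local path
tokenizer.save_pretrained("gemma-2-2b-aze-merged")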
Training Details
Training Data
The model was fine-tuned on a subset of 100,000 Azerbaijani texts from the LocalDoc/AzTC dataset, which contains a total of 51 million entries. This subset was selected to represent a diverse range of topics and writing styles in the Azerbaijani language. Link: https://huggingface.co/datasets/LocalDoc/AzTC
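The card does not document how the 100,000 rows were chosen from the 51-million-row dataset. A seeded random subset, as sketched below with the datasets library, is one straightforward way to draw such a sample; the actual selection procedure may have differed, and the text column name depends on the dataset schema.

from datasets import load_dataset

# Draw a reproducible 100K-row sample from LocalDoc/AzTC.
# Note: loading the full 51M-row split downloads the whole dataset;
# streaming or sharded loading may be preferable in practice.
ds = load_dataset("LocalDoc/AzTC", split="train")
subset = ds.shuffle(seed=42).select(range(100_000))
print(subset)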
Training Procedure
- Preprocessing: Standard text preprocessing was applied, including tokenization with the unsloth/gemma-2-2b tokenizer.
- Epochs: 3
- Batch size: 8
- Learning rate: 3e-5
- Optimizer: AdamW
- Training duration: Approximately 6 hours and 23 minutes
- Hardware: 2 x NVIDIA Tesla T4 GPUs via Kaggle Notebook
- Training regime: fp16 mixed precision
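The hyperparameters above map onto transformers TrainingArguments roughly as sketched below. Whether the batch size of 8 is per device or the effective global batch across the two T4s, and whether gradient accumulation was used, is not stated; the output directory and logging settings are illustrative.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gemma-2-2b-aze",     # illustrative output directory
    num_train_epochs=3,              # reported: 3 epochs
    per_device_train_batch_size=8,   # reported batch size; per-device vs. global is an assumption
    learning_rate=3e-5,              # reported learning rate
    optim="adamw_torch",             # reported optimizer: AdamW
    fp16=True,                       # reported: fp16 mixed precision
    logging_steps=50,                # illustrative
)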
Speeds, Sizes, Times
- Total training steps: 9429
- Total training time: Approximately 6 hours and 23 minutes
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 2 x NVIDIA Tesla T4
- Hours used: ~7
- Cloud Provider: Kaggle
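As a rough back-of-envelope estimate in the spirit of the calculator above: two T4s at roughly 70 W TDP each, running for about 7 hours, draw on the order of 1 kWh, which at an illustrative grid intensity of 0.4 kg CO2eq/kWh corresponds to well under half a kilogram of CO2eq. The TDP figure and grid intensity are assumptions, not measurements from this training run.

# Rough estimate; GPU TDP and grid carbon intensity are assumed values.
gpu_count, tdp_kw, hours = 2, 0.070, 7
energy_kwh = gpu_count * tdp_kw * hours   # ~0.98 kWh
carbon_kg = energy_kwh * 0.4              # ~0.39 kg CO2eq at 0.4 kg/kWh
print(f"~{energy_kwh:.2f} kWh, ~{carbon_kg:.2f} kg CO2eq")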
Model Card Authors
Rustam Shiriyev
Model Card Contact
[More Information Needed]
Framework versions
- PEFT 0.14.0