Model Card for Gemma-2-2B-Aze
Gemma-2-2B-Aze is a causal language model fine-tuned specifically for Azerbaijani text generation. Built on the robust unsloth/gemma-2-2b base model, this variant has been adapted using parameter-efficient fine-tuning (PEFT) techniques (LoRA) to effectively capture the nuances of the Azerbaijani language.
For fine-tuning, a curated subset of 100K high-quality Azerbaijani text samples was extracted from the LocalDoc/AzTC dataset, which contains 51 million rows. This selective approach ensured that the model learns from diverse and relevant language examples while keeping training efficient.
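The exact LoRA configuration (rank, alpha, dropout, target modules) is not reported in this card. The snippet below is a minimal, hypothetical sketch of how such an adapter is typically attached to the base model with the peft library; every specific value shown is an assumption, not the setting actually used for this model.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model named in this card.
base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-2-2b", device_map="auto")

# Hypothetical LoRA settings: rank, alpha, dropout, and target modules are NOT
# reported in this card and are shown only as common choices for Gemma-2.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the small adapter matrices are trained.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()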
Model Description
Note: This model does not yet support conversational (chat-style) use.
- Developed by: Rustam Shiriyev
- Model type: Causal Transformer-based Language Model (Gemma 2-2B) fine-tuned for Azerbaijani text generation. This model is built on the unsloth/gemma-2-2b base model and uses PEFT (LoRA) techniques for efficient fine-tuning.
- Language(s) (NLP): Azerbaijani (primary); inherits multilingual capabilities from the base model.
- License: Gemma Terms of Use (inherited from the base model; refer to the original license for permitted uses and restrictions).
- Finetuned from model: unsloth/gemma-2-2b
Uses
Autocompletion for Azerbaijani text.
Out-of-Scope Use
- Instruction-following tasks: This base model is not fine-tuned for chat, Q&A, or structured outputs.
- Multi-turn dialogues: Lacks conversational templates (e.g., <start_of_turn> used in Gemma-IT).
- Non-Azerbaijani text: Performance degrades for languages outside its fine-tuning data.
Bias, Risks, and Limitations
Azerbaijani datasets may reflect cultural or societal biases; validate outputs before deployment.
How to Get Started with the Model
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

login(token="")  # Replace with your Hugging Face access token

# Load the base model and attach the LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2-2b",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Rustamshry/AzeGemma2-2B")
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2-2b")

# "Azərbaycan haqqında nə bilirsən?" = "What do you know about Azerbaijan?"
input_text = "Azərbaycan haqqında nə bilirsən?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
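For deployment, the adapter can optionally be merged into the base weights with peft's standard merge_and_unload utility, so that inference no longer needs the PeftModel wrapper. The output path below is purely illustrative.

# Merge the LoRA adapter into the base model weights (standard peft feature).
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-2-2b-aze-merged")  # hypothetical local path
tokenizer.save_pretrained("gemma-2-2b-aze-merged")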
Training Details
Training Data
The model was fine-tuned on a subset of 100,000 Azerbaijani texts from the LocalDoc/AzTC dataset, which contains a total of 51 million entries. This subset was selected to represent a diverse range of topics and writing styles in the Azerbaijani language. Link: https://huggingface.co/datasets/LocalDoc/AzTC
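The card does not document how the 100,000 rows were chosen from the 51-million-row dataset. A seeded random subset, as sketched below with the datasets library, is one straightforward way to draw such a sample; the actual selection procedure may have differed, and the text column name depends on the dataset schema.

from datasets import load_dataset

# Draw a reproducible 100K-row sample from LocalDoc/AzTC.
# Note: loading the full 51M-row split downloads the whole dataset;
# streaming or sharded loading may be preferable in practice.
ds = load_dataset("LocalDoc/AzTC", split="train")
subset = ds.shuffle(seed=42).select(range(100_000))
print(subset)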
Training Procedure
- Preprocessing: Standard text preprocessing was applied, including tokenization with the unsloth/gemma-2-2b tokenizer.
- Epochs: 3
- Batch size: 8
- Learning rate: 3e-5
- Optimizer: AdamW
- Training duration: Approximately 6 hours and 23 minutes
- Hardware: 2 x NVIDIA Tesla T4 GPUs via Kaggle Notebook
- Training regime: fp16 mixed precision
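The hyperparameters above map onto transformers TrainingArguments roughly as sketched below. Whether the batch size of 8 is per device or the effective global batch across the two T4s, and whether gradient accumulation was used, is not stated; the output directory and logging settings are illustrative.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gemma-2-2b-aze",     # illustrative output directory
    num_train_epochs=3,              # reported: 3 epochs
    per_device_train_batch_size=8,   # reported batch size; per-device vs. global is an assumption
    learning_rate=3e-5,              # reported learning rate
    optim="adamw_torch",             # reported optimizer: AdamW
    fp16=True,                       # reported: fp16 mixed precision
    logging_steps=50,                # illustrative
)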
Speeds, Sizes, Times
- Total training steps: 9429
- Total training time: Approximately 6 hours and 23 minutes
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 2 x NVIDIA Tesla T4
- Hours used: ~7
- Cloud Provider: Kaggle
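As a rough back-of-envelope estimate in the spirit of the calculator above: two T4s at roughly 70 W TDP each, running for about 7 hours, draw on the order of 1 kWh, which at an illustrative grid intensity of 0.4 kg CO2eq/kWh corresponds to well under half a kilogram of CO2eq. The TDP figure and grid intensity are assumptions, not measurements from this training run.

# Rough estimate; GPU TDP and grid carbon intensity are assumed values.
gpu_count, tdp_kw, hours = 2, 0.070, 7
energy_kwh = gpu_count * tdp_kw * hours   # ~0.98 kWh
carbon_kg = energy_kwh * 0.4              # ~0.39 kg CO2eq at 0.4 kg/kWh
print(f"~{energy_kwh:.2f} kWh, ~{carbon_kg:.2f} kg CO2eq")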
Model Card Authors
Rustam Shiriyev
Model Card Contact
[More Information Needed]
Framework versions
- PEFT 0.14.0