🎓 CengBot - Çukurova University Computer Engineering Assistant

Author: naholav
Date: August 2025

A bilingual (Turkish/English) AI assistant for students of the Çukurova University Computer Engineering Department, built by fine-tuning LLaMA 3.2 3B with LoRA.

📋 Model Details

Developed by: naholav
Model type: Causal Language Model (LLaMA 3.2 3B + LoRA)
Language(s): Turkish, English
License: Apache-2.0
Finetuned from: meta-llama/Llama-3.2-3B
Library: PEFT 0.15.2
Dataset: Naholav/cukurova_university_chatbot

🤖 Advanced Inference Script

We provide a carefully optimized standalone inference script (cengbot_inference.py) that incorporates extensive prompt engineering and parameter tuning developed through iterative testing.

🔧 Script Features

Dual Prompt Engineering:

  • Turkish Prompt: Specifically crafted system prompt with strict response guidelines (max 4 sentences)
  • English Prompt: Parallel optimized prompt maintaining consistency across languages
  • Language Detection: Multi-layer detection using Turkish characters, common words, and a langdetect library fallback (sketched below)
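
A minimal sketch of this layered detection (illustrative only; the actual word lists and ordering in cengbot_inference.py may differ):

from langdetect import detect, LangDetectException

TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")
TURKISH_WORDS = {"merhaba", "nasıl", "nedir", "hangi", "için"}  # illustrative subset

def detect_language(text: str) -> str:
    # Layer 1: Turkish-specific characters are a strong signal
    if any(ch in TURKISH_CHARS for ch in text):
        return "tr"
    # Layer 2: common Turkish words
    if any(word in TURKISH_WORDS for word in text.lower().split()):
        return "tr"
    # Layer 3: statistical fallback via langdetect
    try:
        return "tr" if detect(text) == "tr" else "en"
    except LangDetectException:
        return "en"  # default for very short or ambiguous input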

Optimized Generation Parameters:

# Carefully tuned through extensive testing
temperature = 0.7        # Balanced creativity vs consistency
max_new_tokens = 200     # Sufficient for detailed answers
repetition_penalty = 1.1 # Prevents repetitive responses
top_p = 0.95             # Nucleus sampling for quality
top_k = 50               # Top-k filtering for coherence
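
As a sketch of how these values feed into a Hugging Face generate() call (sampling must be enabled for temperature, top_p, and top_k to take effect):

outputs = model.generate(
    **inputs,
    do_sample=True,          # required for temperature/top_p/top_k to apply
    temperature=0.7,
    max_new_tokens=200,
    repetition_penalty=1.1,
    top_p=0.95,
    top_k=50,
)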

Smart Response Control:

  • Greeting Detection: Automatically reduces token usage for simple greetings (30 tokens vs 100)
  • Response Limiting: Enforces a maximum of 4 sentences with intelligent sentence splitting (see the sketch after this list)
  • Question Limiting: Restricts model to ask maximum 1 question per response
  • Language-Specific Processing: Different response extraction logic for Turkish vs English
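
A sketch of the greeting and sentence-limit heuristics (hypothetical helper names; the script's actual logic may be more involved):

import re

GREETINGS = {"merhaba", "selam", "hello", "hi", "hey"}  # illustrative subset

def is_greeting(text: str) -> bool:
    # Very short messages made up only of greeting words get the reduced token budget
    words = [w.strip(".,!?") for w in text.lower().split()]
    return 0 < len(words) <= 3 and all(w in GREETINGS for w in words)

def limit_response(text: str, max_sentences: int = 4) -> str:
    # Split on sentence-ending punctuation and keep at most max_sentences
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])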

Prompt Engineering Process: The system prompts were developed through extensive trial-and-error testing to:

  • Maintain university-specific context
  • Ensure concise but helpful responses
  • Balance friendliness with professionalism
  • Optimize for educational support scenarios
  • Handle both casual questions and technical inquiries

System Prompt Examples:

Turkish System Prompt (Optimized; the parallel English prompt below serves as its translation):

Sen Çukurova Üniversitesi Bilgisayar Mühendisliği bölümünün deneyimli dijital asistanısın. 
Öğrencilere samimi, yardımsever ve doğru bilgiler vererek destek oluyorsun.

Önemli kurallar:
- Her soruya MAKSİMUM 4 CÜMLE ile yanıt ver
- Cevapların çok kısa, net ve anlaşılır olmalı
- Gereksiz açıklamalardan ve tekrarlardan kesinlikle kaçın
- Selamlama mesajlarına tek cümlelik karşılık ver
- Sadece sorulan soruya odaklan, ekstra bilgi verme
- Cevaplarında MAKSİMUM 1 adet soru sorabilirsin

English System Prompt (Parallel):

You are an experienced digital assistant for Çukurova University Computer Engineering Department. 
You help students by providing friendly, helpful and accurate information.

Important rules:
- Answer each question with MAXIMUM 4 SENTENCES
- Keep your answers very short, clear and understandable
- Absolutely avoid unnecessary explanations and repetitions
- Reply to greeting messages with a single sentence
- Focus only on the asked question, don't provide extra information
- You can ask MAXIMUM 1 question in your answers
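
The full chat template lives inside the script and is not reproduced on this card. Purely as an illustration of how a system prompt and user question could be assembled (a hypothetical layout, not the script's actual template):

def build_prompt(system_prompt: str, question: str) -> str:
    # Hypothetical structure shown only for illustration; training and
    # inference must share the identical template (see "Template Consistency" below)
    return f"{system_prompt}\n\nSoru / Question: {question}\nCevap / Answer:"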

🚀 How to Get Started

Direct Use:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,  # matches the training dtype
).to(device)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Naholav/cengbot-lora-tr-en-cukurova")

# Generate a response
inputs = tokenizer(
    "Merhaba, bilgisayar mühendisliği hakkında bilgi verebilir misin?",
    return_tensors="pt",
).to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
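
Note that this minimal example skips the dual system prompts and the response post-processing described above; run cengbot_inference.py to get the tuned behavior.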

Simple Setup:

  1. Install dependencies:

pip install torch transformers peft langdetect

  2. Download and run:

wget https://raw.githubusercontent.com/naholav/cengbot/main/cengbot_inference.py
python cengbot_inference.py

That's it! The model will be downloaded automatically from HuggingFace.

🎯 Features

  • Dual Language Support: Automatically detects Turkish or English input using langdetect
  • Dual Prompt System: Separate optimized prompts for Turkish and English responses
  • Dual Language Validation: Separate TR/EN validation during training (TR Loss: 0.6750, EN Loss: 0.7387)
  • University-Specific: Trained on Çukurova University Computer Engineering data
  • Concise Responses: Maximum 4 sentences per response with smart length control
  • Smart Greeting Detection: Optimized token usage for greetings vs detailed questions
  • Language Detection: Advanced Turkish character and word detection fallbacks
  • Easy Setup: Just copy and run the inference script

📋 Architecture & Training Details

Architecture:

  • Base Model: meta-llama/Llama-3.2-3B (3.2B parameters, 28 layers, 3072 hidden size)
  • Model Type: LLaMA (torch.bfloat16)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • LoRA Config: r=16, alpha=32, dropout=0.1 (see the PEFT sketch after this list)
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
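
In PEFT terms this corresponds roughly to the following configuration (a sketch; the task type and any arguments not listed above are assumptions):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",  # assumed; standard for causal-LM fine-tuning
)
peft_model = get_peft_model(base_model, lora_config)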

Training Specifications:

  • Languages: Turkish, English (near-balanced bilingual training)
  • Domain: Computer Engineering Education (Çukurova University specific)
  • Response Style: Concise, helpful, friendly (maximum 4 sentences)
  • Prompt Engineering: Dual system prompts with language-specific optimization
  • Language Detection: Advanced Turkish character + word detection with langdetect fallback
  • Generation Parameters: temp=0.7, top_p=0.95, top_k=50, rep_penalty=1.1

Hardware & Performance:

  • Training Hardware: RTX 5090 32GB (90% utilization), 128GB RAM, AMD Ryzen 9 9950X
  • Best Validation: 0.6920 overall, 0.6636 TR, 0.7268 EN
  • Inference Speed: Optimized for real-time chat with greeting detection

💻 System Requirements

For Inference:

  • GPU: CUDA-compatible (8GB+ VRAM recommended)
  • RAM: 8GB minimum, 16GB recommended
  • Python: 3.8+
  • Internet: For initial model download

📖 Usage Examples

Turkish:

👤 You: Merhaba, bilgisayar mühendisliği hakkında bilgi verebilir misin?
🤖 CengBot: Merhaba! Bilgisayar mühendisliği, yazılım geliştirme, sistem tasarımı ve teknoloji çözümleri üzerine odaklanan bir disiplindir. Çukurova Üniversitesi'nde güçlü bir müfredatımız var. Hangi konuda detay istiyorsun?

(Translation: "Hello, can you tell me about computer engineering?" / "Hello! Computer engineering is a discipline focused on software development, system design, and technology solutions. We have a strong curriculum at Çukurova University. Which topic would you like details on?")

English:

👤 You: What programming languages should I learn first?
🤖 CengBot: I recommend starting with Python for its simplicity and versatility, then moving to Java or C++ for fundamental concepts. These languages provide a solid foundation for computer engineering. Which area interests you most - web development, systems programming, or data science?

🔧 Training Performance

Dataset Information:

  • Total Examples: 22,524 (11,187 Turkish, 11,337 English)
  • Training Set: 20,948 examples
  • Validation Sets: 783 Turkish, 793 English (a 7% split per language; see the sketch below)
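
A sketch of reproducing such a per-language 7% validation split with the datasets library (the language column name is an assumption; adjust to the dataset's actual schema):

from datasets import load_dataset

ds = load_dataset("Naholav/cukurova_university_chatbot", split="train")

# "language" is an assumed column name
tr_split = ds.filter(lambda ex: ex["language"] == "tr").train_test_split(test_size=0.07, seed=42)
en_split = ds.filter(lambda ex: ex["language"] == "en").train_test_split(test_size=0.07, seed=42)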

Training Progress:

  • Total Steps: 1170/1312 (Early stopping activated at step 1170)
  • Best Eval Loss: 0.6920 (Step 975) → Model saved from best checkpoint
  • Final Train Loss: 0.5104 (training loss was still decreasing when early stopping triggered)
  • Early Stopping: Patience=3, triggered due to eval loss increase
  • Training Time: ~3 hours on RTX 5090

Best Model Performance (Step 975):

  • Overall Eval Loss: 0.6920
  • Turkish Validation Loss: 0.6636
  • English Validation Loss: 0.7268
  • Average Loss: 0.6952

Training Features:

  • Dual Language Evaluation: Separate TR/EN validation tracking every step
  • Early Stopping: Patience=3 prevents overfitting, saves best model
  • Advanced Optimization: Mixed precision (BF16), gradient checkpointing (see the Trainer sketch after this list)
  • LoRA Configuration: r=16, alpha=32, dropout=0.1, targets all linear layers
  • Hardware Optimization: RTX 5090 32GB (90% VRAM usage), 128GB RAM, 16-core CPU
  • Memory Management: Pre-cached datasets in RAM, persistent workers
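
These settings map onto the Trainer API roughly as follows (a sketch; any values not stated on this card, such as step intervals, are assumptions):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="cengbot-lora",
    bf16=True,                          # mixed precision (BF16)
    gradient_checkpointing=True,        # trades compute for VRAM headroom
    eval_strategy="steps",              # periodic evaluation drives early stopping
    save_strategy="steps",              # must match eval_strategy for best-model loading
    load_best_model_at_end=True,        # keeps the best checkpoint (step 975 here)
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=peft_model,                   # the LoRA-wrapped model from the sketch above
    args=args,
    train_dataset=train_ds,             # placeholder names for the prepared splits
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)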

Prompt Engineering & Data Processing:

  • Dual Template System: Separate training templates for Turkish and English
  • Response Masking: The loss is computed only on the assistant responses, so the LoRA adapter learns from answers rather than user questions (see the sketch after this list)
  • Template Consistency: Identical structure between training and inference prompts
  • Balanced Dataset: Near-equal Turkish (11,187) and English (11,337) examples
  • Response Quality Control: Training data filtered for concise, educational responses
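
A common way to implement response masking is to set the prompt positions of the label tensor to -100, which PyTorch's cross-entropy loss ignores (a sketch under that assumption; the actual training code may differ):

import torch

def build_labels(input_ids: torch.Tensor, prompt_length: int) -> torch.Tensor:
    # Positions labeled -100 are ignored by the loss, so gradients
    # flow only from the assistant-response tokens
    labels = input_ids.clone()
    labels[:prompt_length] = -100
    return labels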

⚠️ Limitations and Bias

  • Domain Specific: Optimized for computer engineering education
  • Response Length: Limited to maximum 4 sentences for conciseness
  • University Context: Trained specifically on Çukurova University data
  • Language Balance: Near-balanced TR/EN training, though validation loss is slightly higher for English (0.7268) than for Turkish (0.6636)

🔧 Technical Specifications

Hardware: RTX 5090 32GB, 128GB RAM, AMD Ryzen 9 9950X
Training Time: ~3 hours
Framework: PEFT 0.15.2, Transformers, PyTorch
Optimization: Mixed precision (BF16), gradient checkpointing

📈 Environmental Impact

Hardware Type: NVIDIA RTX 5090
Hours Used: ~3 hours
Compute Region: Local training setup
Carbon Emissions: Minimal due to efficient LoRA training

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

📚 Citation

@misc{cengbot2025,
  title={CengBot: A Bilingual AI Assistant for Computer Engineering Education},
  author={naholav},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Naholav/cengbot-lora-tr-en-cukurova}
}

Happy learning! 🎓
