---
library_name: transformers
tags:
- qwen3
- qwen3moe
- mixture-of-experts
- llm
- text-generation
- instruction-following
- agentic-ai
- tool-use
- low-resource
- edge-ai
- from-scratch
- causal-lm
license: apache-2.0
datasets:
- kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1
---
# 🧠 LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)
[![Model Size](https://img.shields.io/badge/Parameters-133M-blue)]()
[![Experts](https://img.shields.io/badge/Experts-8-lightgrey)]()
[![Routing](https://img.shields.io/badge/Active_Experts-2-orange)]()
[![Active Params](https://img.shields.io/badge/ActiveParameters-80M-red)]()
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
---
## ๐Ÿ“ Model Card
**LoggenixMoE133M** is a small Mixture-of-Experts (MoE) Causal Language Model trained **from scratch** on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.
- **Architecture**: A lightweight transformer with Mixture-of-Experts routing, **inspired by the innovative architectural design of Qwen3 models.**
- **Parameter Count**: 133M total, with 2 experts active per token (approx. 80M active per step).
- **Experts**: 8 total, gated per token with router logits.
- **Activation Strategy**: Top-2 routing with auxiliary routing loss.
- **Tokenizer Features**: The tokenizer includes dedicated special tokens for agentic use, `<tool_call>` and `<think>`, which support reasoning, planning, and interaction with external tools, making the model a foundation for building AI agents.
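
A quick way to confirm these tokens are wired into the released tokenizer is the minimal check below (the repository id is the one used in the usage example later in this card):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
)

# Each agentic token should resolve to its own id rather than the unknown-token id
for token in ["<tool_call>", "<think>"]:
    print(token, "->", tok.convert_tokens_to_ids(token))
```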
---
## ๐Ÿ“Š Training Details
| Attribute | Value |
|------------------------|------------------------------------------------|
| Total Params | 133M |
| MoE Config | 8 experts, top-2 gating |
| Dataset Type | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs | 5 |
| Effective Tokens Seen | 1.5 Billion |
| Train Loss (final) | 3.263 |
| Val Loss (final) | 3.327 |
| Mean Token Accuracy | ~48% |
| Optimizer | AdamW |
| Scheduler | Linear Warmup + Cosine Decay |
| Precision | FP16 with GradScaler |
| Checkpoint Format | HF-compatible |
| Training Cost | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length | 1024 |
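
For orientation, here is a minimal sketch of the optimizer, scheduler, and precision setup listed above. The learning rate follows the `lr5e4` tag in the repository name (read as 5e-4); the warmup and total step counts are illustrative assumptions rather than values from this card:
```python
import torch
from transformers import get_cosine_schedule_with_warmup

def train_fp16(model, dataloader, lr=5e-4, warmup_steps=1_000, total_steps=100_000):
    """AdamW + linear warmup / cosine decay + FP16 GradScaler, as in the table above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    scaler = torch.cuda.amp.GradScaler()

    for batch in dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            # HF causal LMs return .loss when labels are included in the batch
            loss = model(**batch).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
```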
---
## 🧪 Intended Use
### ✅ Suitable for:
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- **Fine-tuning for specific tasks on edge devices** (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection); see the fine-tuning sketch at the end of this section
- **Building specialized AI agents** that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
### 🚫 Not suitable for:
- Long-context tasks (>4K tokens)
- High-stakes factual QA
- Safety-critical decision-making without oversight
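
As a starting point for the fine-tuning use case above, here is a hedged sketch using the Hugging Face `Trainer` and the dataset listed in this card's metadata. The `text` column name and all hyperparameters are illustrative assumptions; adapt them to the dataset's actual schema and your hardware:
```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

repo = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo)

# Small slice for a quick smoke test; use the full split for a real run
dataset = load_dataset("kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1", split="train[:1%]")

def tokenize(example):
    # Assumes a "text" column; adjust to the dataset's actual fields
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetune",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```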
---
## 🧨 Example Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]

# Build the prompt with the chat template and move it to the GPU
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Alternatively, sample with explicit decoding parameters
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,  # reduced for testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,  # disable caching to avoid potential issues
    )
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
---
## 🔧 Expert Routing
This model uses a top-2 gating mechanism: for each token, two of the eight experts are selected based on learned router logits.
During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.
Note: routing logits are optionally available in the model outputs via `output_router_logits=True`.
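
A minimal sketch of inspecting those logits at inference time (assuming the usual Transformers MoE convention of returning one router-logit tensor per MoE layer):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("Summarize the root cause of this outage.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# Expect one tensor per MoE layer; the last dimension should equal the number of experts (8)
print(len(out.router_logits), out.router_logits[0].shape)
```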
---
## 📃 License
This model is released under the Apache 2.0 License.
---
## 🙌 Acknowledgements
Trained using:
- 🧨 Hugging Face Transformers
- 🧠 Custom training loop with gradient checkpointing
- 🧮 NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)
- 📦 Logged and tracked via Weights & Biases
---
### ๐Ÿ—ฃ๏ธ Citation
---
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
title = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
author = {kshitijthakkar},
year = {2025},
url = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060 },
note = {Trained from scratch on RCA + code + reasoning dataset.}
}
---