---
library_name: transformers
tags:
- qwen3
- qwen3moe
- mixture-of-experts
- llm
- text-generation
- instruction-following
- agentic-ai
- tool-use
- low-resource
- edge-ai
- from-scratch
- causal-lm
license: apache-2.0
datasets:
- kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1
---
|
# LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)
|
|
---

## Model Card
|
**LoggenixMoE133M** is a small Mixture-of-Experts (MoE) causal language model trained **from scratch** on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.
|
- **Architecture**: A lightweight transformer with Mixture-of-Experts routing, inspired by the architectural design of the Qwen3 MoE models.
- **Parameter Count**: 133M total, with 2 experts active per token (approx. 80M active parameters per step).
- **Experts**: 8 total, gated per token via learned router logits.
- **Activation Strategy**: Top-2 routing with an auxiliary routing loss.
- **Tokenizer Features**: The tokenizer includes dedicated special tokens for agentic capabilities, `<tool_call>` and `<think>`, intended to support reasoning, planning, and interaction with external tools, so the model can serve as a foundation for building AI agents (see the sketch below).
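
As a quick, hedged illustration of the agentic tokens mentioned above, the sketch below simply checks that `<tool_call>` and `<think>` are present in the published tokenizer's vocabulary; the repository id is taken from the usage example further down, and the prompt string is purely illustrative.

```python
from transformers import AutoTokenizer

# Repo id taken from the usage example below; swap in another checkpoint if needed.
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
)

# Confirm the agentic special tokens map to dedicated ids in the vocabulary.
for token in ["<tool_call>", "<think>"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))

# Illustrative prompt showing how a reasoning span could be marked with <think>.
example = "<think>The user wants a factorial function.</think>"
print(tokenizer(example)["input_ids"])
```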
|
---

## Training Details
|
| Attribute             | Value                                                 |
|-----------------------|-------------------------------------------------------|
| Total Params          | 133M                                                  |
| MoE Config            | 8 experts, top-2 gating                               |
| Dataset Type          | RCA, code, and logic prompts (15+ task splits)        |
| Training Epochs       | 5                                                     |
| Effective Tokens Seen | 1.5 billion                                           |
| Train Loss (final)    | 3.263                                                 |
| Val Loss (final)      | 3.327                                                 |
| Mean Token Accuracy   | ~48%                                                  |
| Optimizer             | AdamW                                                 |
| Scheduler             | Linear warmup + cosine decay                          |
| Precision             | FP16 with GradScaler                                  |
| Checkpoint Format     | HF-compatible                                         |
| Training Cost         | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090)  |
| Context Length        | 1024 tokens                                           |
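
For readers who want to reproduce a similar setup, here is a minimal, hedged sketch of the optimizer and scheduler combination listed above. The learning rate follows the `lr5e4` hint in the checkpoint name, while the step count and warmup fraction are illustrative assumptions rather than documented settings.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
)

# Illustrative values only: lr follows the "lr5e4" hint in the checkpoint name;
# the warmup fraction and step count are assumptions, not documented settings.
num_training_steps = 10_000
num_warmup_steps = int(0.05 * num_training_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# FP16 with a gradient scaler, matching the precision row in the table above.
scaler = torch.cuda.amp.GradScaler()
```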
|
---

## Intended Use

### Suitable for:
|
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- **Fine-tuning for specific tasks on edge devices** (e.g., smart home voice assistants, offline mobile chatbots, industrial IoT anomaly detection); see the sketch after this list
- **Building specialized AI agents** that reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
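
As a rough, hedged illustration of the fine-tuning use case above, the sketch below runs a standard Hugging Face `Trainer` loop over a tiny, made-up instruction dataset; the example texts, hyperparameters, and output directory are placeholders, not settings from the original training run.

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

repo_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Hypothetical toy dataset; replace with your own task-specific examples.
texts = [
    "Instruction: summarize the incident log.\nResponse: The service crashed after a config change.",
    "Instruction: write a Python one-liner to reverse a string.\nResponse: s[::-1]",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetuned",   # placeholder path
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=5e-5,                # illustrative, not the original lr
        fp16=torch.cuda.is_available(),
        report_to="none",
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```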
|

### Not suitable for:

- Long-context tasks (beyond the 1,024-token context window)
- High-stakes factual QA
- Safety-critical decision-making without human oversight
|

---

## Example Usage
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]

# Build the chat prompt and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Alternatively, generate with explicit sampling parameters:
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,       # reduced for testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,         # disable caching to avoid potential issues
    )

generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
|
---

## Expert Routing
|

This model uses a top-2 gating mechanism: for each token, two of the eight experts are selected based on learned router logits.

During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.

Note: routing logits are optionally available in the model outputs via `output_router_logits=True`, as sketched below.
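
Below is a minimal sketch of inspecting those routing decisions at inference time, assuming the checkpoint follows the usual transformers MoE convention of returning one router-logit tensor per MoE layer when `output_router_logits=True` is passed; attribute names and shapes may differ slightly for this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Disk latency spiked after the deploy.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# Assumption: out.router_logits is a tuple with one (num_tokens, num_experts)
# tensor per MoE layer, as in other MoE models in transformers.
first_layer_logits = out.router_logits[0]
top2 = first_layer_logits.topk(2, dim=-1).indices
print("Experts chosen per token in layer 0:", top2.tolist())
```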
|

---

## License

This model is released under the Apache 2.0 License.

---

## Acknowledgements

Trained using:

- Hugging Face Transformers
- A custom training loop with gradient checkpointing
- NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB) GPUs
- Logging and tracking via Weights & Biases

---
|

## Citation

```bibtex
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title  = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note   = {Trained from scratch on RCA + code + reasoning dataset.}
}
```

---