---
library_name: transformers
tags:
- qwen3
- qwen3moe
- mixture-of-experts
- llm
- text-generation
- instruction-following
- agentic-ai
- tool-use
- low-resource
- edge-ai
- from-scratch
- causal-lm
license: apache-2.0
datasets:
- kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1
---
|
# LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)
|
|
---

## Model Card
|
**LoggenixMoE133M** is a small Mixture-of-Experts (MoE) causal language model trained **from scratch** on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.
|
- **Architecture**: A lightweight transformer with Mixture-of-Experts routing, inspired by the architectural design of the Qwen3 MoE models.
- **Parameter Count**: 133M total, with 2 experts active per token (approx. 80M active parameters per step).
- **Experts**: 8 total, gated per token via learned router logits.
- **Activation Strategy**: Top-2 routing with an auxiliary routing loss.
- **Tokenizer Features**: The tokenizer includes dedicated special tokens for agentic capabilities, `<tool_call>` and `<think>`, intended to support reasoning, planning, and interaction with external tools, so the model can serve as a foundation for building AI agents (see the sketch below).
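
As a quick, hedged illustration of the agentic tokens mentioned above, the sketch below simply checks that `<tool_call>` and `<think>` are present in the published tokenizer's vocabulary; the repository id is taken from the usage example further down, and the prompt string is purely illustrative.

```python
from transformers import AutoTokenizer

# Repo id taken from the usage example below; swap in another checkpoint if needed.
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
)

# Confirm the agentic special tokens map to dedicated ids in the vocabulary.
for token in ["<tool_call>", "<think>"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))

# Illustrative prompt showing how a reasoning span could be marked with <think>.
example = "<think>The user wants a factorial function.</think>"
print(tokenizer(example)["input_ids"])
```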
|
---

## Training Details
|
| Attribute             | Value                                                 |
|-----------------------|-------------------------------------------------------|
| Total Params          | 133M                                                  |
| MoE Config            | 8 experts, top-2 gating                               |
| Dataset Type          | RCA, code, and logic prompts (15+ task splits)        |
| Training Epochs       | 5                                                     |
| Effective Tokens Seen | 1.5 billion                                           |
| Train Loss (final)    | 3.263                                                 |
| Val Loss (final)      | 3.327                                                 |
| Mean Token Accuracy   | ~48%                                                  |
| Optimizer             | AdamW                                                 |
| Scheduler             | Linear warmup + cosine decay                          |
| Precision             | FP16 with GradScaler                                  |
| Checkpoint Format     | HF-compatible                                         |
| Training Cost         | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090)  |
| Context Length        | 1024 tokens                                           |
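
For readers who want to reproduce a similar setup, here is a minimal, hedged sketch of the optimizer and scheduler combination listed above. The learning rate follows the `lr5e4` hint in the checkpoint name, while the step count and warmup fraction are illustrative assumptions rather than documented settings.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
)

# Illustrative values only: lr follows the "lr5e4" hint in the checkpoint name;
# the warmup fraction and step count are assumptions, not documented settings.
num_training_steps = 10_000
num_warmup_steps = int(0.05 * num_training_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# FP16 with a gradient scaler, matching the precision row in the table above.
scaler = torch.cuda.amp.GradScaler()
```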
|
---

## Intended Use

### Suitable for:
|
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- **Fine-tuning for specific tasks on edge devices** (e.g., smart home voice assistants, offline mobile chatbots, industrial IoT anomaly detection); see the sketch after this list
- **Building specialized AI agents** that reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
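
As a rough, hedged illustration of the fine-tuning use case above, the sketch below runs a standard Hugging Face `Trainer` loop over a tiny, made-up instruction dataset; the example texts, hyperparameters, and output directory are placeholders, not settings from the original training run.

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

repo_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Hypothetical toy dataset; replace with your own task-specific examples.
texts = [
    "Instruction: summarize the incident log.\nResponse: The service crashed after a config change.",
    "Instruction: write a Python one-liner to reverse a string.\nResponse: s[::-1]",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetuned",   # placeholder path
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=5e-5,                # illustrative, not the original lr
        fp16=torch.cuda.is_available(),
        report_to="none",
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```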
|

### Not suitable for:

- Long-context tasks (beyond the 1,024-token context window)
- High-stakes factual QA
- Safety-critical decision-making without human oversight
|

---

## Example Usage
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]

# Build the chat prompt and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Alternatively, generate with explicit sampling parameters:
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,       # reduced for testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,         # disable caching to avoid potential issues
    )

generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
|
---

## Expert Routing
|

This model uses a top-2 gating mechanism: for each token, two of the eight experts are selected based on learned router logits.

During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.

Note: routing logits are optionally available in the model outputs via `output_router_logits=True`, as sketched below.
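
Below is a minimal sketch of inspecting those routing decisions at inference time, assuming the checkpoint follows the usual transformers MoE convention of returning one router-logit tensor per MoE layer when `output_router_logits=True` is passed; attribute names and shapes may differ slightly for this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b16-3060-v2-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Disk latency spiked after the deploy.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# Assumption: out.router_logits is a tuple with one (num_tokens, num_experts)
# tensor per MoE layer, as in other MoE models in transformers.
first_layer_logits = out.router_logits[0]
top2 = first_layer_logits.topk(2, dim=-1).indices
print("Experts chosen per token in layer 0:", top2.tolist())
```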
|

---

## License

This model is released under the Apache 2.0 License.

---

## Acknowledgements

Trained using:

- Hugging Face Transformers
- A custom training loop with gradient checkpointing
- NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB) GPUs
- Logging and tracking via Weights & Biases

---
|

## Citation

```bibtex
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title  = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note   = {Trained from scratch on RCA + code + reasoning dataset.}
}
```

---