🧠 Model Card: Daemontatox/Droidz
Daemontatox/Droidz is a highly optimized, compact language model built on top of `unsloth/qwen3-1.7b`, engineered for fast, intelligent inference on consumer-grade devices. It is part of an ongoing research effort to close the performance gap between small and large language models through architectural efficiency, reflective reasoning techniques, and lightweight distributed training.
🧬 Objective
The goal of Droidz is to:
- Achieve close-to-7B model quality with <2B parameter models.
- Support edge deployment: mobile, CPU, small GPU.
- Provide accurate, fast, reflective generation in constrained environments.
- Enable scalable fine-tuning through efficient, distributed training pipelines.
🛠️ Model Overview
| Field | Detail |
|---|---|
| Base model | `unsloth/qwen3-1.7b` |
| Architecture | Transformer (Qwen3 architecture, ~2.7× faster RoPE) |
| Finetuned on | Proprietary curated instruction + reasoning dataset |
| Training Method | Distributed LoRA + FlashAttention-2 + PEFT + DDP |
| Model Size | ~1.7B parameters |
| Precision | bfloat16 (training); int4/int8 supported for inference (see the sketch below) |
| Language | English only (monolingual) |
| License | Apache-2.0 |
| Intended Use | Conversational AI, edge agents, assistants, embedded systems |
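The precision row notes int4/int8 inference support. As a minimal sketch (not an official quantized release), 4-bit loading via `bitsandbytes` could look like this; the NF4 and compute-dtype settings are illustrative assumptions:

```python
# Hedged sketch: 4-bit NF4 loading via bitsandbytes. Assumes `bitsandbytes`
# and a CUDA GPU are available; not an official quantized release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # int4 weight storage
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Droidz",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Daemontatox/Droidz")
```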
🏗️ Training Details
Training Infrastructure
- Frameworks: `transformers`, `unsloth`, `accelerate`, `PEFT`
- Backends: fully distributed via DeepSpeed ZeRO-2, DDP, FSDP, and FlashAttention-2
- Devices: A100 (80GB), RTX 3090 clusters, TPU v5e (mixed)
- Optimizer: AdamW with a cosine LR schedule and warmup steps (see the sketch after this list)
- Batching: Dynamic packing enabled, up to 2048 context tokens
- Checkpointing: Async gradient checkpointing for memory efficiency
- Duration: ~1.2M steps across multiple domains
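As a rough illustration of how the pieces above fit together, here is a minimal PEFT LoRA setup with AdamW and a cosine warmup schedule; the rank, alpha, target modules, learning rate, and step counts are illustrative assumptions, not the card's exact recipe:

```python
# Hedged sketch of the LoRA + AdamW + cosine-warmup setup listed above.
# Rank, alpha, target modules, LR, and step counts are assumptions.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/qwen3-1.7b", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                          # assumed rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=1_200_000
)
```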
Finetuning Methodology
- Reflection prompting: Models are trained to self-verify and revise outputs.
- Instruction tuning: Curated prompt-response pairs across diverse reasoning domains.
- Multi-domain generalization: Code, logic puzzles, philosophy, and conversational tasks.
- Optimization: Gradient accumulation + progressive layer freezing (a freezing sketch follows this list).
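A minimal sketch of the progressive layer freezing mentioned above, assuming the Qwen3/Llama-style `model.model.layers` module layout; the unfreezing schedule itself is an illustrative assumption:

```python
# Hedged sketch: progressive layer freezing. Start with only the top decoder
# blocks trainable and widen the trainable window as training progresses.
# The `model.model.layers` path assumes a Qwen3/Llama-style module layout.
def set_trainable_layers(model, num_unfrozen: int):
    """Freeze all decoder blocks except the top `num_unfrozen`."""
    layers = model.model.layers
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - num_unfrozen
        for param in layer.parameters():
            param.requires_grad = trainable

# Illustrative schedule: unfreeze more blocks at fixed step milestones.
UNFREEZE_SCHEDULE = {0: 4, 100_000: 8, 200_000: 16}

def maybe_unfreeze(model, step: int):
    if step in UNFREEZE_SCHEDULE:
        set_trainable_layers(model, UNFREEZE_SCHEDULE[step])
```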
🔮 Example Use Cases
- Conversational AI for mobile and web apps
- Offline reasoning agents (Raspberry Pi, Jetson Nano, etc.)
- Embedded chatbots with local-only privacy
- Edge-side logic assistants for industry-specific workflows
- Autonomous tools for summarization, code suggestion, self-verification
⚡ Inference Code
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "Daemontatox/Droidz"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # or {"": "cuda:0"} for manual placement
    torch_dtype="auto",  # uses bf16/fp16 if available
)

streamer = TextStreamer(tokenizer)

prompt = "Explain the concept of reinforcement learning simply."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```
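The snippet above feeds the model a raw prompt. Qwen3-based checkpoints usually ship a chat template; assuming this one does, routing the request through `apply_chat_template` typically yields better-formatted responses:

```python
# Continuation of the snippet above; assumes the checkpoint ships a chat
# template (typical for Qwen3-based models).
messages = [{"role": "user", "content": "Explain reinforcement learning simply."}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
_ = model.generate(chat_inputs, max_new_tokens=200, streamer=streamer)
```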
🧪 Performance Benchmarks
| Hardware | Mode | Throughput | VRAM / RAM | Notes |
|---|---|---|---|---|
| RTX 3060 12GB (FP16) | Transformers | ~37 tokens/s | ~5.1 GB | Good for batch inference |
| MacBook M2 (Metal backend) | Transformers | ~23 tokens/s | ~3.6 GB | Works well on 8-core M2 |
| Intel i7-12700H (CPU-only) | GGUF (Q4) | ~8 tokens/s | ~4.1 GB | llama.cpp via `llm` or KoboldCpp |
| Jetson Orin Nano (8GB) | INT4 GGUF | ~6 tokens/s | ~3.2 GB | Embedded/IoT ready |
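For the CPU and embedded rows above, a minimal GGUF sketch with `llama-cpp-python` might look like the following; the filename is hypothetical and assumes a Q4 conversion of Droidz has been produced:

```python
# Hedged sketch: CPU inference via llama-cpp-python. The GGUF filename is
# hypothetical; it assumes a Q4_K_M conversion of Droidz exists locally.
from llama_cpp import Llama

llm = Llama(model_path="droidz-q4_k_m.gguf", n_ctx=2048, n_threads=8)
out = llm("Explain the concept of reinforcement learning simply.", max_tokens=200)
print(out["choices"][0]["text"])
```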
🧠 Prompt Samples
❓ Prompt: "What is backpropagation in neural networks?"
Backpropagation is a training algorithm that adjusts a neural network's weights by computing gradients of the error from the output back to the input layers using the chain rule. It's the core of how neural networks learn.
🔧 Prompt: "Fix the bug: `print('Score:' + 100)`"
You're trying to concatenate a string with an integer. Use:
`print('Score:' + str(100))`
📜 Prompt: "Summarize the Stoic concept of control."
Stoics believe in focusing only on what you can control, your actions and thoughts, while accepting what you cannot control with calm detachment.
📊 Quantization Support (Deployment-Ready)
| Format | Status | Tool | Notes |
|---|---|---|---|
| GGUF | ✅ Stable | llama.cpp | Works on CPUs, Android, Web |
| GPTQ | ✅ Stable | AutoGPTQ | For fast GPU inference (see the sketch below) |
| AWQ | ✅ Tested | AutoAWQ | 4-bit low-latency inference |
| FP16 | ✅ Native | Transformers | RTX/Apple Metal ready |
| bfloat16 | ✅ Native | Transformers | For A100/TPU-friendly runs |
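As a sketch of the GPTQ path, a 4-bit checkpoint could be produced with the `GPTQConfig` API in `transformers` (requires `optimum` and `auto-gptq`); the bit-width and calibration dataset here are assumptions, not the card's recipe:

```python
# Hedged sketch: producing a 4-bit GPTQ checkpoint with transformers'
# GPTQConfig. Requires `optimum` and `auto-gptq`; the calibration dataset
# ("c4") and bit-width are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Daemontatox/Droidz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
quantized.save_pretrained("droidz-gptq-4bit")
```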
🧱 Architecture Enhancements
- FlashAttention-2: fused softmax and dropout for a 2–3× attention speedup.
- Unsloth patches: accelerated training/inference kernel replacements.
- RoPE scaling: extended context window support for long-input reasoning (see the sketch after this list).
- Rotary embedding interpolation: improves generalization beyond the pretraining length.
- LayerDrop + activation checkpointing: memory-efficient training.
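A minimal sketch of the RoPE scaling mentioned above, applied at load time to stretch the 2048-token window; the scaling type and factor are illustrative assumptions, and output quality beyond the trained length is not guaranteed:

```python
# Hedged sketch: extending the context window via RoPE scaling at load time.
# The scaling type and 2x factor are illustrative; quality beyond the trained
# 2048-token window is not guaranteed.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Droidz",
    device_map="auto",
    rope_scaling={"rope_type": "linear", "factor": 2.0},  # ~4096-token window
)
```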
✅ Intended Use
| Use Case | Suitable |
|---|---|
| Local chatbots / assistants | ✅ |
| Developer coding copilots | ✅ |
| Offline reasoning agents | ✅ |
| Educational agents | ✅ |
| Legal / financial advisors | ❌ |
| Medical diagnosis | ❌ |
The model is not suitable for domains where accuracy or factual correctness is critical without verification.
🚫 Known Limitations
- Context length currently capped at 2048 (can be increased via RoPE interpolation).
- Struggles with long-form generation (>1024 tokens).
- Not multilingual (yet).
- Sensitive to prompt phrasing unless a chain-of-thought (CoT) or self-correction format is used.
🚀 Roadmap
- Expand to multilingual support via cross-lingual bootstrapping.
- Integrate Mamba-style recurrence for long-context inference.
- Release optimized GGUF + quantized weights for browser/Android.
- Explore retrieval-augmented reflection (RAR) capabilities.
👨‍💻 Author
- Name: Daemontatox
- Affiliation: Independent Researcher
- Contact: HuggingFace Profile
- Focus: LLM compression, theory of mind, agent intelligence on the edge
📖 Citation
```bibtex
@misc{daemontatox2025droidz,
  title        = {Droidz: A Fast, Reflective Small Language Model for Reasoning on Edge Devices},
  author       = {Daemontatox},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Daemontatox/Droidz}},
  note         = {Ongoing Research}
}
```