🧠 Model Card: Daemontatox/Droidz
Daemontatox/Droidz is a highly optimized, compact language model built on top of `unsloth/qwen3-1.7b`, engineered for fast, intelligent inference on consumer-grade devices. It is part of an ongoing research effort to close the performance gap between small and large language models through architectural efficiency, reflective reasoning techniques, and lightweight distributed training.
🧬 Objective
The goal of Droidz is to:
- Achieve close-to-7B model quality with <2B parameter models.
- Support edge deployment: mobile, CPU, small GPU.
- Provide accurate, fast, reflective generation in constrained environments.
- Enable scalable fine-tuning through efficient, distributed training pipelines.
🛠️ Model Overview
| Field | Detail |
|---|---|
| Base model | `unsloth/qwen3-1.7b` |
| Architecture | Transformer (Qwen3 architecture, ~2.7× faster RoPE) |
| Finetuned on | Proprietary curated instruction + reasoning dataset |
| Training Method | Distributed LoRA + FlashAttention-2 + PEFT + DDP |
| Model Size | ~1.7B parameters |
| Precision | bfloat16 (training); int4/int8 supported for inference (see the sketch below) |
| Language | English only (monolingual) |
| License | Apache-2.0 |
| Intended Use | Conversational AI, edge agents, assistants, embedded systems |
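The precision row notes int4/int8 inference support. As a minimal sketch (not an official quantized release), 4-bit loading via `bitsandbytes` could look like this; the NF4 and compute-dtype settings are illustrative assumptions:

```python
# Hedged sketch: 4-bit NF4 loading via bitsandbytes. Assumes `bitsandbytes`
# and a CUDA GPU are available; not an official quantized release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # int4 weight storage
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Droidz",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Daemontatox/Droidz")
```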
🏗️ Training Details
Training Infrastructure
- Frameworks: `transformers`, `unsloth`, `accelerate`, `PEFT`
- Backends: fully distributed via DeepSpeed ZeRO-2, DDP, FSDP, and FlashAttention-2
- Devices: A100 (80GB), RTX 3090 clusters, TPU v5e (mixed)
- Optimizer: AdamW with a cosine LR schedule and warmup steps (see the sketch after this list)
- Batching: Dynamic packing enabled, up to 2048 context tokens
- Checkpointing: Async gradient checkpointing for memory efficiency
- Duration: ~1.2M steps across multiple domains
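As a rough illustration of how the pieces above fit together, here is a minimal PEFT LoRA setup with AdamW and a cosine warmup schedule; the rank, alpha, target modules, learning rate, and step counts are illustrative assumptions, not the card's exact recipe:

```python
# Hedged sketch of the LoRA + AdamW + cosine-warmup setup listed above.
# Rank, alpha, target modules, LR, and step counts are assumptions.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/qwen3-1.7b", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                          # assumed rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=1_200_000
)
```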
Finetuning Methodology
- Reflection prompting: Models are trained to self-verify and revise outputs.
- Instruction tuning: Curated prompt-response pairs across diverse reasoning domains.
- Multi-domain generalization: Code, logic puzzles, philosophy, and conversational tasks.
- Optimization: Gradient accumulation + progressive layer freezing (a freezing sketch follows this list).
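A minimal sketch of the progressive layer freezing mentioned above, assuming the Qwen3/Llama-style `model.model.layers` module layout; the unfreezing schedule itself is an illustrative assumption:

```python
# Hedged sketch: progressive layer freezing. Start with only the top decoder
# blocks trainable and widen the trainable window as training progresses.
# The `model.model.layers` path assumes a Qwen3/Llama-style module layout.
def set_trainable_layers(model, num_unfrozen: int):
    """Freeze all decoder blocks except the top `num_unfrozen`."""
    layers = model.model.layers
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - num_unfrozen
        for param in layer.parameters():
            param.requires_grad = trainable

# Illustrative schedule: unfreeze more blocks at fixed step milestones.
UNFREEZE_SCHEDULE = {0: 4, 100_000: 8, 200_000: 16}

def maybe_unfreeze(model, step: int):
    if step in UNFREEZE_SCHEDULE:
        set_trainable_layers(model, UNFREEZE_SCHEDULE[step])
```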
🔮 Example Use Cases
- Conversational AI for mobile and web apps
- Offline reasoning agents (Raspberry Pi, Jetson Nano, etc.)
- Embedded chatbots with local-only privacy
- Edge-side logic assistants for industry-specific workflows
- Autonomous tools for summarization, code suggestion, self-verification
⚡ Inference Code
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "Daemontatox/Droidz"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # or {"": "cuda:0"} for manual placement
    torch_dtype="auto",  # uses bf16/fp16 if available
)

streamer = TextStreamer(tokenizer)

prompt = "Explain the concept of reinforcement learning simply."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```
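The snippet above feeds the model a raw prompt. Qwen3-based checkpoints usually ship a chat template; assuming this one does, routing the request through `apply_chat_template` typically yields better-formatted responses:

```python
# Continuation of the snippet above; assumes the checkpoint ships a chat
# template (typical for Qwen3-based models).
messages = [{"role": "user", "content": "Explain reinforcement learning simply."}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
_ = model.generate(chat_inputs, max_new_tokens=200, streamer=streamer)
```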
🧪 Performance Benchmarks
| Hardware | Mode | Throughput | VRAM / RAM | Notes |
|---|---|---|---|---|
| RTX 3060 12GB (FP16) | Transformers | ~37 tokens/s | ~5.1 GB | Good for batch inference |
| MacBook M2 (Metal backend) | Transformers | ~23 tokens/s | ~3.6 GB | Works well on 8-core M2 |
| Intel i7-12700H (CPU-only) | GGUF (Q4) | ~8 tokens/s | ~4.1 GB | llama.cpp via `llm` or KoboldCpp |
| Jetson Orin Nano (8GB) | INT4 GGUF | ~6 tokens/s | ~3.2 GB | Embedded/IoT ready |
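For the CPU and embedded rows above, a minimal GGUF sketch with `llama-cpp-python` might look like the following; the filename is hypothetical and assumes a Q4 conversion of Droidz has been produced:

```python
# Hedged sketch: CPU inference via llama-cpp-python. The GGUF filename is
# hypothetical; it assumes a Q4_K_M conversion of Droidz exists locally.
from llama_cpp import Llama

llm = Llama(model_path="droidz-q4_k_m.gguf", n_ctx=2048, n_threads=8)
out = llm("Explain the concept of reinforcement learning simply.", max_tokens=200)
print(out["choices"][0]["text"])
```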
🧠 Prompt Samples
❓ Prompt: "What is backpropagation in neural networks?"
Backpropagation is a training algorithm that adjusts a neural network's weights by computing gradients of the error from the output back to the input layers using the chain rule. It's the core of how neural networks learn.
🔧 Prompt: "Fix the bug: `print('Score:' + 100)`"
You're trying to concatenate a string with an integer. Use:
`print('Score:' + str(100))`
📜 Prompt: "Summarize the Stoic concept of control."
Stoics believe in focusing only on what you can control, your actions and thoughts, while accepting what you cannot control with calm detachment.
📊 Quantization Support (Deployment-Ready)
| Format | Status | Tool | Notes |
|---|---|---|---|
| GGUF | ✅ Stable | llama.cpp | Works on CPUs, Android, Web |
| GPTQ | ✅ Stable | AutoGPTQ | For fast GPU inference (see the sketch below) |
| AWQ | ✅ Tested | AutoAWQ | 4-bit low-latency inference |
| FP16 | ✅ Native | Transformers | RTX/Apple Metal ready |
| bfloat16 | ✅ Native | Transformers | For A100/TPU-friendly runs |
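As a sketch of the GPTQ path, a 4-bit checkpoint could be produced with the `GPTQConfig` API in `transformers` (requires `optimum` and `auto-gptq`); the bit-width and calibration dataset here are assumptions, not the card's recipe:

```python
# Hedged sketch: producing a 4-bit GPTQ checkpoint with transformers'
# GPTQConfig. Requires `optimum` and `auto-gptq`; the calibration dataset
# ("c4") and bit-width are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Daemontatox/Droidz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
quantized.save_pretrained("droidz-gptq-4bit")
```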
🧱 Architecture Enhancements
- FlashAttention-2: fused softmax and dropout for a 2–3× attention speedup.
- Unsloth patches: accelerated training/inference kernel replacements.
- RoPE scaling: extended context window support for long-input reasoning (see the sketch after this list).
- Rotary embedding interpolation: improves generalization beyond the pretraining length.
- LayerDrop + activation checkpointing: memory-efficient training.
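A minimal sketch of the RoPE scaling mentioned above, applied at load time to stretch the 2048-token window; the scaling type and factor are illustrative assumptions, and output quality beyond the trained length is not guaranteed:

```python
# Hedged sketch: extending the context window via RoPE scaling at load time.
# The scaling type and 2x factor are illustrative; quality beyond the trained
# 2048-token window is not guaranteed.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Droidz",
    device_map="auto",
    rope_scaling={"rope_type": "linear", "factor": 2.0},  # ~4096-token window
)
```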
✅ Intended Use
| Use Case | Suitable |
|---|---|
| Local chatbots / assistants | ✅ |
| Developer coding copilots | ✅ |
| Offline reasoning agents | ✅ |
| Educational agents | ✅ |
| Legal / financial advisors | ❌ |
| Medical diagnosis | ❌ |
The model is not suitable for domains where accuracy or factual correctness is critical without verification.
🚫 Known Limitations
- Context length currently capped at 2048 (can be increased via RoPE interpolation).
- Struggles with long-form generation (>1024 tokens).
- Not multilingual (yet).
- Sensitive to prompt phrasing unless a chain-of-thought (CoT) or self-correction format is used.
🚀 Roadmap
- Expand to multilingual support via cross-lingual bootstrapping.
- Integrate Mamba-style recurrence for long-context inference.
- Release optimized GGUF + quantized weights for browser/Android.
- Explore retrieval-augmented reflection (RAR) capabilities.
👨‍💻 Author
- Name: Daemontatox
- Affiliation: Independent Researcher
- Contact: HuggingFace Profile
- Focus: LLM compression, theory of mind, agent intelligence on the edge
📖 Citation
```bibtex
@misc{daemontatox2025droidz,
  title        = {Droidz: A Fast, Reflective Small Language Model for Reasoning on Edge Devices},
  author       = {Daemontatox},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Daemontatox/Droidz}},
  note         = {Ongoing Research}
}
```