# aquif-3-preview
aquif-3 is a lightweight, high-efficiency mixture-of-experts model. Built on a new Mamba-2 hybrid-recurrent architecture, it shows strong reasoning capabilities and activates only ~1B parameters per forward pass while still delivering competitive results across multiple benchmarks.
## Model Overview
- Name: aquif-3-moe-p1
- Parameters: 6.5 Billion (Mixture-of-Experts)
- Active Parameters: ~1 Billion
- Architecture: Decoder-only transformer (Hybrid-recurrent MoE, 2-of-8 routing)
- Context Window: 128,000 tokens
- Type: General-purpose LLM
- Hosted on: HuggingFace only (llama.cpp does not support this architecture yet)
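As a rough illustration of the figures above, the sketch below estimates the weight footprint and the fraction of parameters used per token (the 2-bytes-per-parameter figure assumes bfloat16 weights):

```python
# Back-of-the-envelope arithmetic for a 6.5B-total, ~1B-active MoE (assumes bfloat16 weights)
total_params = 6.5e9
active_params = 1.0e9
bytes_per_param = 2  # bfloat16

print(f"Approximate weight memory: {total_params * bytes_per_param / 1e9:.1f} GB")          # ~13.0 GB
print(f"Share of parameters active per forward pass: ~{active_params / total_params:.0%}")  # ~15%
```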
## Key Features
- Sparse, fast & efficient: Uses only ~1B parameters per generation
- Thinking Mode: Activates deeper chain-of-thought reasoning
- 128K context: Supports long conversations, documents, transcripts, and planning
- Runs on local machines: Ideal for edge use, low-resource devices, or offline use
## Performance Benchmarks
aquif MoE delivers strong performance despite its minimal activation size:
| Benchmark | aquif-3.0-preview-2 (2.5B active) | aquif-3-moe-p1 (1B active) |
|---|---|---|
| MMLU | 55.9 | 60.4 |
| HumanEval | 80.5 | 82.4 |
| GSM8K | 72.5 | 70.1 |
| Average | 69.6 | 71.0 |
These results reflect internal evaluations on representative test sets. Final scores may vary slightly in public benchmarks.
## Thinking Mode
To enhance reasoning, activate "thinking mode" with the following control message before your prompt:
```json
{
  "role": "control",
  "content": "thinking"
}
```
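For example, the control message can simply be prepended to the conversation you pass to the chat template (a minimal sketch; the conversation content here is illustrative):

```python
# Sketch: prepend the control message to request thinking mode for this conversation
conv = [
    {"role": "control", "content": "thinking"},
    {"role": "user", "content": "A train travels 90 km in 45 minutes. What is its average speed in km/h?"},
]
```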
Alternatively, pass `thinking=True` to `apply_chat_template` in your HuggingFace code:
```python
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=True,
    return_dict=True,
    add_generation_prompt=True,
).to(device)
```
This enables internal self-reflection logic and improves multi-step task accuracy.
## Getting Started
To run the model via HuggingFace, you need to install IBM's `granitemoe_hybrid_external_cleanup` branch instead of the regular HF `transformers` package, as aquif-3-preview is a finetune of Granite-4.0-Tiny-Base:
```bash
git clone https://github.com/Ssukriti/transformers.git
cd transformers
git checkout granitemoe_hybrid_external_cleanup
pip install -e .
```
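As a quick sanity check (not part of the official setup), you can confirm that Python resolves `transformers` to the editable clone rather than a previously installed release:

```python
# Verify that the editable install from the cloned branch is the one being imported
import transformers
print(transformers.__file__)     # should point inside the cloned transformers/ directory
print(transformers.__version__)
```

With the branch installed, load the model and run a simple generation: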
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

model_path = "aquiffoo/aquif-3-moe-p1"
device = "cuda"

# Load the model in bfloat16 and place it on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build a single-turn conversation and apply the chat template (thinking mode off)
conv = [{"role": "user", "content": "Hi!"}]
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=False,
    return_dict=True,
    add_generation_prompt=True,
).to(device)

set_seed(42)
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
)

# Decode only the newly generated tokens
prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)
```
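To generate with reasoning enabled, the same pipeline can be reused with `thinking=True` (a sketch building on the code above; the prompt is illustrative):

```python
# Same setup as above, but with thinking mode switched on
conv = [{"role": "user", "content": "A bookshelf holds 3 shelves of 12 books each. If 7 books are removed, how many remain?"}]
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=True,
    return_dict=True,
    add_generation_prompt=True,
).to(device)

output = model.generate(**input_ids, max_new_tokens=8192)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True))
```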
The future of aquif AI includes both dense and Mixture-of-Experts models, which are smarter and more efficient at inference. We can't wait to see what you create with aquif-3.