aquif-3-preview

aquif-3 is a lightweight, highly efficient mixture-of-experts (MoE) model. Built on a new Mamba-2 hybrid-recurrent architecture, it delivers strong reasoning while activating only ~1B parameters per forward pass, and remains competitive across multiple benchmarks.

Model Overview

  • Name: aquif-3-moe-p1
  • Parameters: 6.5 Billion (Mixture-of-Experts)
  • Active Parameters: ~1 Billion
  • Architecture: Decoder-only transformer (hybrid-recurrent Mamba-2 MoE, 2-of-8 routing; see the sketch below)
  • Context Window: 128,000 tokens
  • Type: General-purpose LLM
  • Hosted on: Hugging Face only (llama.cpp does not yet support this architecture)
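
The 2-of-8 routing above means a small learned router scores all eight experts for every token and dispatches it to only the top two, which is why only ~1B of the 6.5B parameters run on any forward pass. Below is a minimal, illustrative sketch of that routing pattern in PyTorch; the class name, dimensions, and expert design are assumptions for clarity, not the actual Granite/aquif implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Illustrative top-2-of-8 routing layer (not the actual aquif/Granite code)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top2MoE()
tokens = torch.randn(4, 512)
print(moe(tokens).shape)                                   # torch.Size([4, 512]); only 2 of 8 experts ran per token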

Key Features

  • Sparse, fast & efficient: Activates only ~1B parameters per forward pass
  • Thinking Mode: Activates deeper chain-of-thought reasoning
  • 128K context: Supports long conversations, documents, transcripts, and planning
  • Runs on local machines: Ideal for edge use, low-resource devices, or offline use
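
As a rough sense of what "runs locally" means (a back-of-the-envelope estimate, not an official figure): storing all ~6.5B parameters in BF16 takes about 13 GB for the weights alone, before KV cache and activation memory.

total_params = 6.5e9          # full parameter count (only ~1B are active per token)
bytes_per_param = 2           # bfloat16
print(f"~{total_params * bytes_per_param / 1e9:.0f} GB of weights")  # ~13 GB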

Performance Benchmarks

aquif MoE delivers strong performance despite its minimal activation size:

Benchmark    aquif-3.0-preview-2 (2.5B active)    aquif-3-moe-p1 (1B active)
MMLU         55.9                                 60.4
HumanEval    80.5                                 82.4
GSM8K        72.5                                 70.1
Average      69.6                                 71.0

These results reflect internal evaluations on representative test sets. Final scores may vary slightly in public benchmarks.

Thinking Mode

To enhance reasoning, activate "thinking mode" with the following control message before your prompt:

{
  "role": "control",
  "content": "thinking"
}
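
In practice, this means prepending a control turn to the conversation you pass to the chat template. A minimal sketch (the user prompt is just an illustrative placeholder):

conv = [
    {"role": "control", "content": "thinking"},   # control message enabling thinking mode
    {"role": "user", "content": "Plan a 3-day trip to Lisbon."},
]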

Alternatively, pass thinking=True to apply_chat_template in your Hugging Face code:

# Build the prompt with thinking mode enabled (conv is your conversation list; see Getting Started below)
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=True,
    return_dict=True,
    add_generation_prompt=True
).to(device)

This enables internal self-reflection logic and improves multi-step task accuracy.

Getting Started

To run the model via Hugging Face, install IBM's granitemoe_hybrid_external_cleanup branch of transformers instead of the regular release, since aquif-3-preview is a fine-tune of Granite-4.0-Tiny-Base:

git clone https://github.com/Ssukriti/transformers.git
cd transformers
git checkout granitemoe_hybrid_external_cleanup
pip install -e .

Then load the model and generate:

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

model_path = "aquiffoo/aquif-3-moe-p1"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

conv = [{"role": "user", "content":"Hi!"}]

input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=False,
    return_dict=True,
    add_generation_prompt=True,
).to(device)

set_seed(42)
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
)

prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)

The future of aquif AI leans towards both dense and Mixture-of-Experts models, which are smarter and more efficient at inference. We can't wait to see what you build with aquif-3.
