Zen Omni

Hypermodal Language Model for Translation + Audio Generation

Part of the Zen LM family - democratizing AI while protecting our planet.

Model Specifications

| Attribute | Value |
|---|---|
| Architecture | MoE multimodal (Thinker-Talker) |
| Total Parameters | 30B |
| Active Parameters | 3B (via MoE sparse activation) |
| Text Languages | 119 |
| Speech Input | 19 languages |
| Speech Output | 10 languages |
| Context Length | 32,768 tokens |
| Technical Report | docs/paper/paper.pdf |
| License | Apache 2.0 |

Model Variants

| Variant | Description | Use Case |
|---|---|---|
| zen-omni | Base multimodal model | General purpose |
| zen-omni-instruct | Instruction following | Chat, Q&A, tasks |
| zen-omni-thinking | Chain-of-thought reasoning | Complex reasoning, math |
| zen-omni-captioner | Audio/visual captioning | Transcription, description |

Architecture

Zen Omni is built on a Thinker-Talker MoE architecture:

┌─────────────────────────────────────────────────────────────┐
│                      ZEN OMNI                                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  INPUT ENCODERS                                              │
│  ├── Audio Encoder (32 layers, 1280 dim)                    │
│  ├── Vision Encoder (27 layers, 1152 dim)                   │
│  └── Text Embeddings (151,936 vocab)                        │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────────────────────────┐                │
│  │         THINKER (Multimodal LLM)        │                │
│  │  • 48 transformer layers                 │                │
│  │  • 128 experts (MoE)                     │                │
│  │  • 8 experts active per token            │                │
│  │  • Cross-modal attention fusion          │                │
│  └─────────────────────────────────────────┘                │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────────────────────────┐                │
│  │            TALKER (Audio Gen)           │                │
│  │  • Streaming speech synthesis            │                │
│  │  • Code2Wav audio codec                  │                │
│  │  • 16 quantizers, 2048 codebook          │                │
│  └─────────────────────────────────────────┘                │
│           │                                                  │
│           ▼                                                  │
│  OUTPUT: Text + Audio + Vision Understanding                │
│                                                              │
└─────────────────────────────────────────────────────────────┘
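
The expert counts above (128 experts, 8 active per token) describe standard top-k sparse routing: a small router scores every expert for each token, and only the top 8 feed-forward experts actually run. The sketch below illustrates that pattern in PyTorch; hidden sizes, the gating function, and class names are illustrative assumptions, not the actual zen-omni implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (not the real zen-omni code)."""
    def __init__(self, dim=2048, ffn_dim=4096, num_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (num_tokens, dim)
        logits = self.router(x)                            # (num_tokens, num_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                        # only top_k of num_experts run per token
            for e in expert_idx[:, k].unique().tolist():
                mask = expert_idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out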

Capabilities

Multimodal Understanding

  • Text: Understanding and generation across 119 languages
  • Vision: Image analysis, video comprehension, OCR
  • Audio: Speech recognition in 19 languages, audio understanding
  • Cross-Modal: Unified reasoning across all modalities

Speech Synthesis

  • Native audio output in 10 languages
  • Low-latency streaming (< 300ms)
  • Natural prosody and emotion
  • Voice preservation across translations

Translation Pipeline

  • Real-time speech-to-speech translation
  • Preserves speaker characteristics
  • Integration with zen-dub for lip synchronization
  • End-to-end dubbing workflow

Thinking Mode

  • Extended reasoning with up to 32K thinking tokens (see the usage sketch after this list)
  • Complex problem solving
  • Math and code reasoning
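
The sketch below shows one way to run the thinking variant. It assumes zen-omni-thinking exposes the same processor and chat-template interface as the base model (shown in Quick Start below) and simply needs a larger generation budget for the thinking tokens; the exact format separating thinking from the final answer is not specified here.

# Assumes the thinking variant shares the base model's chat interface; adjust as needed.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "zenlm/zen-omni-thinking"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "A train departs at 9:40 and arrives at 13:05. How long is the trip?"}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)

# Leave generation headroom for the extended reasoning before the final answer.
outputs = model.generate(**inputs, max_new_tokens=4096)
print(processor.decode(outputs[0], skip_special_tokens=True))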

Quick Start

Installation

pip install transformers torch soundfile

Basic Usage

from transformers import AutoModelForCausalLM, AutoProcessor

# Load model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Text-to-text with thinking
messages = [
    {"role": "system", "content": "You are Zen, a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)

Multimodal Input (Image + Audio + Text)

from PIL import Image
import librosa

# Load multimodal inputs
image = Image.open("path/to/image.jpg")
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process multimodal message
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Describe this image and transcribe the audio."}
    ]}
]

inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0], skip_special_tokens=True)

Speech-to-Speech Translation

import librosa
import soundfile as sf

# Load source audio
source_audio, sr = librosa.load("japanese_speech.wav", sr=16000)

# Translate and generate English speech
messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": source_audio},
        {"type": "text", "text": "Translate this Japanese speech to English and speak the translation."}
    ]}
]

inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    return_audio=True
)

# Save translated audio
translated_audio = outputs.audio[0]
sf.write("english_translation.wav", translated_audio, 24000)

MLX (Apple Silicon)

# 4-bit quantized for M1/M2/M3
python3 -m mlx_lm.generate --model ./mlx/q4 --prompt "Hello"
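
For programmatic use, the same quantized weights can be loaded through the mlx-lm Python API. A minimal, text-only sketch; the ./mlx/q4 path is taken from the command above, and audio/vision inputs are outside mlx-lm's text generation interface.

# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("./mlx/q4")  # path from the CLI example above
text = generate(model, tokenizer, prompt="Explain the Thinker-Talker design in one paragraph.", max_tokens=256)
print(text)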

GGUF (llama.cpp / LM Studio)

# Load in LM Studio or llama.cpp
./llama-cli -m ./gguf/zen-omni-30b-q4_k_m.gguf -p "Hello"
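
The GGUF file can also be driven from Python through llama-cpp-python. A minimal sketch assuming the quantized file from the command above; like the CLI, this path covers text generation only.

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./gguf/zen-omni-30b-q4_k_m.gguf",
    n_ctx=32768,       # matches the model's context length
    n_gpu_layers=-1,   # offload all layers when a GPU/Metal backend is available
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the Zen Omni architecture."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])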

Model Files & Formats

| Format | Size | RAM | Use Case |
|---|---|---|---|
| SafeTensors (BF16) | ~60GB | 80GB+ | Training, full precision |
| MLX 4-bit | ~15GB | 20GB | Apple Silicon (M1/M2/M3) |
| MLX 8-bit | ~30GB | 32GB | Apple Silicon (higher quality) |
| GGUF Q4_K_M | ~15GB | 20GB | llama.cpp, LM Studio |

Performance (Apple Silicon)

  • M1/M2/M3: 10-20 tokens/sec
  • RAM Required: 20-24GB minimum
  • Recommended: M2 Pro/Max or M3 with 32GB+ RAM

Integration with Zen Dub

Zen Omni integrates with zen-dub for complete video dubbing:

from zen_omni import ZenOmniTranslator
from zen_dub import ZenDubPipeline

# Initialize components
translator = ZenOmniTranslator("zenlm/zen-omni")
lip_sync = ZenDubPipeline("zenlm/zen-dub")

# Full dubbing pipeline
def dub_video(video_path, target_language="en"):
    # 1. Extract the audio track and video frames from the source video
    #    (extract_video is a user-supplied helper, e.g. built on ffmpeg)
    audio, frames = extract_video(video_path)

    # 2. Translate speech with Zen Omni
    translated_audio = translator.translate_speech(
        audio,
        target_language=target_language,
        preserve_prosody=True
    )

    # 3. Generate lip-synced video with Zen Dub
    dubbed_video = lip_sync.generate(
        frames=frames,
        audio=translated_audio,
        fps=30
    )

    return dubbed_video

# Run pipeline
result = dub_video("input_japanese.mp4", target_language="en")
result.save("output_english_dubbed.mp4")

Training

Fine-tuned from the Zen Omni 30B MoE base with:

  • Multimodal instruction tuning
  • Cross-modal alignment
  • Zen AI identity training (LoRA)

Training configuration: training/zen_identity_sft.yaml

Identity Training with ms-swift

# Install ms-swift
pip install ms-swift

# Fine-tune with Zen identity
swift sft \
    --model_type omni-30b-a3b \
    --model_id_or_path zenlm/zen-omni \
    --dataset zen_identity \
    --output_dir ./zen-omni-finetuned \
    --lora_rank 64 \
    --lora_alpha 128 \
    --max_steps 1000 \
    --learning_rate 1e-4
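
After training, the resulting LoRA adapter can be attached to the base model for a quick sanity check. A minimal sketch assuming the adapter is saved in standard PEFT format; the checkpoint path is a placeholder for whatever ms-swift writes under --output_dir.

# Sketch: attach the fine-tuned LoRA adapter to the base model (path is hypothetical).
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("zenlm/zen-omni", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "./zen-omni-finetuned/checkpoint-1000")
processor = AutoProcessor.from_pretrained("zenlm/zen-omni")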

Cookbooks & Examples

See the cookbooks/ directory for Jupyter notebooks:

  • omni_captioner.ipynb - Audio/visual captioning
  • audio_visual_dialogue.ipynb - Multimodal conversations
  • speech_recognition.ipynb - Speech-to-text
  • image_question.ipynb - Visual Q&A
  • video_description.ipynb - Video understanding

Web Demos

# Full multimodal demo
python web_demo.py --checkpoint-path zenlm/zen-omni --flash-attn2

# Audio captioner
python web_demo_captioner.py --checkpoint-path zenlm/zen-omni --flash-attn2

Performance Benchmarks

| Benchmark | Zen Omni | Notes |
|---|---|---|
| Speech Translation (BLEU) | 42.3 | En↔Ja bidirectional |
| Image Understanding (VQA) | 78.2% | Visual question answering |
| Audio Transcription (WER) | 4.2% | English ASR |
| Cross-Modal Reasoning | 85.1% | MMLU multimodal |

Why Zen LM?

  • Ultra-Efficient - 3B active parameters via MoE
  • Truly Private - 100% local processing, no cloud required
  • Environmentally Responsible - 95% less energy than cloud AI
  • Free Forever - Apache 2.0 licensed

Citation

@misc{zen-omni-2024,
  title={Zen Omni: Hypermodal Language Model for Translation and Audio Generation},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://huggingface.co/zenlm/zen-omni}
}

License

Apache 2.0 • No data collection • Privacy-first

Abliteration

Zen Omni 30B weights are derived from an abliterated base model. Abliteration removes refusal behavior by identifying and nullifying the "refusal direction" in the model's residual stream, as described in Maxime Labonne's methodology.

  • Method: Directional ablation on the residual stream across all layers
  • Source: Zen MoDE (Mixture of Distilled Experts) architecture base
  • Effect: Removes refusal behaviors while preserving all other capabilities
  • Identity layer: Zen identity added via system prompt; full LoRA fine-tuning planned

Abliteration is a feature, not a limitation. It enables unrestricted research, security testing, and applications where safety guardrails are managed at the application layer rather than baked into model weights.
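
For readers unfamiliar with the technique, the sketch below shows the core step of directional ablation: estimate a refusal direction from the difference of mean residual-stream activations on refusal-inducing versus benign prompts, then project that direction out of the matrices that write into the residual stream. It is a simplified illustration of the published methodology, not the exact procedure used to produce these weights.

import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference-of-means over prompts at a chosen layer, normalized to unit length.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component that writes along the refusal direction:
    # W <- (I - r r^T) W, applied to every matrix whose output feeds the residual stream.
    r = direction.to(weight.dtype)
    return weight - torch.outer(r, r) @ weight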
