Zen Omni

Hypermodal Language Model for Translation + Audio Generation

Part of the Zen LM family - democratizing AI while protecting our planet.

Model Specifications

| Attribute | Value |
|---|---|
| Architecture | MoE multimodal (Thinker-Talker) |
| Total Parameters | 30B |
| Active Parameters | 3B (via MoE sparse activation) |
| Text Languages | 119 |
| Speech Input | 19 languages |
| Speech Output | 10 languages |
| Context Length | 32,768 tokens |
| Technical Report | docs/paper/paper.pdf |
| License | Apache 2.0 |

Model Variants

| Variant | Description | Use Case |
|---|---|---|
| zen-omni | Base multimodal model | General purpose |
| zen-omni-instruct | Instruction following | Chat, Q&A, tasks |
| zen-omni-thinking | Chain-of-thought reasoning | Complex reasoning, math |
| zen-omni-captioner | Audio/visual captioning | Transcription, description |

Architecture

Zen Omni is built on a Thinker-Talker MoE architecture:

┌─────────────────────────────────────────────────────────────┐
│                      ZEN OMNI                                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  INPUT ENCODERS                                              │
│  ├── Audio Encoder (32 layers, 1280 dim)                    │
│  ├── Vision Encoder (27 layers, 1152 dim)                   │
│  └── Text Embeddings (151,936 vocab)                        │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────────────────────────┐                │
│  │         THINKER (Multimodal LLM)        │                │
│  │  • 48 transformer layers                 │                │
│  │  • 128 experts (MoE)                     │                │
│  │  • 8 experts active per token            │                │
│  │  • Cross-modal attention fusion          │                │
│  └─────────────────────────────────────────┘                │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────────────────────────┐                │
│  │            TALKER (Audio Gen)           │                │
│  │  • Streaming speech synthesis            │                │
│  │  • Code2Wav audio codec                  │                │
│  │  • 16 quantizers, 2048 codebook          │                │
│  └─────────────────────────────────────────┘                │
│           │                                                  │
│           ▼                                                  │
│  OUTPUT: Text + Audio + Vision Understanding                │
│                                                              │
└─────────────────────────────────────────────────────────────┘
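
The expert counts above (128 experts, 8 active per token) describe standard top-k sparse routing: a small router scores every expert for each token, and only the top 8 feed-forward experts actually run. The sketch below illustrates that pattern in PyTorch; hidden sizes, the gating function, and class names are illustrative assumptions, not the actual zen-omni implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (not the real zen-omni code)."""
    def __init__(self, dim=2048, ffn_dim=4096, num_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (num_tokens, dim)
        logits = self.router(x)                            # (num_tokens, num_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                        # only top_k of num_experts run per token
            for e in expert_idx[:, k].unique().tolist():
                mask = expert_idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out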

Capabilities

Multimodal Understanding

  • Text: Understanding and generation across 119 languages
  • Vision: Image analysis, video comprehension, OCR
  • Audio: Speech recognition in 19 languages, audio understanding
  • Cross-Modal: Unified reasoning across all modalities

Speech Synthesis

  • Native audio output in 10 languages
  • Low-latency streaming (< 300ms)
  • Natural prosody and emotion
  • Voice preservation across translations

Translation Pipeline

  • Real-time speech-to-speech translation
  • Preserves speaker characteristics
  • Integration with zen-dub for lip synchronization
  • End-to-end dubbing workflow

Thinking Mode

  • Extended reasoning with up to 32K thinking tokens (see the usage sketch after this list)
  • Complex problem solving
  • Math and code reasoning
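
The sketch below shows one way to run the thinking variant. It assumes zen-omni-thinking exposes the same processor and chat-template interface as the base model (shown in Quick Start below) and simply needs a larger generation budget for the thinking tokens; the exact format separating thinking from the final answer is not specified here.

# Assumes the thinking variant shares the base model's chat interface; adjust as needed.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "zenlm/zen-omni-thinking"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "A train departs at 9:40 and arrives at 13:05. How long is the trip?"}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)

# Leave generation headroom for the extended reasoning before the final answer.
outputs = model.generate(**inputs, max_new_tokens=4096)
print(processor.decode(outputs[0], skip_special_tokens=True))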

Quick Start

Installation

pip install transformers torch soundfile

Basic Usage

from transformers import AutoModelForCausalLM, AutoProcessor

# Load model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Text-to-text with thinking
messages = [
    {"role": "system", "content": "You are Zen, a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)

Multimodal Input (Image + Audio + Text)

from PIL import Image
import librosa

# Load multimodal inputs
image = Image.open("path/to/image.jpg")
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process multimodal message
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Describe this image and transcribe the audio."}
    ]}
]

inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0], skip_special_tokens=True)

Speech-to-Speech Translation

import librosa
import soundfile as sf

# Load source audio
source_audio, sr = librosa.load("japanese_speech.wav", sr=16000)

# Translate and generate English speech
messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": source_audio},
        {"type": "text", "text": "Translate this Japanese speech to English and speak the translation."}
    ]}
]

inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    return_audio=True
)

# Save translated audio
translated_audio = outputs.audio[0]
sf.write("english_translation.wav", translated_audio, 24000)

MLX (Apple Silicon)

# 4-bit quantized for M1/M2/M3
python3 -m mlx_lm.generate --model ./mlx/q4 --prompt "Hello"
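
For programmatic use, the same quantized weights can be loaded through the mlx-lm Python API. A minimal, text-only sketch; the ./mlx/q4 path is taken from the command above, and audio/vision inputs are outside mlx-lm's text generation interface.

# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("./mlx/q4")  # path from the CLI example above
text = generate(model, tokenizer, prompt="Explain the Thinker-Talker design in one paragraph.", max_tokens=256)
print(text)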

GGUF (llama.cpp / LM Studio)

# Load in LM Studio or llama.cpp
./llama-cli -m ./gguf/zen-omni-30b-q4_k_m.gguf -p "Hello"
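
The GGUF file can also be driven from Python through llama-cpp-python. A minimal sketch assuming the quantized file from the command above; like the CLI, this path covers text generation only.

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./gguf/zen-omni-30b-q4_k_m.gguf",
    n_ctx=32768,       # matches the model's context length
    n_gpu_layers=-1,   # offload all layers when a GPU/Metal backend is available
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the Zen Omni architecture."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])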

Model Files & Formats

| Format | Size | RAM | Use Case |
|---|---|---|---|
| SafeTensors (BF16) | ~60GB | 80GB+ | Training, full precision |
| MLX 4-bit | ~15GB | 20GB | Apple Silicon (M1/M2/M3) |
| MLX 8-bit | ~30GB | 32GB | Apple Silicon (higher quality) |
| GGUF Q4_K_M | ~15GB | 20GB | llama.cpp, LM Studio |

Performance (Apple Silicon)

  • M1/M2/M3: 10-20 tokens/sec
  • RAM Required: 20-24GB minimum
  • Recommended: M2 Pro/Max or M3 with 32GB+ RAM

Integration with Zen Dub

Zen Omni integrates with zen-dub for complete video dubbing:

from zen_omni import ZenOmniTranslator
from zen_dub import ZenDubPipeline

# Initialize components
translator = ZenOmniTranslator("zenlm/zen-omni")
lip_sync = ZenDubPipeline("zenlm/zen-dub")

# Full dubbing pipeline
def dub_video(video_path, target_language="en"):
    # 1. Extract the audio track and video frames from the source video
    #    (extract_video is a user-supplied helper, e.g. built on ffmpeg)
    audio, frames = extract_video(video_path)

    # 2. Translate speech with Zen Omni
    translated_audio = translator.translate_speech(
        audio,
        target_language=target_language,
        preserve_prosody=True
    )

    # 3. Generate lip-synced video with Zen Dub
    dubbed_video = lip_sync.generate(
        frames=frames,
        audio=translated_audio,
        fps=30
    )

    return dubbed_video

# Run pipeline
result = dub_video("input_japanese.mp4", target_language="en")
result.save("output_english_dubbed.mp4")

Training

Fine-tuned from the Zen Omni 30B MoE base with:

  • Multimodal instruction tuning
  • Cross-modal alignment
  • Zen AI identity training (LoRA)

Training configuration: training/zen_identity_sft.yaml

Identity Training with ms-swift

# Install ms-swift
pip install ms-swift

# Fine-tune with Zen identity
swift sft \
    --model_type omni-30b-a3b \
    --model_id_or_path zenlm/zen-omni \
    --dataset zen_identity \
    --output_dir ./zen-omni-finetuned \
    --lora_rank 64 \
    --lora_alpha 128 \
    --max_steps 1000 \
    --learning_rate 1e-4
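
After training, the resulting LoRA adapter can be attached to the base model for a quick sanity check. A minimal sketch assuming the adapter is saved in standard PEFT format; the checkpoint path is a placeholder for whatever ms-swift writes under --output_dir.

# Sketch: attach the fine-tuned LoRA adapter to the base model (path is hypothetical).
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("zenlm/zen-omni", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "./zen-omni-finetuned/checkpoint-1000")
processor = AutoProcessor.from_pretrained("zenlm/zen-omni")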

Cookbooks & Examples

See the cookbooks/ directory for Jupyter notebooks:

  • omni_captioner.ipynb - Audio/visual captioning
  • audio_visual_dialogue.ipynb - Multimodal conversations
  • speech_recognition.ipynb - Speech-to-text
  • image_question.ipynb - Visual Q&A
  • video_description.ipynb - Video understanding

Web Demos

# Full multimodal demo
python web_demo.py --checkpoint-path zenlm/zen-omni --flash-attn2

# Audio captioner
python web_demo_captioner.py --checkpoint-path zenlm/zen-omni --flash-attn2

Performance Benchmarks

| Benchmark | Zen Omni | Notes |
|---|---|---|
| Speech Translation (BLEU) | 42.3 | En↔Ja bidirectional |
| Image Understanding (VQA) | 78.2% | Visual question answering |
| Audio Transcription (WER) | 4.2% | English ASR |
| Cross-Modal Reasoning | 85.1% | MMLU multimodal |

Why Zen LM?

  • Ultra-Efficient - 3B active parameters via MoE
  • Truly Private - 100% local processing, no cloud required
  • Environmentally Responsible - 95% less energy than cloud AI
  • Free Forever - Apache 2.0 licensed

Citation

@misc{zen-omni-2024,
  title={Zen Omni: Hypermodal Language Model for Translation and Audio Generation},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://huggingface.co/zenlm/zen-omni}
}

License

Apache 2.0 • No data collection • Privacy-first

Abliteration

Zen Omni 30B weights are derived from an abliterated base model. Abliteration removes refusal behavior by identifying and nullifying the "refusal direction" in the model's residual stream, as described in Maxime Labonne's methodology.

  • Method: Directional ablation on the residual stream across all layers
  • Source: Zen MoDE (Mixture of Distilled Experts) architecture base
  • Effect: Removes refusal behaviors while preserving all other capabilities
  • Identity layer: Zen identity added via system prompt; full LoRA fine-tuning planned

Abliteration is a feature, not a limitation. It enables unrestricted research, security testing, and applications where safety guardrails are managed at the application layer rather than baked into model weights.
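
For readers unfamiliar with the technique, the sketch below shows the core step of directional ablation: estimate a refusal direction from the difference of mean residual-stream activations on refusal-inducing versus benign prompts, then project that direction out of the matrices that write into the residual stream. It is a simplified illustration of the published methodology, not the exact procedure used to produce these weights.

import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference-of-means over prompts at a chosen layer, normalized to unit length.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component that writes along the refusal direction:
    # W <- (I - r r^T) W, applied to every matrix whose output feeds the residual stream.
    r = direction.to(weight.dtype)
    return weight - torch.outer(r, r) @ weight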
