Zen Omni
Hypermodal Language Model for Translation + Audio Generation
Part of the Zen LM family - democratizing AI while protecting our planet.
Model Specifications
| Attribute | Value |
|---|---|
| Architecture | MoE multimodal (Thinker-Talker) |
| Total Parameters | 30B |
| Active Parameters | 3B (via MoE sparse activation) |
| Text Languages | 119 languages |
| Speech Input | 19 languages |
| Speech Output | 10 languages |
| Context Length | 32,768 tokens |
| Technical Report | docs/paper/paper.pdf |
| License | Apache 2.0 |
Model Variants
| Variant | Description | Use Case |
|---|---|---|
| zen-omni | Base multimodal model | General purpose |
| zen-omni-instruct | Instruction-following | Chat, Q&A, tasks |
| zen-omni-thinking | Chain-of-thought reasoning | Complex reasoning, math |
| zen-omni-captioner | Audio/visual captioning | Transcription, description |
Architecture
Zen Omni is built on a Thinker-Talker MoE architecture:
┌─────────────────────────────────────────────────────────────┐
│ ZEN OMNI │
├─────────────────────────────────────────────────────────────┤
│ │
│ INPUT ENCODERS │
│ ├── Audio Encoder (32 layers, 1280 dim) │
│ ├── Vision Encoder (27 layers, 1152 dim) │
│ └── Text Embeddings (151,936 vocab) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ THINKER (Multimodal LLM) │ │
│ │ • 48 transformer layers │ │
│ │ • 128 experts (MoE) │ │
│ │ • 8 experts active per token │ │
│ │ • Cross-modal attention fusion │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ TALKER (Audio Gen) │ │
│ │ • Streaming speech synthesis │ │
│ │ • Code2Wav audio codec │ │
│ │ • 16 quantizers, 2048 codebook │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ OUTPUT: Text + Audio + Vision Understanding │
│ │
└─────────────────────────────────────────────────────────────┘
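The "3B active of 30B total" figure comes from the Thinker's sparse MoE routing: of 128 experts per layer, only 8 are selected per token. A toy NumPy sketch of top-k expert routing (illustrative only; the real router is a learned projection inside the model weights):

```python
import numpy as np

def moe_route(hidden, gate_weights, num_active=8):
    """Toy top-k expert routing as used in sparse MoE layers.

    hidden: (dim,) token hidden state
    gate_weights: (num_experts, dim) router projection
    Returns indices of the selected experts and their softmax mixing weights.
    """
    logits = gate_weights @ hidden              # one score per expert
    top = np.argsort(logits)[-num_active:]      # keep the k highest-scoring experts
    scores = np.exp(logits[top] - logits[top].max())
    return top, scores / scores.sum()           # normalized mixing weights

# 128 experts, 8 active per token, as in the Thinker above
rng = np.random.default_rng(0)
experts, weights = moe_route(rng.normal(size=256), rng.normal(size=(128, 256)))
print(len(experts), round(float(weights.sum()), 6))  # 8 experts, weights sum to 1
```

Only the selected experts' FFNs run for a given token, which is why inference cost tracks the 3B active parameters rather than the 30B total.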
Capabilities
Multimodal Understanding
- Text: 119 language understanding and generation
- Vision: Image analysis, video comprehension, OCR
- Audio: Speech recognition in 19 languages, audio understanding
- Cross-Modal: Unified reasoning across all modalities
Speech Synthesis
- Native audio output in 10 languages
- Low-latency streaming (< 300ms)
- Natural prosody and emotion
- Voice preservation across translations
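The Talker's Code2Wav codec is described above as "16 quantizers, 2048 codebook", which matches a residual vector quantizer (an assumption based on those figures; the actual Code2Wav design may differ). A toy sketch of the RVQ encode step, where each quantizer encodes what the previous ones missed:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Toy residual vector quantization of one audio frame.

    codebooks: (num_quantizers, codebook_size, dim) array.
    Each stage picks the nearest code to the remaining residual.
    """
    residual, codes = frame.copy(), []
    for book in codebooks:
        idx = int(np.argmin(((book - residual) ** 2).sum(axis=1)))  # nearest code
        codes.append(idx)
        residual -= book[idx]                                       # quantize the remainder
    return codes

rng = np.random.default_rng(0)
books = rng.normal(size=(16, 2048, 128))   # 16 quantizers, 2048-entry codebooks
codes = rvq_encode(rng.normal(size=128), books)
print(len(codes))  # one code index per quantizer: 16
```

Each frame thus compresses to 16 small integers, which is what makes low-latency streaming synthesis practical.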
Translation Pipeline
- Real-time speech-to-speech translation
- Preserves speaker characteristics
- Integration with zen-dub for lip synchronization
- End-to-end dubbing workflow
Thinking Mode
- Extended reasoning (up to 32K thinking tokens)
- Complex problem solving
- Math and code reasoning
Quick Start
Installation
pip install transformers torch soundfile librosa
Basic Usage
from transformers import AutoModelForCausalLM, AutoProcessor
# Load model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Text-to-text with thinking
messages = [
    {"role": "system", "content": "You are Zen, a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
Multimodal Input (Image + Audio + Text)
from PIL import Image
import librosa
# Load multimodal inputs
image = Image.open("path/to/image.jpg")
audio, sr = librosa.load("path/to/audio.wav", sr=16000)
# Process multimodal message
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Describe this image and transcribe the audio."}
    ]}
]
inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0], skip_special_tokens=True)
Speech-to-Speech Translation
import librosa
import soundfile as sf

# Load source audio
source_audio, sr = librosa.load("japanese_speech.wav", sr=16000)
# Translate and generate English speech
messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": source_audio},
        {"type": "text", "text": "Translate this Japanese speech to English and speak the translation."}
    ]}
]
inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    return_audio=True
)
# Save translated audio
translated_audio = outputs.audio[0]
sf.write("english_translation.wav", translated_audio, 24000)
MLX (Apple Silicon)
# 4-bit quantized for M1/M2/M3
python3 -m mlx_lm.generate --model ./mlx/q4 --prompt "Hello"
GGUF (llama.cpp / LM Studio)
# Load in LM Studio or llama.cpp
./llama-cli -m ./gguf/zen-omni-30b-q4_k_m.gguf -p "Hello"
Model Files & Formats
| Format | Size | RAM | Use Case |
|---|---|---|---|
| SafeTensors (BF16) | ~60GB | 80GB+ | Training, full precision |
| MLX 4-bit | ~15GB | 20GB | Apple Silicon (M1/M2/M3) |
| MLX 8-bit | ~30GB | 32GB | Apple Silicon (higher quality) |
| GGUF Q4_K_M | ~15GB | 20GB | llama.cpp, LM Studio |
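The sizes in the table follow directly from bytes per parameter; a quick sanity check (ignoring small quantization metadata overhead):

```python
PARAMS = 30e9  # total parameters (30B)

def model_size_gb(bits_per_param):
    """Approximate on-disk size: parameters * bits / 8."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"BF16:  ~{model_size_gb(16):.0f} GB")  # ~60 GB
print(f"8-bit: ~{model_size_gb(8):.0f} GB")   # ~30 GB
print(f"4-bit: ~{model_size_gb(4):.0f} GB")   # ~15 GB
```

RAM requirements run a few GB above file size to cover activations and the KV cache, consistent with the table's 20GB figure for the 4-bit variants.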
Performance (Apple Silicon)
- M1/M2/M3: 10-20 tokens/sec
- RAM Required: 20-24GB minimum
- Recommended: M2 Pro/Max or M3 with 32GB+ RAM
Integration with Zen Dub
Zen Omni integrates with zen-dub for complete video dubbing:
from zen_omni import ZenOmniTranslator
from zen_dub import ZenDubPipeline
# Initialize components
translator = ZenOmniTranslator("zenlm/zen-omni")
lip_sync = ZenDubPipeline("zenlm/zen-dub")
# Full dubbing pipeline
def dub_video(video_path, target_language="en"):
    # 1. Extract audio and frames from the video
    #    (extract_video is a user-supplied helper, e.g. built on ffmpeg)
    audio, frames = extract_video(video_path)

    # 2. Translate speech with Zen Omni
    translated_audio = translator.translate_speech(
        audio,
        target_language=target_language,
        preserve_prosody=True
    )

    # 3. Generate lip-synced video with Zen Dub
    dubbed_video = lip_sync.generate(
        frames=frames,
        audio=translated_audio,
        fps=30
    )
    return dubbed_video
# Run pipeline
result = dub_video("input_japanese.mp4", target_language="en")
result.save("output_english_dubbed.mp4")
Training
Fine-tuned from the Zen Omni 30B MoE base with:
- Multimodal instruction tuning
- Cross-modal alignment
- Zen AI identity training (LoRA)
Training configuration: training/zen_identity_sft.yaml
Identity Training with ms-swift
# Install ms-swift
pip install ms-swift
# Fine-tune with Zen identity
swift sft \
    --model_type omni-30b-a3b \
    --model_id_or_path zenlm/zen-omni \
    --dataset zen_identity \
    --output_dir ./zen-omni-finetuned \
    --lora_rank 64 \
    --lora_alpha 128 \
    --max_steps 1000 \
    --learning_rate 1e-4
Cookbooks & Examples
See the cookbooks/ directory for Jupyter notebooks:
- omni_captioner.ipynb - Audio/visual captioning
- audio_visual_dialogue.ipynb - Multimodal conversations
- speech_recognition.ipynb - Speech-to-text
- image_question.ipynb - Visual Q&A
- video_description.ipynb - Video understanding
Web Demos
# Full multimodal demo
python web_demo.py --checkpoint-path zenlm/zen-omni --flash-attn2
# Audio captioner
python web_demo_captioner.py --checkpoint-path zenlm/zen-omni --flash-attn2
Performance Benchmarks
| Benchmark | Zen Omni | Notes |
|---|---|---|
| Speech Translation (BLEU) | 42.3 | En↔Ja bidirectional |
| Image Understanding (VQA) | 78.2% | Visual question answering |
| Audio Transcription (WER) | 4.2% | English ASR |
| Cross-Modal Reasoning | 85.1% | MMLU multimodal |
Why Zen LM?
- Ultra-Efficient - 3B active parameters via MoE
- Truly Private - 100% local processing, no cloud required
- Environmentally Responsible - 95% less energy than cloud AI
- Free Forever - Apache 2.0 licensed
Organizations
- Hanzo AI Inc - Techstars '17 • Award-winning GenAI lab
- Zoo Labs Foundation - 501(c)(3) Non-Profit
Citation
@misc{zen-omni-2024,
  title={Zen Omni: Hypermodal Language Model for Translation and Audio Generation},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://huggingface.co/zenlm/zen-omni}
}
License
Apache 2.0 • No data collection • Privacy-first
Abliteration
Zen Omni 30B weights are derived from an abliterated base model. Abliteration removes refusal behavior by identifying and nullifying the "refusal direction" in the model's residual stream, as described in Maxime Labonne's methodology.
- Method: Directional ablation on the residual stream across all layers
- Source: Zen MoDE (Mixture of Distilled Experts) architecture base
- Effect: Removes refusal behaviors while preserving all other capabilities
- Identity layer: Zen identity added via system prompt; full LoRA fine-tuning planned
Abliteration is a feature, not a limitation. It enables unrestricted research, security testing, and applications where safety guardrails are managed at the application layer rather than baked into model weights.
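The core operation is a projection: every residual-stream activation has its component along the refusal direction removed. A minimal NumPy sketch (illustrative only; Labonne's method estimates the refusal direction from contrastive harmful/harmless prompt pairs and bakes the ablation into the weights rather than applying it at runtime):

```python
import numpy as np

def ablate_direction(activations, refusal_dir):
    """Project out the refusal direction from residual-stream activations.

    a <- a - (a . r_hat) r_hat, applied at every layer in the full method.
    activations: (num_tokens, dim), refusal_dir: (dim,)
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return activations - np.outer(activations @ r, r)

acts = np.random.default_rng(1).normal(size=(4, 64))    # 4 tokens, 64-dim stream
direction = np.random.default_rng(2).normal(size=64)    # stand-in refusal direction
out = ablate_direction(acts, direction)

# The ablated activations retain everything except their refusal component
r_hat = direction / np.linalg.norm(direction)
print(float(np.abs(out @ r_hat).max()))  # numerically zero
```

Because only one direction out of the full hidden dimension is removed, the rest of the representation, and hence the model's other capabilities, is left intact.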