🚀 Xoron-Dev: State-of-the-Art Multimodal MoE

Xoron-Dev is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a Mixture of Experts (MoE) backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.

🌟 Model Highlights

  • Architecture: Mixture of Experts (8 experts + 1 shared expert, top-2 routing) with Ring Attention and aux-loss-free routing.
  • Multi-Scale Training (NEW): each batch randomly selects a scale, with images at 128-512px, videos at 128-384px, and 8-32 frames per clip.
  • Vision Encoder: SigLIP-2 (384px native) with TiTok-style 1D tokenization (256 compressed tokens), Dual-Stream Attention (2 layers), and 2D-RoPE for images; 3D-RoPE + Temporal MoE (4 experts) for video (8-32 frames).
  • Image Generation: MoE-DiT (Diffusion Transformer with 4 MoE experts) using Flow Matching, 2D-RoPE, and Symmetric Dual-Stream Attention (SD3/Flux-style). Multi-scale output: 256-512px, 50 inference steps.
  • Video Generation: 3D Causal Transformers (4 layers) with Flow Matching, 3D-RoPE for (x,y,t) positions, and Temporal Expert Routing (4 experts). Multi-scale: 8-32 frames @ 128-384px.
  • Audio (Speech-to-Speech): Conformer encoder with RMLA and Raw Waveform Tokenizer for ASR; Direct waveform decoder (no vocoder needed!) with MAS for TTS; Zero-Shot Speaker Cloning with In-Context Audio Prompting. Talk to it, and it talks back!
  • Agentic: Trained for tool calling, file operations, and code execution with uncertainty estimation.
  • Context: Efficient 128K context using Ring Attention (4096 chunk size).
  • Fine-tuning: LoRA variants including rsLoRA, DoRA, and LoRA+ (r=32, α=64, 4x learning rate on the B matrix); see the LoRA+ sketch after this list.
  • Multimodal Fusion: Cross-Attention layers (4 layers, 8 heads) + Perceiver Resampler for vision projection.
  • Performance: Flash Attention support with FP16-native numerical stability.
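
For illustration, the LoRA+ detail above boils down to optimizer configuration: the adapter's B matrix is trained with a larger learning rate than A. Below is a minimal, hypothetical sketch (module and variable names are ours, not Xoron-Dev's training code) using r=32, α=64 and a 4x learning-rate multiplier on B.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a rank-r low-rank update (hypothetical helper)."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base weights stay frozen
        self.scaling = alpha / r                         # alpha / r = 2.0 here
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024))

# LoRA+: the B matrix gets a larger learning rate than A (4x here, matching the card).
base_lr = 1e-4
optimizer = torch.optim.AdamW([
    {"params": [layer.lora_A], "lr": base_lr},
    {"params": [layer.lora_B], "lr": 4 * base_lr},
])
print([g["lr"] for g in optimizer.param_groups])   # [0.0001, 0.0004]
```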

🔬 Architecture Deep Dive

🧠 LLM Backbone (MoE)

| Component | Specification |
|---|---|
| Hidden Size | 1024 |
| Layers | 12 |
| Attention Heads | 16 |
| MoE Experts | 8 + 1 shared (DeepSeek-style isolation) |
| Experts per Token | 2 (top-2 routing) |
| MoE Layer Frequency | Every 2 layers |
| Routing | Aux-loss-free MoE routing |
| Context Length | 128K positions |
| Attention | Ring Attention (4096-token chunks) + Flash Attention |
| Tokenizer | Qwen2.5 (151,643-token vocabulary) |
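
For intuition on the routing rows above, here is a hedged sketch of DeepSeek-style shared-expert isolation with top-2 routing: every token always flows through the shared expert, while the router selects 2 of the 8 routed experts per token. Class and tensor names are illustrative, not the model's actual implementation, and the load-balancing machinery is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Top-2 routing over 8 experts plus an always-on shared expert (illustrative)."""
    def __init__(self, hidden: int = 1024, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)           # router score per expert
        weights, idx = probs.topk(self.top_k, dim=-1)       # keep the top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens whose k-th choice is expert e
                if mask.any():
                    routed[mask] += weights[mask, k:k + 1] * expert(x[mask])
        # The shared expert sees every token, independent of the router (DeepSeek-style isolation).
        return self.shared_expert(x) + routed

tokens = torch.randn(16, 1024)
print(SharedExpertMoE()(tokens).shape)   # torch.Size([16, 1024])
```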

👁️ Vision Encoder (SigLIP-2 + SOTA Extensions)

| Feature | Description |
|---|---|
| Base Model | google/siglip-so400m-patch14-384 |
| Input Resolution | 384×384 |
| TiTok Tokenization | 1D tokenization with 256 compressed tokens |
| Dual-Stream Attention | 2 symmetric dual-stream layers |
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |
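
The TiTok-style tokenization and the 64 output tokens amount to compressing the SigLIP patch grid into a short, fixed-length latent sequence. The sketch below shows only that compression step, implemented here as a Perceiver-resampler-style cross-attention from learned queries; the dimensions (729 patches, width 1152) are assumptions based on SigLIP-so400m, and the module is not the model's actual code.

```python
import torch
import torch.nn as nn

class LatentTokenCompressor(nn.Module):
    """Compress N patch features into a fixed number of latent tokens (illustrative)."""
    def __init__(self, dim: int = 1152, n_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):                  # patch_feats: (batch, n_patches, dim)
        b = patch_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Latent queries cross-attend to all patch features; only the latents are kept.
        out, _ = self.attn(q, patch_feats, patch_feats)
        return self.norm(out + q)                    # (batch, 64, dim)

# Assumed: SigLIP-so400m at 384px / patch 14 yields a 27x27 = 729 patch grid, width 1152.
feats = torch.randn(2, 729, 1152)
print(LatentTokenCompressor()(feats).shape)          # torch.Size([2, 64, 1152])
```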

🎬 Video Encoder (3D Causal Transformers)

| Feature | Description |
|---|---|
| Frame Scales | 8, 12, 16, 24, 32 frames (multi-scale) |
| Resolution Scales | 128, 192, 256, 320, 384px (multi-scale) |
| Position Encoding | 3D-RoPE for (x, y, t) coordinates |
| Attention | 3D Causal Self-Attention |
| Expert Routing | Temporal MoE (4 experts, temporally-aware) |
| Encoder Layers | 4 layers |
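
One common way to realize a 3D-RoPE like the one in the table is to split each head's channels into three groups and apply ordinary 1D rotary embeddings per group, driven by the x, y, and t coordinates. The helper below is a hedged sketch of that idea with toy sizes; Xoron-Dev's exact frequency layout is not documented here.

```python
import torch

def rope_1d(x, pos, base: float = 10000.0):
    """Standard rotary embedding over the last dim of x, driven by integer positions."""
    half = x.shape[-1] // 2
    freq = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angle = pos[:, None].float() * freq[None, :]                       # (seq, half)
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(q, xs, ys, ts):
    """Apply RoPE per axis on three equal channel groups of q: (seq, head_dim)."""
    d = q.shape[-1] // 3
    return torch.cat(
        [rope_1d(q[:, :d], xs), rope_1d(q[:, d:2 * d], ys), rope_1d(q[:, 2 * d:], ts)],
        dim=-1,
    )

# Toy example: 8 frames of a 4x4 token grid, head_dim = 48 (16 channels per axis).
t, h, w, head_dim = 8, 4, 4, 48
coords = torch.stack(torch.meshgrid(
    torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"), dim=-1).reshape(-1, 3)
q = torch.randn(t * h * w, head_dim)
q_rot = rope_3d(q, xs=coords[:, 2], ys=coords[:, 1], ts=coords[:, 0])
print(q_rot.shape)   # torch.Size([128, 48])
```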

🎨 Image Generation (MoE-DiT + Flow Matching)

| Feature | Description |
|---|---|
| Architecture | MoE-DiT (Diffusion Transformer with MoE) |
| Scheduler | Flow Matching (not DDPM) |
| Output Resolution | 256-512px (multi-scale: 256, 320, 384, 448, 512) |
| Position Encoding | 2D-RoPE |
| Attention | Symmetric Dual-Stream Attention (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
| Inference Steps | 50 steps |
| Guidance Scale | 7.5 (CFG) |
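
At inference time, flow matching replaces a DDPM noise schedule with an ODE: the model predicts a velocity field that is integrated from Gaussian noise to an image, here with a plain Euler loop and classifier-free guidance at scale 7.5. The snippet assumes a hypothetical `model(x, t, cond)` velocity predictor and illustrates the sampling recipe only, not Xoron-Dev's generation API.

```python
import torch

@torch.no_grad()
def flow_matching_sample(model, cond, uncond, shape, steps: int = 50, cfg_scale: float = 7.5):
    """Euler integration of a rectified-flow ODE from t=0 (noise) to t=1 (data)."""
    x = torch.randn(shape)                               # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v_cond = model(x, t, cond)                       # conditional velocity
        v_uncond = model(x, t, uncond)                   # unconditional velocity
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + dt * v                                   # Euler step along the flow
    return x

# Toy velocity predictor standing in for the MoE-DiT (hypothetical).
def toy_model(x, t, cond):
    return cond - x          # straight-line flow toward the conditioning tensor

target = torch.ones(1, 4, 32, 32)
sample = flow_matching_sample(toy_model, cond=target, uncond=torch.zeros_like(target),
                              shape=(1, 4, 32, 32))
print(sample.shape)          # torch.Size([1, 4, 32, 32])
```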

📹 Video Generation (3D Causal + Flow Matching)

| Feature | Description |
|---|---|
| Output Resolution | 128-384px (multi-scale: 128, 192, 256, 320, 384) |
| Output Frames | 8-32 frames (multi-scale: 8, 12, 16, 20, 24, 32) |
| Scheduler | Flow Matching |
| Position Encoding | 3D-RoPE for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | Temporal MoE (4 experts) |
| Guidance Scale | 7.5 (CFG) |
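
Factorized spatial-temporal attention usually means alternating two cheaper attentions instead of one full 3D attention: first over the tokens within each frame, then over the same spatial position across frames with a causal mask on the time axis. A minimal sketch of that factorization (toy shapes and modules, not the model's code):

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Spatial attention within each frame, then causal temporal attention across frames."""
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, T, S, D)
        b, t, s, d = x.shape
        # 1) Spatial: attend over the S tokens of each frame independently.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = x + xs.reshape(b, t, s, d)
        # 2) Temporal: attend over the T frames at each spatial location, causally.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # mask future frames
        xt, _ = self.temporal(xt, xt, xt, attn_mask=causal)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x

x = torch.randn(1, 8, 16 * 16, 128)         # 8 frames of a 16x16 latent grid (toy sizes)
print(FactorizedSTAttention()(x).shape)      # torch.Size([1, 8, 256, 128])
```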

📐 Multi-Scale Training Configuration

| Type | Scales | Probabilities |
|---|---|---|
| Image | 128, 192, 256, 320, 384, 448, 512px | 5%, 10%, 30%, 25%, 15%, 10%, 5% |
| Video | 128, 192, 256, 320, 384px | 10%, 20%, 35%, 25%, 10% |
| Frames | 8, 12, 16, 20, 24, 32 | 10%, 15%, 30%, 20%, 15%, 10% |

Multi-scale training is enabled by default with the random strategy: each batch independently samples a scale from the distributions above.
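
In practice, the random strategy just means drawing a scale per batch from the distributions in the table. A minimal sketch of such a sampler (the weights mirror the table; the helper itself is hypothetical):

```python
import random

IMAGE_SCALES = [(128, 0.05), (192, 0.10), (256, 0.30), (320, 0.25),
                (384, 0.15), (448, 0.10), (512, 0.05)]
VIDEO_SCALES = [(128, 0.10), (192, 0.20), (256, 0.35), (320, 0.25), (384, 0.10)]
FRAME_COUNTS = [(8, 0.10), (12, 0.15), (16, 0.30), (20, 0.20), (24, 0.15), (32, 0.10)]

def sample_scale(table):
    """Pick one scale according to the configured probabilities."""
    values, weights = zip(*table)
    return random.choices(values, weights=weights, k=1)[0]

for _ in range(3):  # each training batch gets its own resolution / frame count
    print(sample_scale(IMAGE_SCALES), sample_scale(VIDEO_SCALES), sample_scale(FRAME_COUNTS))
```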

🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)

| Feature | Description |
|---|---|
| Sample Rate | 16kHz |
| Encoder (ASR) | Raw Waveform Tokenizer → Conformer blocks with RMLA |
| Waveform Decoder | BigVGAN-style with Snake activation + MRF (no external vocoder) |
| KV Compression | LoRA-style KV compression (rank 256) |
| Decoder Alignment | MAS (Monotonic Alignment Search) for text-to-audio alignment |
| Voice Cloning | Zero-Shot Speaker Cloning with speaker embedding (256-dim) |
| In-Context Prompting | Enabled for voice cloning from reference audio |
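
The LoRA-style KV compression row means hidden states are squeezed through a rank-256 bottleneck before being expanded into keys and values, so the cache only needs to hold the small latent. A rough sketch of that idea with assumed sizes (not the RMLA implementation itself):

```python
import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    """Cache a rank-256 latent instead of full keys/values (illustrative)."""
    def __init__(self, hidden: int = 1024, rank: int = 256, n_heads: int = 16):
        super().__init__()
        self.head_dim = hidden // n_heads
        self.n_heads = n_heads
        self.down = nn.Linear(hidden, rank, bias=False)    # compress: hidden -> rank
        self.up_k = nn.Linear(rank, hidden, bias=False)    # expand latent -> keys
        self.up_v = nn.Linear(rank, hidden, bias=False)    # expand latent -> values

    def forward(self, x):                                  # x: (B, T, hidden)
        b, t, _ = x.shape
        latent = self.down(x)                              # (B, T, 256) -- this is what gets cached
        k = self.up_k(latent).view(b, t, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(b, t, self.n_heads, self.head_dim)
        return latent, k, v

x = torch.randn(2, 100, 1024)
latent, k, v = CompressedKV()(x)
print(latent.shape, k.shape)   # torch.Size([2, 100, 256]) torch.Size([2, 100, 16, 64])
```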

🔊 Waveform Decoder (SOTA BigVGAN-style)

Direct audio output without external vocoder:

| Feature | Description |
|---|---|
| Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
| Snake Activation | x + sin²(αx)/α (preserves audio periodicity) |
| Multi-Receptive Field Fusion | Parallel residual stacks (kernels 3, 7, 11; dilations 1, 3, 5) |
| Weight Normalization | Stable training, faster convergence |
| Upsampling | 256× total (rates 8, 8, 2, 2) from features to 16kHz audio |
| Streaming | stream_decode() for low-latency real-time output |
| Output Range | [-1, 1] normalized waveform via tanh |
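
The Snake activation in the table has a closed form, x + sin²(αx)/α, with a learnable α per channel; unlike ReLU-style activations it keeps a periodic component, which suits audio waveforms. A minimal module implementing that formula (the per-channel α initialization is an assumption):

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + sin^2(alpha * x) / alpha, with learnable per-channel alpha."""
    def __init__(self, channels: int, alpha_init: float = 1.0, eps: float = 1e-9):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))  # (1, C, 1) for (B, C, T)
        self.eps = eps

    def forward(self, x):                      # x: (batch, channels, time)
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + self.eps)

x = torch.randn(1, 64, 16000)                  # 1 second of 64-channel features at 16kHz
print(Snake(64)(x).shape)                      # torch.Size([1, 64, 16000])
```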

📚 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

🌐 Open Source Datasets

We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:

  • Text & Code: Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
  • Tool Use: Datasets like Function-Calling-ChatML, Synth-APIGen, and Tool-Calls-MultiTurn enable precise tool invocation across single and multi-turn interactions.
  • Vision (Image/Video): Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
  • Generation: Text-to-Image/Video capabilities are fine-tuned on Stable-Diffusion-Prompts, Sora-Likert-Scoring datasets by Rapidata, and WebVid-10M.
  • Audio: Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.

🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets focus on complex behaviors often missing from public corpora:

| Category | Description |
|---|---|
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), provide citations (Synth-Citation), express uncertainty (Synth-Uncertainty), and ground responses (Synth-GroundedResponse). |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution including shell errors, timeouts, and multi-step debugging workflows that teach the model how to recover from errors. |
| Git Operations | Simulated version control tasks including committing, handling diffs, resolving merge conflicts, and repository context understanding. |
| Chain-of-Thought | Explicit Synth-CoT data to encourage internal reasoning before generating final answers. |
| File Operations | Document handling, FIM (Fill-in-the-Middle), and edit operations for precise file manipulation. |