🚀 Xoron-Dev: State-of-the-Art Multimodal MoE

Xoron-Dev is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a Mixture of Experts (MoE) backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.

🌟 Model Highlights

  • Architecture: Mixture of Experts (8 experts + 1 shared expert, top-2 routing) with Ring Attention and aux-loss-free routing.
  • Multi-Scale Training (NEW): each batch randomly selects a scale, with images at 128-512px, videos at 128-384px, and 8-32 frames per clip.
  • Vision Encoder: SigLIP-2 (384px native) with TiTok-style 1D tokenization (256 compressed tokens), Dual-Stream Attention (2 layers), and 2D-RoPE for images; 3D-RoPE + Temporal MoE (4 experts) for video (8-32 frames).
  • Image Generation: MoE-DiT (Diffusion Transformer with 4 MoE experts) using Flow Matching, 2D-RoPE, and Symmetric Dual-Stream Attention (SD3/Flux-style). Multi-scale output: 256-512px, 50 inference steps.
  • Video Generation: 3D Causal Transformers (4 layers) with Flow Matching, 3D-RoPE for (x,y,t) positions, and Temporal Expert Routing (4 experts). Multi-scale: 8-32 frames @ 128-384px.
  • Audio (Speech-to-Speech): Conformer encoder with RMLA and Raw Waveform Tokenizer for ASR; Direct waveform decoder (no vocoder needed!) with MAS for TTS; Zero-Shot Speaker Cloning with In-Context Audio Prompting. Talk to it, and it talks back!
  • Agentic: Trained for tool calling, file operations, and code execution with uncertainty estimation.
  • Context: Efficient 128K context using Ring Attention (4096 chunk size).
  • Fine-tuning: LoRA variants including rsLoRA, DoRA, and LoRA+ (r=32, α=64, 4x learning rate on the B matrix); see the LoRA+ sketch after this list.
  • Multimodal Fusion: Cross-Attention layers (4 layers, 8 heads) + Perceiver Resampler for vision projection.
  • Performance: Flash Attention support with FP16-native numerical stability.
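
For illustration, the LoRA+ detail above boils down to optimizer configuration: the adapter's B matrix is trained with a larger learning rate than A. Below is a minimal, hypothetical sketch (module and variable names are ours, not Xoron-Dev's training code) using r=32, α=64 and a 4x learning-rate multiplier on B.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a rank-r low-rank update (hypothetical helper)."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base weights stay frozen
        self.scaling = alpha / r                         # alpha / r = 2.0 here
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024))

# LoRA+: the B matrix gets a larger learning rate than A (4x here, matching the card).
base_lr = 1e-4
optimizer = torch.optim.AdamW([
    {"params": [layer.lora_A], "lr": base_lr},
    {"params": [layer.lora_B], "lr": 4 * base_lr},
])
print([g["lr"] for g in optimizer.param_groups])   # [0.0001, 0.0004]
```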

🔬 Architecture Deep Dive

🧠 LLM Backbone (MoE)

| Component | Specification |
|---|---|
| Hidden Size | 1024 |
| Layers | 12 |
| Attention Heads | 16 |
| MoE Experts | 8 + 1 shared (DeepSeek-style isolation) |
| Experts per Token | 2 (top-2 routing) |
| MoE Layer Frequency | Every 2 layers |
| Routing | Aux-loss-free MoE routing |
| Context Length | 128K positions |
| Attention | Ring Attention (4096-token chunks) + Flash Attention |
| Tokenizer | Qwen2.5 (151,643-token vocabulary) |
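
For intuition on the routing rows above, here is a hedged sketch of DeepSeek-style shared-expert isolation with top-2 routing: every token always flows through the shared expert, while the router selects 2 of the 8 routed experts per token. Class and tensor names are illustrative, not the model's actual implementation, and the load-balancing machinery is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Top-2 routing over 8 experts plus an always-on shared expert (illustrative)."""
    def __init__(self, hidden: int = 1024, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)           # router score per expert
        weights, idx = probs.topk(self.top_k, dim=-1)       # keep the top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens whose k-th choice is expert e
                if mask.any():
                    routed[mask] += weights[mask, k:k + 1] * expert(x[mask])
        # The shared expert sees every token, independent of the router (DeepSeek-style isolation).
        return self.shared_expert(x) + routed

tokens = torch.randn(16, 1024)
print(SharedExpertMoE()(tokens).shape)   # torch.Size([16, 1024])
```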

👁️ Vision Encoder (SigLIP-2 + SOTA Extensions)

| Feature | Description |
|---|---|
| Base Model | google/siglip-so400m-patch14-384 |
| Input Resolution | 384×384 |
| TiTok Tokenization | 1D tokenization with 256 compressed tokens |
| Dual-Stream Attention | 2 symmetric dual-stream layers |
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |
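
The TiTok-style tokenization and the 64 output tokens amount to compressing the SigLIP patch grid into a short, fixed-length latent sequence. The sketch below shows only that compression step, implemented here as a Perceiver-resampler-style cross-attention from learned queries; the dimensions (729 patches, width 1152) are assumptions based on SigLIP-so400m, and the module is not the model's actual code.

```python
import torch
import torch.nn as nn

class LatentTokenCompressor(nn.Module):
    """Compress N patch features into a fixed number of latent tokens (illustrative)."""
    def __init__(self, dim: int = 1152, n_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):                  # patch_feats: (batch, n_patches, dim)
        b = patch_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Latent queries cross-attend to all patch features; only the latents are kept.
        out, _ = self.attn(q, patch_feats, patch_feats)
        return self.norm(out + q)                    # (batch, 64, dim)

# Assumed: SigLIP-so400m at 384px / patch 14 yields a 27x27 = 729 patch grid, width 1152.
feats = torch.randn(2, 729, 1152)
print(LatentTokenCompressor()(feats).shape)          # torch.Size([2, 64, 1152])
```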

🎬 Video Encoder (3D Causal Transformers)

| Feature | Description |
|---|---|
| Frame Scales | 8, 12, 16, 24, 32 frames (multi-scale) |
| Resolution Scales | 128, 192, 256, 320, 384px (multi-scale) |
| Position Encoding | 3D-RoPE for (x, y, t) coordinates |
| Attention | 3D Causal Self-Attention |
| Expert Routing | Temporal MoE (4 experts, temporally-aware) |
| Encoder Layers | 4 layers |
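
One common way to realize a 3D-RoPE like the one in the table is to split each head's channels into three groups and apply ordinary 1D rotary embeddings per group, driven by the x, y, and t coordinates. The helper below is a hedged sketch of that idea with toy sizes; Xoron-Dev's exact frequency layout is not documented here.

```python
import torch

def rope_1d(x, pos, base: float = 10000.0):
    """Standard rotary embedding over the last dim of x, driven by integer positions."""
    half = x.shape[-1] // 2
    freq = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angle = pos[:, None].float() * freq[None, :]                       # (seq, half)
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(q, xs, ys, ts):
    """Apply RoPE per axis on three equal channel groups of q: (seq, head_dim)."""
    d = q.shape[-1] // 3
    return torch.cat(
        [rope_1d(q[:, :d], xs), rope_1d(q[:, d:2 * d], ys), rope_1d(q[:, 2 * d:], ts)],
        dim=-1,
    )

# Toy example: 8 frames of a 4x4 token grid, head_dim = 48 (16 channels per axis).
t, h, w, head_dim = 8, 4, 4, 48
coords = torch.stack(torch.meshgrid(
    torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"), dim=-1).reshape(-1, 3)
q = torch.randn(t * h * w, head_dim)
q_rot = rope_3d(q, xs=coords[:, 2], ys=coords[:, 1], ts=coords[:, 0])
print(q_rot.shape)   # torch.Size([128, 48])
```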

🎨 Image Generation (MoE-DiT + Flow Matching)

| Feature | Description |
|---|---|
| Architecture | MoE-DiT (Diffusion Transformer with MoE) |
| Scheduler | Flow Matching (not DDPM) |
| Output Resolution | 256-512px (multi-scale: 256, 320, 384, 448, 512) |
| Position Encoding | 2D-RoPE |
| Attention | Symmetric Dual-Stream Attention (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
| Inference Steps | 50 steps |
| Guidance Scale | 7.5 (CFG) |
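
At inference time, flow matching replaces a DDPM noise schedule with an ODE: the model predicts a velocity field that is integrated from Gaussian noise to an image, here with a plain Euler loop and classifier-free guidance at scale 7.5. The snippet assumes a hypothetical `model(x, t, cond)` velocity predictor and illustrates the sampling recipe only, not Xoron-Dev's generation API.

```python
import torch

@torch.no_grad()
def flow_matching_sample(model, cond, uncond, shape, steps: int = 50, cfg_scale: float = 7.5):
    """Euler integration of a rectified-flow ODE from t=0 (noise) to t=1 (data)."""
    x = torch.randn(shape)                               # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v_cond = model(x, t, cond)                       # conditional velocity
        v_uncond = model(x, t, uncond)                   # unconditional velocity
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + dt * v                                   # Euler step along the flow
    return x

# Toy velocity predictor standing in for the MoE-DiT (hypothetical).
def toy_model(x, t, cond):
    return cond - x          # straight-line flow toward the conditioning tensor

target = torch.ones(1, 4, 32, 32)
sample = flow_matching_sample(toy_model, cond=target, uncond=torch.zeros_like(target),
                              shape=(1, 4, 32, 32))
print(sample.shape)          # torch.Size([1, 4, 32, 32])
```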

📹 Video Generation (3D Causal + Flow Matching)

| Feature | Description |
|---|---|
| Output Resolution | 128-384px (multi-scale: 128, 192, 256, 320, 384) |
| Output Frames | 8-32 frames (multi-scale: 8, 12, 16, 20, 24, 32) |
| Scheduler | Flow Matching |
| Position Encoding | 3D-RoPE for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | Temporal MoE (4 experts) |
| Guidance Scale | 7.5 (CFG) |
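
Factorized spatial-temporal attention usually means alternating two cheaper attentions instead of one full 3D attention: first over the tokens within each frame, then over the same spatial position across frames with a causal mask on the time axis. A minimal sketch of that factorization (toy shapes and modules, not the model's code):

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Spatial attention within each frame, then causal temporal attention across frames."""
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, T, S, D)
        b, t, s, d = x.shape
        # 1) Spatial: attend over the S tokens of each frame independently.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = x + xs.reshape(b, t, s, d)
        # 2) Temporal: attend over the T frames at each spatial location, causally.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # mask future frames
        xt, _ = self.temporal(xt, xt, xt, attn_mask=causal)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x

x = torch.randn(1, 8, 16 * 16, 128)         # 8 frames of a 16x16 latent grid (toy sizes)
print(FactorizedSTAttention()(x).shape)      # torch.Size([1, 8, 256, 128])
```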

📐 Multi-Scale Training Configuration

| Type | Scales | Probabilities |
|---|---|---|
| Image | 128, 192, 256, 320, 384, 448, 512px | 5%, 10%, 30%, 25%, 15%, 10%, 5% |
| Video | 128, 192, 256, 320, 384px | 10%, 20%, 35%, 25%, 10% |
| Frames | 8, 12, 16, 20, 24, 32 | 10%, 15%, 30%, 20%, 15%, 10% |

Multi-scale training is enabled by default with the random strategy: each batch independently samples a scale from the distributions above.
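
In practice, the random strategy just means drawing a scale per batch from the distributions in the table. A minimal sketch of such a sampler (the weights mirror the table; the helper itself is hypothetical):

```python
import random

IMAGE_SCALES = [(128, 0.05), (192, 0.10), (256, 0.30), (320, 0.25),
                (384, 0.15), (448, 0.10), (512, 0.05)]
VIDEO_SCALES = [(128, 0.10), (192, 0.20), (256, 0.35), (320, 0.25), (384, 0.10)]
FRAME_COUNTS = [(8, 0.10), (12, 0.15), (16, 0.30), (20, 0.20), (24, 0.15), (32, 0.10)]

def sample_scale(table):
    """Pick one scale according to the configured probabilities."""
    values, weights = zip(*table)
    return random.choices(values, weights=weights, k=1)[0]

for _ in range(3):  # each training batch gets its own resolution / frame count
    print(sample_scale(IMAGE_SCALES), sample_scale(VIDEO_SCALES), sample_scale(FRAME_COUNTS))
```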

🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)

| Feature | Description |
|---|---|
| Sample Rate | 16kHz |
| Encoder (ASR) | Raw Waveform Tokenizer → Conformer blocks with RMLA |
| Waveform Decoder | BigVGAN-style with Snake activation + MRF (no external vocoder) |
| KV Compression | LoRA-style KV compression (rank 256) |
| Decoder Alignment | MAS (Monotonic Alignment Search) for text-to-audio alignment |
| Voice Cloning | Zero-Shot Speaker Cloning with speaker embedding (256-dim) |
| In-Context Prompting | Enabled for voice cloning from reference audio |
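
The LoRA-style KV compression row means hidden states are squeezed through a rank-256 bottleneck before being expanded into keys and values, so the cache only needs to hold the small latent. A rough sketch of that idea with assumed sizes (not the RMLA implementation itself):

```python
import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    """Cache a rank-256 latent instead of full keys/values (illustrative)."""
    def __init__(self, hidden: int = 1024, rank: int = 256, n_heads: int = 16):
        super().__init__()
        self.head_dim = hidden // n_heads
        self.n_heads = n_heads
        self.down = nn.Linear(hidden, rank, bias=False)    # compress: hidden -> rank
        self.up_k = nn.Linear(rank, hidden, bias=False)    # expand latent -> keys
        self.up_v = nn.Linear(rank, hidden, bias=False)    # expand latent -> values

    def forward(self, x):                                  # x: (B, T, hidden)
        b, t, _ = x.shape
        latent = self.down(x)                              # (B, T, 256) -- this is what gets cached
        k = self.up_k(latent).view(b, t, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(b, t, self.n_heads, self.head_dim)
        return latent, k, v

x = torch.randn(2, 100, 1024)
latent, k, v = CompressedKV()(x)
print(latent.shape, k.shape)   # torch.Size([2, 100, 256]) torch.Size([2, 100, 16, 64])
```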

🔊 Waveform Decoder (SOTA BigVGAN-style)

Direct audio output without external vocoder:

| Feature | Description |
|---|---|
| Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
| Snake Activation | x + sin²(αx)/α (preserves audio periodicity) |
| Multi-Receptive Field Fusion | Parallel residual stacks (kernels 3, 7, 11; dilations 1, 3, 5) |
| Weight Normalization | Stable training, faster convergence |
| Upsampling | 256× total (rates 8, 8, 2, 2) from features to 16kHz audio |
| Streaming | stream_decode() for low-latency real-time output |
| Output Range | [-1, 1] normalized waveform via tanh |
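
The Snake activation in the table has a closed form, x + sin²(αx)/α, with a learnable α per channel; unlike ReLU-style activations it keeps a periodic component, which suits audio waveforms. A minimal module implementing that formula (the per-channel α initialization is an assumption):

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + sin^2(alpha * x) / alpha, with learnable per-channel alpha."""
    def __init__(self, channels: int, alpha_init: float = 1.0, eps: float = 1e-9):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))  # (1, C, 1) for (B, C, T)
        self.eps = eps

    def forward(self, x):                      # x: (batch, channels, time)
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + self.eps)

x = torch.randn(1, 64, 16000)                  # 1 second of 64-channel features at 16kHz
print(Snake(64)(x).shape)                      # torch.Size([1, 64, 16000])
```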

📚 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

🌐 Open Source Datasets

We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:

  • Text & Code: Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
  • Tool Use: Datasets like Function-Calling-ChatML, Synth-APIGen, and Tool-Calls-MultiTurn enable precise tool invocation across single and multi-turn interactions.
  • Vision (Image/Video): Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
  • Generation: Text-to-Image/Video capabilities are fine-tuned on Stable-Diffusion-Prompts, Sora-Likert-Scoring datasets by Rapidata, and WebVid-10M.
  • Audio: Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.

🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets focus on complex behaviors often missing from public corpora:

| Category | Description |
|---|---|
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), provide citations (Synth-Citation), express uncertainty (Synth-Uncertainty), and ground responses (Synth-GroundedResponse). |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution including shell errors, timeouts, and multi-step debugging workflows that teach the model how to recover from errors. |
| Git Operations | Simulated version control tasks including committing, handling diffs, resolving merge conflicts, and repository context understanding. |
| Chain-of-Thought | Explicit Synth-CoT data to encourage internal reasoning before generating final answers. |
| File Operations | Document handling, FIM (Fill-in-the-Middle), and edit operations for precise file manipulation. |