Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-V2-FP32

THIS IS THE UNQUANTIZED VERSION

Model Description

This model is an improved version of the Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill model. The overall distillation process is unchanged, but several algorithmic improvements to the SVD distillation pipeline yield a substantially stronger distilled model.

It is a new iteration of the distilled version of Qwen/Qwen3-30B-A3B-Thinking-2507, designed to inherit the reasoning and behavioral characteristics of its much larger teacher model, deepseek-ai/DeepSeek-V3.1.

It is the result of applying a LoRA created via an SVD-based distillation pipeline, and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge from a 62-layer, 256-expert teacher model into the more efficient 48-layer, 128-expert architecture of the student model.

The primary goal was to explore the high-fidelity transfer of complex reasoning patterns, particularly those encoded within the Mixture-of-Experts (MoE) layers, from a frontier-class model to a consumer-accessible one.

Compared to the base Qwen3-30B-A3B-Thinking-2507 model, this distill exhibits a more confident, linear chain-of-thought, much like DeepSeek-V3.1. It tends to overthink far less than the base model and produces more accurate, better-structured answers.

The Distillation Methodology

This model was not trained in the conventional sense. Instead, it was created using a layer-by-layer, SVD-based distillation process.

Core Components

  • Teacher Model: deepseek-ai/DeepSeek-V3.1.
  • Student Model: Qwen/Qwen3-30B-A3B-Thinking-2507.
  • LoRA Rank: A high rank of r=2048 was used for all modules to ensure a comprehensive capture of information from the teacher model.
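
The merge code itself is not included in this card; the sketch below shows one plausible shape for factoring a purified weight delta into rank-2048 LoRA matrices and merging them back into the base model, assuming the standard LoRA convention W' = W + (alpha / r) · B · A. The function names, and the choice alpha = r (i.e. scale 1), are illustrative assumptions.

```python
import torch

def extract_lora(delta: torch.Tensor, rank: int = 2048):
    """Factor a weight delta into LoRA matrices (A, B) via truncated SVD."""
    U, S, V = torch.svd_lowrank(delta.float(), q=min(rank, *delta.shape))
    lora_b = U * S.sqrt()        # (out_features, rank): scales U's columns
    lora_a = (V * S.sqrt()).T    # (rank, in_features)
    return lora_a, lora_b

def merge_lora(w_base: torch.Tensor, lora_a: torch.Tensor,
               lora_b: torch.Tensor, alpha: float = 2048.0,
               rank: int = 2048) -> torch.Tensor:
    """Standard LoRA merge: W' = W + (alpha / rank) * B @ A."""
    return w_base + (alpha / rank) * (lora_b @ lora_a)
```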

The Distillation Pipeline

For each corresponding layer in the student and teacher, the following pipeline was executed:

  1. Teacher Layer Interpolation (SLERP): For student layers that fall between two teacher layers (based on a sigmoid mapping), Spherical Linear Interpolation (SLERP) was used to create a geometrically sound blend of the teacher's weights. This preserves the integrity of the high-dimensional representations.

  2. SVD Projection: The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed using a randomized SVD algorithm. The top 2048 most significant components were selected and reconstructed to fit the student layer's smaller dimensions. This high-rank projection is designed for maximum fidelity.

  3. Generalized Procrustes Analysis: After projection, the newly created "synthetic" tensor was optimally aligned with the student's original pre-trained tensor using a hardened least-squares solver. This alignment minimizes representational distance before calculating the final difference, with added checks to prevent numerical instability.

  4. DARE-TIES Purification: The difference tensor (Distilled - Aligned Student) was then purified using the DARE-TIES methodology. This process drops a significant percentage (80%) of the lowest-magnitude values, treating them as noise, and then rescales the remaining important differences. This creates a clean, high-signal delta for the final LoRA. A minimal sketch of the full four-step pipeline follows this list.
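
The card describes these four steps in prose only. Below is a minimal PyTorch sketch of one layer's pass through the pipeline, assuming 2-D weight matrices, the textbook formulations of SLERP and orthogonal Procrustes, and magnitude-based DARE-TIES trimming with DARE-style rescaling. The dimension-cropping in the SVD projection step and all function names are illustrative guesses, not the author's actual code.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float,
          eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two flattened weight tensors."""
    a, b = w0.flatten().double(), w1.flatten().double()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    theta = torch.acos(cos)
    if theta < 1e-4:  # nearly parallel: plain LERP is numerically safer
        out = (1 - t) * a + t * b
    else:
        s = torch.sin(theta)
        out = (torch.sin((1 - t) * theta) / s) * a + (torch.sin(t * theta) / s) * b
    return out.reshape(w0.shape).to(w0.dtype)

def dare_ties_purify(delta: torch.Tensor, drop_rate: float = 0.8) -> torch.Tensor:
    """Zero the lowest-magnitude 80% of the delta, rescale the survivors."""
    flat = delta.flatten()
    keep = max(1, int(flat.numel() * (1.0 - drop_rate)))
    idx = flat.abs().topk(keep).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx] / (1.0 - drop_rate)  # DARE-style rescaling
    return out.reshape(delta.shape)

def distill_layer(w_lo: torch.Tensor, w_hi: torch.Tensor, t: float,
                  w_student: torch.Tensor, rank: int = 2048,
                  drop_rate: float = 0.8) -> torch.Tensor:
    """One layer: SLERP -> SVD projection -> Procrustes -> DARE-TIES."""
    # 1. Blend the two bracketing teacher layers.
    w_teacher = slerp(w_lo, w_hi, t).float()

    # 2. Randomized SVD, then crop the factors to the student's shape
    #    (this cropping scheme is an illustrative guess).
    out_s, in_s = w_student.shape
    q = min(rank, *w_teacher.shape)
    U, S, V = torch.svd_lowrank(w_teacher, q=q)
    w_proj = U[:out_s] @ torch.diag(S) @ V[:in_s].T

    # 3. Orthogonal Procrustes: rotate w_proj to best match the student weight.
    M = w_proj.T @ w_student.float()
    Um, _, Vh = torch.linalg.svd(M, full_matrices=False)
    w_aligned = w_proj @ (Um @ Vh)

    # 4. Purify the difference tensor into a high-signal delta for the LoRA.
    return dare_ties_purify(w_aligned - w_student.float(), drop_rate)
```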

Mixture-of-Experts (MoE) Distillation

The standout feature of this process is the full distillation of the MoE layers, which are critical for nuanced, context-dependent reasoning.

  • Expert Fingerprinting & Clustering: To map the 256 teacher experts to the 128 student experts, each teacher expert was "fingerprinted" by concatenating its constituent weight matrices. FAISS-GPU K-Means clustering was then used to efficiently group these 256 fingerprints into 128 distinct clusters based on their geometric similarity (see the sketch after this list).

  • Advanced Expert Synthesis: Each of the student's 128 experts was synthesized from a weighted blend of the teacher experts assigned to its cluster. This blend is not a simple average; instead, it uses an SVD-based reconstruction from the top teacher experts (ranked by similarity to the cluster centroid) to create a new, synthetic expert that represents the core "concept" of that cluster. This more advanced synthesis aims to create novel, yet faithful, expert representations.
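
Neither step's code is given in the card; the sketch below shows one plausible implementation of the fingerprinting, FAISS-GPU K-Means clustering, and centroid-weighted SVD synthesis. The FAISS and PyTorch calls are real APIs, but the fingerprint layout, the top_k choice, and the softmax blending weights are assumptions.

```python
import numpy as np
import faiss   # faiss-gpu build required for gpu=True
import torch

def fingerprint(expert_mats) -> np.ndarray:
    """Concatenate an expert's flattened weight matrices into one vector.
    (Full-size fingerprints are huge; in practice one would pool or
    randomly project them first.)"""
    return torch.cat([m.flatten() for m in expert_mats]).float().numpy()

def cluster_teacher_experts(teacher_experts, n_clusters: int = 128,
                            use_gpu: bool = True):
    """Group the 256 teacher-expert fingerprints into 128 clusters."""
    fps = np.stack([fingerprint(e) for e in teacher_experts]).astype("float32")
    km = faiss.Kmeans(fps.shape[1], n_clusters, niter=25, gpu=use_gpu)
    km.train(fps)
    _, assign = km.index.search(fps, 1)  # nearest centroid per expert
    return assign.ravel(), km.centroids, fps

def synthesize_expert(member_ids, fps, centroid, weights,
                      top_k: int = 4, rank: int = 2048) -> torch.Tensor:
    """Blend the cluster's most centroid-similar experts, then denoise
    the blend with a low-rank SVD reconstruction."""
    member_fps = fps[member_ids]
    sims = member_fps @ centroid / (
        np.linalg.norm(member_fps, axis=1) * np.linalg.norm(centroid) + 1e-8)
    order = np.argsort(-sims)[:top_k]  # most similar to the centroid first
    coef = torch.softmax(torch.tensor(sims[order], dtype=torch.float32), dim=0)
    blend = sum(c * weights[member_ids[i]].float() for c, i in zip(coef, order))
    q = min(rank, *blend.shape)
    U, S, V = torch.svd_lowrank(blend, q=q)  # keep the dominant shared structure
    return U @ torch.diag(S) @ V.T
```

Presumably each of the 128 synthetic experts then passes through the same Procrustes alignment and DARE-TIES purification as the dense layers, though the card does not spell this out.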

Intended Use

This model is designed for a range of natural language processing tasks in both commercial and general-purpose contexts. It is an iteration of its base model, developed to provide more nuanced reasoning capabilities for complex tasks.

Applications:

Commercial Functions

Conversational AI: Deployment in automated systems for customer support, handling inquiries, and routing issues.

Content Generation: Automated drafting of structured written materials, including marketing copy, technical documentation, and business communications.

Sales and Marketing Analysis: Application in sales workflows for tasks such as lead scoring, content personalization, and sentiment analysis of market data.

Data Analysis: Processing and analysis of unstructured text to identify patterns, extract specified information, and support business intelligence functions.

Human Resources: Use in recruitment processes for functions like resume screening and initial candidate assessment against defined criteria.

Operations: Application in operational workflows, including market trend analysis for demand forecasting and the automation of document processing (e.g., invoices, purchase orders).

Software Development: Integration into development environments to assist with code generation, debugging, and algorithm explanation.

General-Purpose Tasks

Complex Problem-Solving: Logical reasoning and decomposition of complex problems for analysis.

Coding Assistance: Generation, debugging, and explanation of programming code and concepts.

Text Composition: Drafting and editing various forms of written content, including technical, professional, and academic documents.

Information Management: Text summarization and the extraction of specific data points from large bodies of text.

Question Answering: Responding to factual and explanatory questions based on its underlying training data.

Out of Scope

High-Stakes Domains: The model is not suitable for deployment in high-risk environments where an error could result in significant harm. Such applications include, but are not limited to, medical diagnosis, financial advice, or the control of critical infrastructure. Use in these domains requires rigorous testing and human oversight.

Inference settings

Temperature = 0.6

Min_P = 0.00

Top_P = 0.95

Top_K = 20
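
The serving stack is not specified; as one example, these settings map directly onto Hugging Face transformers sampling arguments (min_p requires a recent transformers release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BasedBase/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-V2-FP32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Explain the birthday paradox."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.6,   # recommended settings from above
    min_p=0.0,
    top_p=0.95,
    top_k=20,
    max_new_tokens=4096,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```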

Model details

Model size: 30.5B params
Tensor type: F32 (Safetensors)