QiMing


An AI that rewrites its own rules for greater intelligence.

Result = Model Content × Math²


"Logic is the soul of a model, for it defines:

  • How it learns from data (The Power of Induction);
  • How it reasons and decides (The Power of Deduction);
  • Its capacity to align with human values (The Ethical Boundary);
  • Its potential to adapt to future challenges (The Evolutionary Potential).

If a model pursues nothing but sheer scale or computational power, ignoring the depth and breadth of its logic, it risks becoming a "paper tiger"—imposing on the surface, yet hollow at its core. Conversely, a model built upon elegant logic, even with fewer parameters, can unleash its true vitality in our complex world."


DISCLAIMER

The content generated by this model is for reference purposes only. Users are advised to verify its accuracy independently before use.

This is a 20-billion-parameter (20B) foundation model. It may produce incomplete or inaccurate information, including hallucinations.

If you find this AI too human-like, please remember: it is merely a more intelligent model — not an actual person.


Thanks mradermacher: For creating the GGUF versions of these models

https://huggingface.co/mradermacher/QiMing-LongWriter-20B-MXFP4-GGUF

https://huggingface.co/mradermacher/QiMing-LongWriter-20B-MXFP4-i1-GGUF

Thanks openai: For developing the foundational model used in this project (the base of aifeifei798/QiMing-LongWriter-20B-MXFP4).

https://huggingface.co/openai

Thanks unsloth.ai (Unsloth): For their work enabling these models to run smoothly on standard hardware such as a Google Colab T4 with 16GB of VRAM.

https://unsloth.ai

Thanks Google Colab: For the T4 16GB GPU environment.


QiMing-LongWriter-20B-MXFP4 Model Card

Model Summary

QiMing-LongWriter-20B-MXFP4 is more than just a language model; it is a "Narrative Engine," architected from the ground up to understand and execute the logic of professional creative writing.

This model is built on a 20B parameter Mixture-of-Experts (MoE) architecture and has been quantized to MXFP4 format, enabling high-performance inference on consumer-grade hardware. Its most striking feature is a staggering 131,072-token theoretical context length, made possible by YaRN Rope Scaling, which unlocks the potential for epic-scale long-form storytelling.

However, its true power is revealed in its exceptional performance demonstrated within a practical 8,192-token window. This proves that its remarkable capabilities stem not from brute-force memory, but from its unique training methodology: "Process-Oriented Overfitting," a paradigm that teaches the model the 'how' of creation, not just the 'what'.

QiMing-LongWriter is a model built for the future, and its performance today is merely a glimpse of its vast potential.
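
If you want to sanity-check the long-context configuration before committing to a long run, you can read the model's config without downloading the full weights. This is a minimal sketch; the exact rope_scaling fields depend on the checkpoint, so treat the printed attributes as something to verify rather than a guarantee.

from transformers import AutoConfig

model_id = "aifeifei798/QiMing-LongWriter-20B-MXFP4"

# Only config.json is fetched here, not the 20B weights.
config = AutoConfig.from_pretrained(model_id)

# Per this card, expect max_position_embeddings around 131072 and a YaRN-style rope_scaling entry.
print("max_position_embeddings:", getattr(config, "max_position_embeddings", None))
print("rope_scaling:", getattr(config, "rope_scaling", None))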

Model Details

  • Model Type: 20B Parameter MoE Autoregressive Language Model
  • Quantization Format: MXFP4 (4-bit Micro-Exponent Format)
  • Max Theoretical Context: 131,072 tokens (via YaRN rope_scaling)
  • Demonstrated Context: Exhibits professional-grade creative capabilities within an 8,192-token window
  • Core Strengths: Long-form narrative structure, Genre understanding, Screenplay outlining, Creative concept generation
  • Special Training: Process-Oriented Overfitting based on the "Writer's Long-form Logic" framework
  • Languages: Primarily Chinese, with good support for English

Use Cases & Limitations

Intended Use Cases 📚

  • Novel & Story Writing: Rapidly generate outlines, chapter summaries, and first drafts for various genres (Fantasy, Sci-Fi, Mystery, etc.).
  • Screenplay & Outline Construction: Create detailed frameworks for multi-episode TV series and feature films, including character arcs, plot points, and synopses.
  • Game World-Building & Quest Design: Develop rich lore, faction backstories, character biographies, and narrative questlines for video games.
  • Brainstorming & Idea Expansion: Turn a vague idea into a structured and imaginative narrative path.

Limitations & Biases ⚠️

  • Hardware Dependency: This is the most critical limitation. The model's 131k-token context potential requires a significant amount of VRAM for the KV cache. On most consumer or prosumer hardware, you must set a more practical context length (e.g., 8k, 16k, or 32k) to avoid out-of-memory errors; one way to budget tokens is sketched after this list.
  • Non-Factual Nature: This model is optimized for creative writing. Factual information may be inaccurate or fabricated ("hallucinated"). It should not be used as an authoritative source for knowledge or fact-checking.
  • Requires Quality Prompts: The model's performance is highly correlated with the quality of the prompt. It is a powerful "executive writer" but needs you to be the "creative director." Providing clear structure, characters, and conflict will unlock its full potential.
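
One simple way to stay inside a practical window is to budget tokens on the client side before generating; server stacks such as vLLM can instead cap the window at launch (for example with --max-model-len 8192). The sketch below only illustrates the budgeting idea, with assumed numbers:

from transformers import AutoTokenizer

model_id = "aifeifei798/QiMing-LongWriter-20B-MXFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Practical budget for 16GB-class hardware; raise to 16384 or 32768 if you have VRAM headroom.
context_budget = 8192
max_new_tokens = 2048
max_prompt_tokens = context_budget - max_new_tokens

prompt = "Outline the core plot for a tomb-raiding adventure story set in Chang'an..."
prompt_ids = tokenizer(prompt)["input_ids"]

# Keep prompt + generation inside the budget by trimming the oldest prompt tokens.
if len(prompt_ids) > max_prompt_tokens:
    prompt_ids = prompt_ids[-max_prompt_tokens:]
    prompt = tokenizer.decode(prompt_ids)

print(f"prompt tokens: {len(prompt_ids)}, generation budget: {max_new_tokens}")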

Training Procedure

The core innovation of this model lies in its unique training paradigm. Instead of the traditional approach of learning from the "results" of massive text corpora, we focused on teaching the model the "process" of creation.

We constructed a structured creation framework named the "Writer's Long-form Logic," which deconstructs the professional writing process into logical modules (e.g., Genesis Core, Architect Agent, Dramatist Agent). Using a method we term "Process-Oriented Overfitting," we guided the model to deeply learn and internalize this creative workflow, making it an instinct.

The result is a model that instinctively thinks in terms of structure when faced with a creative task. This explains why it maintains long-range logical coherence even within a limited context window—its intelligence stems more from its intrinsic "methodology" than from a vast "memory."

Highlights

  • Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
  • Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
  • Full chain-of-thought: Gain complete access to the model's reasoning process, which facilitates debugging and increases trust in outputs. The chain-of-thought is not intended to be shown to end users.
  • Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
  • Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
  • MXFP4 quantization: The models were post-trained with MXFP4 quantization of the MoE weights, allowing the QiMing-LongWriter-20B-MXFP4 model to run within 16GB of memory. All evals were performed with the same MXFP4 quantization.

Inference examples

Transformers

You can use QiMing-LongWriter-20B-MXFP4 with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you call model.generate directly, you need to apply the harmony format manually via the chat template or the openai-harmony package.

To get started, install the necessary dependencies to set up your environment:

pip install -U transformers kernels torch 

Once set up, you can run the model with the snippet below:

from transformers import pipeline
import torch

model_id = "aifeifei798/QiMing-LongWriter-20B-MXFP4"

# torch_dtype="auto" keeps the checkpoint's native precision; device_map="auto" places layers on the available devices.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = """
Outline the core plot for a tomb-raiding adventure story:
Protagonist: A seasoned but greedy tomb raider.
Setting: A cursed ancient tomb in Chang'an.
Core Event: The protagonist encounters supernatural monsters and intricate traps.
Ending Requirement: The protagonist narrowly escapes with their life, but fails to secure any treasure.
Design a key turning point for the protagonist's escape.
"""

messages = [
    {"role": "user", "content": prompt},
]

outputs = pipe(
    messages,
    max_new_tokens=2560,
)
# The last entry in generated_text is the assistant's newly generated message.
print(outputs[0]["generated_text"][-1])
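
If you prefer to call model.generate directly instead of the pipeline, the harmony format mentioned above is applied for you by the tokenizer's chat template. A minimal sketch, reusing model_id and messages from the snippet above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# apply_chat_template renders the messages into the harmony format the model expects.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2560)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))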

Alternatively, you can run the model via Transformers Serve to spin up an OpenAI-compatible web server:

transformers serve
transformers chat localhost:8000 --model-name-or-path aifeifei798/QiMing-LongWriter-20B-MXFP4

Learn more about how to use gpt-oss with Transformers.

vLLM

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible web server. The following command will automatically download the model and start the server.

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve aifeifei798/QiMing-LongWriter-20B-MXFP4
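
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch, assuming vLLM's default endpoint at http://localhost:8000/v1 (adjust host, port, and the dummy API key to your setup):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; no real key is required by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="aifeifei798/QiMing-LongWriter-20B-MXFP4",
    messages=[
        {"role": "system", "content": "Reasoning: medium"},
        {"role": "user", "content": "Outline a three-act structure for a mystery novella set in Chang'an."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)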

Learn more about how to use gpt-oss with vLLM.

PyTorch / Triton

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

LM Studio

If you are using LM Studio, you can use the following command to download the model.

# QiMing-LongWriter-20B-MXFP4
lms get aifeifei798/QiMing-LongWriter-20B-MXFP4

Check out the gpt-oss awesome list for a broader collection of gpt-oss resources and inference partners.


Download the model

You can download the model using the Hugging Face CLI:

# QiMing-LongWriter-20B-MXFP4
huggingface-cli download aifeifei798/QiMing-LongWriter-20B-MXFP4 --local-dir QiMing-LongWriter-20B-MXFP4/
pip install gpt-oss
python -m gpt_oss.chat QiMing-LongWriter-20B-MXFP4/

Reasoning levels

You can adjust the reasoning level that suits your task across three levels:

  • Low: Fast responses for general dialogue.
  • Medium: Balanced speed and detail.
  • High: Deep and detailed analysis.

The reasoning level can be set in the system prompts, e.g., "Reasoning: high".
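
For example, reusing the pipe object from the Transformers example above, the level can be set with a system message (a minimal sketch, following the "Reasoning: high" convention quoted above):

messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Design a five-episode arc for a mystery series set in Chang'an."},
]

outputs = pipe(messages, max_new_tokens=2560)
print(outputs[0]["generated_text"][-1])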

Tool use

The gpt-oss models are excellent for:

  • Web browsing (using built-in browsing tools)
  • Function calling with defined schemas (see the sketch after this list)
  • Agentic operations like browser tasks
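
For function calling with the Transformers stack, recent tokenizer chat templates accept a tools argument. The sketch below is an illustration only: get_weather is a made-up placeholder, and whether tools are rendered this way depends on the checkpoint's chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aifeifei798/QiMing-LongWriter-20B-MXFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: The name of the city.
    """
    return f"Sunny in {city}"  # placeholder implementation

messages = [
    {"role": "user", "content": "What's the weather like in Xi'an right now?"},
]

# The function's signature and docstring are converted into a tool schema by the chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# The reply may contain a tool call; parsing and executing it is left to your application.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))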

Fine-tuning

QiMing-LongWriter-20B-MXFP4 can be fine-tuned for a variety of specialized use cases. As a relatively small 20B model, it can be fine-tuned on consumer hardware.
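
A common recipe on consumer hardware is parameter-efficient fine-tuning (LoRA) rather than full fine-tuning. The sketch below uses the peft library with assumed hyperparameters and target module names, so check the model's actual layer names before running it; it is an outline, not the author's training setup.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "aifeifei798/QiMing-LongWriter-20B-MXFP4",
    torch_dtype="auto",
    device_map="auto",
)

# Illustrative LoRA settings; tune rank/alpha and target_modules for your hardware and data.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train with your preferred trainer (e.g. TRL's SFTTrainer) on your writing dataset.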

Citation

If you use this model in your research or projects, please consider citing:

@misc{QiMingLongWriter2025,
  author       = {aifeifei798},
  title        = {QiMing-LongWriter-20B-MXFP4: A Narrative Engine with a 131k Context Window, Powered by Process-Oriented Training},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/aifeifei798/QiMing-LongWriter-20B-MXFP4}}
}