Ring-flash-linear-2.0

Introduction

We are excited to announce the official open-source release of Ring-flash-linear-2.0!

Building on the success of our Ling 2.0 series, this model continues to use a hybrid architecture that combines linear attention with standard attention, balancing performance and efficiency. By integrating our proven MoE design with optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-flash-linear-2.0 achieves the performance of a 40B dense model while activating only 6.1B parameters. The model was converted from Ling-flash-base-2.0 and further trained on an additional 1T tokens. On benchmarks, Ring-flash-linear-2.0 is comparable to standard-attention models such as Ring-flash-2.0 and outperforms other open-source MoE and dense models in its class on several demanding tasks. With support for a 128k context window, it is both fast and accurate, especially when handling long-form inputs and outputs.
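To make the 1/32 expert activation ratio concrete, here is a minimal sketch of top-k expert routing for a single token. The expert counts (256 routed experts, 8 selected per token) are assumptions chosen only because they give a 1/32 ratio; they are not taken from this model's configuration, and `route_token` is a hypothetical helper, not part of the released code.

```python
import math
import random

NUM_EXPERTS = 256  # assumption for illustration only
TOP_K = 8          # assumption: 8 / 256 = 1/32


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return (expert_ids, weights).

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = ranked[:top_k]
    max_logit = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
expert_ids, weights = route_token(logits, TOP_K)

activation_ratio = len(expert_ids) / NUM_EXPERTS
print(f"active experts: {len(expert_ids)} of {NUM_EXPERTS} (ratio {activation_ratio})")
```

Only the selected experts run for a given token, which is why the number of activated parameters can be far smaller than the total parameter count.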

Figure 1: Hybrid Linear Model Architecture

Evaluation

To better demonstrate the model's capabilities, we selected representative open-source thinking models and closed-source APIs for comparison. We present results on several challenging reasoning benchmarks spanning mathematics, coding, and science. We also evaluate the model on a creative writing task (Creative Writing v3). Across these benchmarks, our model performs on par with the compared models.

Figure 2: Model Performance Comparison

Figure 3: Model Performance Comparison

Linear Attention, Highly Sparse, High-Speed Generation

Thanks to its hybrid attention mechanism and highly sparse MoE architecture, Ring-flash-linear-2.0 achieves near-linear time complexity and constant space complexity, resulting in outstanding inference efficiency. To quantify this advantage, we compared the prefill and decode throughput of our model with top-tier competitors of similar size or performance. The results show a clear advantage for our model in inference efficiency.
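The constant-space property can be illustrated with a simplified causal linear attention recurrence. This is a sketch under stated assumptions, not the model's actual kernel: it omits normalization, decay, and the standard attention layers of the hybrid design, and both function names are hypothetical. It shows that the recurrent form carries only a fixed-size state between tokens while matching an explicit pairwise (quadratic) computation.

```python
import random


def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention computed token by token.

    The only memory carried between steps is `state`, a d_k x d_v matrix
    whose size does not depend on how many tokens have been processed.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # state += outer(k, v)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # output = q @ state
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state


def linear_attention_quadratic(qs, ks, vs):
    """Reference: the same result via explicit pairwise scores (O(n^2) work)."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * d_v
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, ks[s]))
            for j in range(d_v):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs


random.seed(0)
n, d = 16, 4  # toy sequence length and head dimension
qs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ks = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
vs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

outputs, state = linear_attention_recurrent(qs, ks, vs)
reference = linear_attention_quadratic(qs, ks, vs)
max_diff = max(abs(a - b) for o, r in zip(outputs, reference) for a, b in zip(o, r))
print(f"state shape: {len(state)}x{len(state[0])}, max difference: {max_diff:.2e}")
```

During decoding, each new token updates the state in place, so per-token cost and memory stay fixed as the sequence grows, unlike a key-value cache that grows with every token.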

Figure 4: Ring-flash-linear-2.0 prefill throughput

Figure 5: Ring-flash-linear-2.0 decode throughput

Quickstart

Requirements

pip install flash-linear-attention==0.3.2
pip install transformers==4.56.1

🤗 Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-flash-linear-2.0"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)


prompts = [
    "Give me a short introduction to large language models."
]
input_texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    input_texts.append(text)

print(input_texts)

model_inputs = tokenizer(
    input_texts,
    return_tensors="pt",
    return_token_type_ids=False,
    padding=True,
    padding_side="left",  # left-pad so generated tokens directly follow each prompt
).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192,
    do_sample=False,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print("*" * 30)
print(responses)
print("*" * 30)
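The slicing step `output_ids[len(input_ids):]` in the example above relies on every row of the padded batch having the same length, with the generated tokens appended after it. The toy sketch below reproduces that logic with plain lists; `left_pad`, `fake_generate`, and the token ids are hypothetical stand-ins, not tokenizer or model calls.

```python
PAD = 0  # hypothetical pad token id


def left_pad(batch, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]


def fake_generate(padded_batch, new_tokens):
    """Stand-in for model.generate: returns padded prompt + continuation per row."""
    return [row + cont for row, cont in zip(padded_batch, new_tokens)]


prompts = [[11, 12, 13], [21]]
continuations = [[91, 92], [93, 94]]

padded = left_pad(prompts)
generated = fake_generate(padded, continuations)

# Same slicing as in the example above: drop the (padded) prompt from each row.
completions = [out[len(inp):] for inp, out in zip(padded, generated)]
print(completions)  # [[91, 92], [93, 94]]
```

With left padding, the last prompt token sits directly before the first generated token, so a single slice per row removes both the padding and the prompt.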

🚀 SGLang

Environment Preparation

We have submitted a PR to the official SGLang repository, and it will be merged later. Until then, prepare the environment with the following steps. First, install the community version of SGLang and the required packages:

pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao

Then install our SGLang wheel package:

pip install http://raw.githubusercontent.com/inclusionAI/Ring-V2/blob/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall

Run Inference

SGLang now supports both BF16 and FP8 models; which one is used depends on the dtype of the model at <model_path>. Both use the same command:

Start server:

python -m sglang.launch_server \
    --model-path <model_path> \
    --trust-remote-code \
    --tp-size 4 \
    --disable-radix-cache \
    --json-model-override-args "{\"linear_backend\": \"seg_la\"}"

Client (replace ${PORT} with the port the server is listening on):

curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
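For scripted use, the same request can be built in Python with the standard library. This is a minimal sketch: the helper names are hypothetical, the default port of 30000 is an assumption about the server's default and should be changed to match your launch settings, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(prompt, port=30000, host="localhost"):
    """Build a POST request mirroring the curl command above.

    The default port is an assumption; pass the port your server uses.
    """
    payload = {
        "model": "auto",
        "temperature": 0.6,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Give me a short introduction to large language models.")
print(request.full_url)
# With a server running: print(send_chat_request(request))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.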

More usage examples can be found in the SGLang documentation.

🚀 vLLM

Environment Preparation

Our PR has not yet been submitted to the vLLM community, so for now prepare the environment with the following steps. First, install PyTorch:

pip install torch==2.7.0 torchvision==0.22.0

Then install our vLLM wheel package:

pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall

Offline Inference

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")

sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)

llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)

# Print the generated text for the first (and only) prompt
print(outputs[0].outputs[0].text)

Online Inference

vllm serve inclusionAI/Ring-flash-linear-2.0 \
              --tensor-parallel-size 4 \
              --gpu-memory-utilization 0.90 \
              --no-enable-prefix-caching