RWKV-Seed-OSS-36B-hxa079

Acknowledgment

This project received computational resources and technical support from Recursal.AI. I'm deeply grateful for their support!

This is an experimental model that converts most of the base Transformer LLM's attention layers to RWKV linear attention using the RADLADS conversion method.


Model Overview

  • Model Name: RWKV-Seed-OSS-36B-hxa079
  • Architecture: RWKV “hxa079+” hybrid — RWKV-Attention strategically interleaved with NoPE FullAttention
  • Base Model: ByteDance-Seed/Seed-OSS-36B-Instruct
  • Model Revision: alpha
  • Parameters: ~37.1B
  • Context Window (Passkey): 130k

Architecture Details

  • RWKV Layers: Interleaved RWKV blocks based on the hxa079 design

  • Transformer Layers: Placed at strategic depths to enhance long-context performance

  • Hybrid Design:

    • RWKV provides temporal decay and efficient recurrent-style state handling
    • NoPE (No Positional Embedding) FullAttention augments global reasoning without redundant positional encoding (a layer-schedule sketch follows this list)
  • LoRA Ranks (low-rank projection sizes inside each RWKV block):

    • Decay: 448
    • ICLR (in-context learning rate): 192
    • Value Residual Mix: 128
    • Key Residual Mix: 128
    • Gate: 576
  • RoPE Usage: Enabled (use_rope: true), aligning positional encoding with RWKV blocks
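
A rough illustration of the hybrid layout is below. This is a sketch only: the specific full-attention layer indices are assumptions chosen for the example; the authoritative schedule lives in the model's config and custom modeling code.

# Illustrative only: one way an hxa079-style hybrid schedule could be expressed.
# The layer indices below are assumed for the sake of example, not taken from
# this model's actual configuration.
NUM_LAYERS = 64
FULL_ATTENTION_LAYERS = {15, 31, 47, 63}  # hypothetical "strategic depths"

layer_types = [
    "full_attention_nope" if i in FULL_ATTENTION_LAYERS else "rwkv"
    for i in range(NUM_LAYERS)
]
print(f"{layer_types.count('rwkv')} RWKV layers, "
      f"{layer_types.count('full_attention_nope')} NoPE full-attention layers")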


Key Hyperparameters

  • Hidden Size: 5120
  • Intermediate Size: 27,648
  • Head Dimension: 128
  • Attention Heads: 80
  • Key/Value Heads: 8
  • Hidden Layers: 64
  • Max Position Embeddings: 524,288
  • Activation: SiLU
  • Dropout: 0.1 (residual & attention)
  • Bias: Disabled for MLP & Attention Output
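
For reference, a minimal sketch of these values as a Python dict. The field names are assumptions that mirror common Hugging Face config.json conventions, not a copy of the shipped config; the last line shows the implied grouped-query-attention ratio.

# Sketch only: key hyperparameters as a plain dict (field names are assumed).
hparams = {
    "hidden_size": 5120,
    "intermediate_size": 27648,
    "head_dim": 128,
    "num_attention_heads": 80,
    "num_key_value_heads": 8,
    "num_hidden_layers": 64,
    "max_position_embeddings": 524288,
    "hidden_act": "silu",
    "residual_dropout": 0.1,
    "attention_dropout": 0.1,
    "mlp_bias": False,
    "attention_output_bias": False,
}

# 80 query heads sharing 8 KV heads -> 10 query heads per KV group (GQA).
print(hparams["num_attention_heads"] // hparams["num_key_value_heads"],
      "query heads per KV head")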

Evaluation

Performance evaluation is ongoing. The model shows promising results in:

  • Maintaining the base model's capabilities while gaining linear-attention efficiency
  • Significantly improved needle-in-a-haystack performance compared to pure RWKV architectures
  • Competitive performance on standard language modeling benchmarks

Current results (base model scores in parentheses):

  • MMLU: 78.39% (base: 82.41%)
  • GSM8K: 86.88% (base: 93.93%), with a 2048-token generation limit
  • Passkey retrieval: 130k+ tokens (base: 500k); a minimal retrieval check is sketched below
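
The passkey figure requires a long-context retrieval harness to reproduce; the authors' script is not included here. Below is a hypothetical needle-in-a-haystack check in which the filler text, prompt wording, and context length are all illustrative assumptions. It assumes a model and tokenizer loaded as in the Usage section further down.

# Hypothetical passkey (needle-in-a-haystack) check -- not the authors' harness.
# Assumes `model` and `tokenizer` are already loaded as shown in the Usage section.
import random

def passkey_check(model, tokenizer, n_filler: int = 2000) -> bool:
    passkey = random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. " * n_filler
    half = len(filler) // 2
    prompt = (
        filler[:half]
        + f"The pass key is {passkey}. Remember it. "
        + filler[half:]
        + "\nWhat is the pass key? The pass key is"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    # Decode only the continuation produced after the prompt.
    answer = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return str(passkey) in answer

print("passkey retrieved:", passkey_check(model, tokenizer))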

Usage with RWKV-Infer

Usage with Hugging Face Transformers

Requires the flash-linear-attention package:

pip install flash-linear-attention

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079"

# trust_remote_code is required: the hybrid RWKV/attention layers are defined
# in the repository's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
 I can't remember the name or the YouTube URL that people use to link as an example URL.
 What's the song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated continuation remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
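
To stream tokens to stdout as they are produced, the standard transformers TextStreamer can be used; nothing model-specific is assumed here, and it reuses the tokenizer and model_inputs from the example above.

from transformers import TextStreamer

# Prints decoded tokens as they are generated; the prompt itself is skipped.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)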

Code Repositories

Model Card Contact

OpenMOSE - 2025
