Model Card: RWKV-Seed-OSS-36B-hxa07A

PRWKV

Acknowledgments

This project was made possible through computational resources and technical support provided by Recursal.AI, to whom we extend our deepest gratitude. We are particularly grateful to SmerkyG for his invaluable technical assistance and guidance throughout this research.

Model Overview

RWKV-Seed-OSS-36B-hxa07A is a cost-efficient hybrid-architecture model that balances model performance with inference efficiency while significantly reducing inference complexity.

Architecture: hxa07A

The hxa07A architecture is based on the RWKV-7 "Goose" Dynamic State Evolution architecture. This hybrid model consists of 64 layers, with 10 layers strategically configured as NoPE-GQA (No Positional Encoding - Grouped Query Attention) to achieve both strong long-context capabilities and RWKV's characteristic inference efficiency.
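
For illustration, the layer stack can be described by a simple per-layer type list. The counts below (64 total layers, 54 RWKV, 10 NoPE-GQA) come from the specifications in this card; the evenly spaced placement of the attention layers is an assumption made for this sketch only, not the actual hxa07A layout.

# Hypothetical sketch of the hxa07A layer mix: 64 layers total,
# 10 NoPE-GQA layers and 54 RWKV-7 layers. The even spacing of the
# attention layers is an illustrative assumption.
TOTAL_LAYERS = 64
NUM_ATTENTION_LAYERS = 10

attention_positions = {
    round(i * TOTAL_LAYERS / NUM_ATTENTION_LAYERS)
    for i in range(NUM_ATTENTION_LAYERS)
}
layer_types = [
    "nope_gqa" if i in attention_positions else "rwkv7"
    for i in range(TOTAL_LAYERS)
]

assert layer_types.count("nope_gqa") == 10
assert layer_types.count("rwkv7") == 54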

This model was converted from a Transformer model into an RWKV hybrid model by applying the hxa07A architecture, following the RADLADS (Rapid Attention Distillation to Linear Attention Decoders at Scale) methodology proposed by SmerkyG.
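
As a rough illustration of the general idea behind such a conversion (not the exact RADLADS recipe), distillation trains the student's replacement blocks to reproduce the behavior of the frozen teacher. The loss below, combining per-layer hidden-state matching with a KL term on the logits, is a hedged sketch; the function name, weighting, and choice of targets are assumptions for illustration.

import torch.nn.functional as F

def conversion_distillation_loss(student_hidden, teacher_hidden,
                                 student_logits, teacher_logits,
                                 alpha=1.0, beta=1.0):
    """Illustrative distillation objective for attention-to-linear-attention
    conversion: match intermediate hidden states layer by layer and match
    the teacher's output distribution. Not the exact RADLADS procedure."""
    hidden_loss = sum(
        F.mse_loss(s, t.detach())
        for s, t in zip(student_hidden, teacher_hidden)
    ) / len(student_hidden)
    kl_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return alpha * hidden_loss + beta * kl_loss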

Model Specifications

| Specification | Value |
|---|---|
| Base Architecture | RWKV v7-based hxa07A |
| Teacher Model | ByteDance/Seed-OSS-36B |
| Hidden Dimension | 5,120 |
| Attention Dimension | 10,240 |
| RWKV Layers | 54 |
| NoPE-GQA Layers | 10 |
| Total Layers | 64 |
| KV Cache Reduction | 84.4% |
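
The KV cache reduction follows directly from the layer counts above: only the 10 NoPE-GQA layers keep a growing key/value cache, while the 54 RWKV layers carry a fixed-size recurrent state, so roughly 1 − 10/64 ≈ 84.4% of the per-token KV cache is eliminated relative to a 64-layer full-attention stack:

# Only the NoPE-GQA layers need a KV cache that grows with context length;
# the RWKV layers keep a constant-size state.
kv_layers, total_layers = 10, 64
reduction = 1 - kv_layers / total_layers
print(f"{reduction:.1%}")  # 84.4%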

Performance Benchmarks

Long Context Performance

  • Passkey Retrieval: 236k tokens (base model: 500k tokens)
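
For context, passkey retrieval checks whether the model can recover a short random key buried inside long filler text. The snippet below is a minimal sketch of how such a prompt is typically built; it is not the harness used to obtain the number above, and the filler text, key format, and token-count approximation are assumptions.

import random

def build_passkey_prompt(approx_tokens=236_000,
                         filler="The grass is green. The sky is blue. "):
    # Hide a random 5-digit key in the middle of repeated filler text.
    passkey = str(random.randint(10_000, 99_999))
    needle = f"The pass key is {passkey}. Remember it. "
    haystack = filler * (approx_tokens // 10)  # filler is roughly 10 tokens
    mid = len(haystack) // 2
    prompt = (haystack[:mid] + needle + haystack[mid:]
              + "\nWhat is the pass key?")
    return prompt, passkey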

Benchmark Results

| Task | Metric | RWKV-Seed-OSS-36B-hxa07A | Seed-OSS-36B |
|---|---|---|---|
| arc_challenge | acc_norm | 0.6476 | 0.6399 |
| arc_easy | acc_norm | 0.8561 | 0.8363 |
| hellaswag | acc_norm | 0.8172 | 0.8257 |
| lambada_openai | acc | 0.6467 | 0.6447 |
| piqa | acc | 0.8020 | 0.7987 |
| sciq | acc | 0.9780 | 0.9760 |
| winogrande | acc | 0.7293 | 0.7340 |
| mmlu | acc | 0.7960 | 0.8241 |
| gsm8k (2k) | flex-match | 0.8931 | 0.9393 |

Training Details

| Parameter | Value |
|---|---|
| Hardware | 8× AMD Instinct MI325X |
| Training Duration | 36 hours |

Significance and Future Potential

Linear hybrid models represent a promising approach to dramatically reducing computational costs, especially for long-context scenarios. This architecture achieves:

  • 10× or greater reduction in inference costs
  • Reasoning with essentially constant per-token latency, independent of context length
  • Improved AI accessibility

Practical Benefits

With limited GPU resources, this model enables:

  • Significantly higher multi-batch inference throughput compared to pure Transformer models
  • Substantial reduction in deployment costs
  • Efficient scaling for production environments

Usage with Hugging Face Transformers

This model requires the flash-linear-attention package:

pip install flash-linear-attention

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa07A"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
 I can't remember the name or the youtube URL that people use to link as an example url.
 What's song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens so only the new completion remains.
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Code Repositories

Research Journey

Over the past several months of researching heterogeneous architecture distillation techniques, we encountered numerous challenges when training models exceeding 14B parameters. However, we have now achieved spike-free training even for models exceeding 30B parameters.

Personal note: I estimate I consumed over 100 liters of coffee during this research! (laughs)

We believe that once this conversion technology is fully established, it will enable us to reduce the operational costs of existing models to one-tenth or less while maintaining their capabilities.

Feedback Welcome

We eagerly welcome your feedback and observations on this model!

License

Apache 2.0

Contact

https://x.com/_m0se_


This model card was generated by RWKV-Seed-OSS-36B-hxa07A.

© 2025 OpenMOSE

