Model Card: RWKV-Seed-OSS-36B-hxa07A

Acknowledgments
This project was made possible through computational resources and technical support provided by Recursal.AI, to whom we extend our deepest gratitude. We are particularly grateful to SmerkyG for his invaluable technical assistance and guidance throughout this research.
- This model is inspired by RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale.
- Code: https://github.com/recursal/RADLADS
Model Overview
RWKV-Seed-OSS-36B-hxa07A is a cost-efficient hybrid-architecture model that balances model performance and inference efficiency while significantly reducing inference complexity.
Architecture: hxa07A
The hxa07A architecture is based on the RWKV-7 "Goose" Dynamic State Evolution architecture. This hybrid model consists of 64 layers, with 10 layers strategically configured as NoPE-GQA (No Positional Encoding - Grouped Query Attention) to achieve both strong long-context capabilities and RWKV's characteristic inference efficiency.
The model was created by converting a Transformer model into an RWKV hybrid using the hxa07A architecture, following the RADLADS (Rapid Attention Distillation to Linear Attention Decoders at Scale) methodology proposed by SmerkyG.
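The exact positions of the attention layers are not listed in this card; as a rough illustration of the 54/10 split described above, the sketch below simply spreads the 10 NoPE-GQA layers evenly across the 64-layer stack (the indices are hypothetical, not read from the released configuration).

```python
# Hypothetical layer schedule for the 64-layer hxa07A stack.
# The real attention-layer indices are not published here; this only spreads
# the 10 NoPE-GQA layers evenly among the RWKV-7 layers for illustration.
NUM_LAYERS = 64
NUM_ATTENTION_LAYERS = 10

attention_layers = {
    round(i * (NUM_LAYERS - 1) / (NUM_ATTENTION_LAYERS - 1))
    for i in range(NUM_ATTENTION_LAYERS)
}  # e.g. {0, 7, 14, 21, 28, 35, 42, 49, 56, 63}

for layer_idx in range(NUM_LAYERS):
    kind = "NoPE-GQA" if layer_idx in attention_layers else "RWKV-7"
    print(f"layer {layer_idx:2d}: {kind}")
```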
Model Specifications
| Specification | Value |
|---|---|
| Base Architecture | RWKV v7-based hxa07A |
| Teacher Model | ByteDance-Seed/Seed-OSS-36B-Instruct |
| Hidden Dimension | 5,120 |
| Attention Dimension | 10,240 |
| RWKV Layers | 54 |
| NoPE-GQA Layers | 10 |
| Total Layers | 64 |
| KV Cache Reduction | 84.4% |
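The KV cache figure is consistent with keys and values being cached only for the 10 NoPE-GQA layers, while the 54 RWKV layers carry a fixed-size recurrent state. A quick sanity check, assuming per-layer cache size is otherwise unchanged:

```python
# 54 of the 64 layers are RWKV-7 and keep a constant-size state, so only the
# 10 NoPE-GQA layers need a growing KV cache (equal per-layer cache assumed).
total_layers = 64
attention_layers = 10

kv_cache_fraction = attention_layers / total_layers      # 0.15625
kv_cache_reduction = 1 - kv_cache_fraction               # 0.84375
print(f"KV cache reduction: {kv_cache_reduction:.1%}")   # -> 84.4%
```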
Performance Benchmarks
Long Context Performance
- Passkey Retrieval: 236k tokens (base model: 500k tokens)
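The card does not describe the exact passkey protocol; a common formulation hides a short numeric "passkey" inside long filler text and asks the model to retrieve it. A minimal sketch of such a prompt builder (a hypothetical helper, not the evaluation code used for the 236k-token figure):

```python
# Hypothetical passkey-retrieval prompt builder: hide a random number in long
# filler text and ask the model to return it.
import random

def build_passkey_prompt(approx_words: int) -> tuple[str, str]:
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    repeats = max(1, approx_words // 12)  # ~12 words per filler repetition
    haystack = filler * repeats

    # Insert the needle at a random character position inside the filler.
    needle = f"\nThe passkey is {passkey}. Remember it.\n"
    pos = random.randint(0, len(haystack))
    prompt = (
        haystack[:pos] + needle + haystack[pos:]
        + "\nWhat is the passkey? Answer with the number only."
    )
    return prompt, passkey

prompt, expected = build_passkey_prompt(approx_words=200_000)
```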
Benchmark Results
| Task | Metric | RWKV-Seed-OSS-36B-hxa07A | Seed-OSS-36B |
|---|---|---|---|
| arc_challenge | acc_norm | 0.6476 | 0.6399 |
| arc_easy | acc_norm | 0.8561 | 0.8363 |
| hellaswag | acc_norm | 0.8172 | 0.8257 |
| lambada_openai | acc | 0.6467 | 0.6447 |
| piqa | acc | 0.8020 | 0.7987 |
| sciq | acc | 0.9780 | 0.9760 |
| winogrande | acc | 0.7293 | 0.7340 |
| mmlu | acc | 0.7960 | 0.8241 |
| gsm8k (2k) | flex-match | 0.8931 | 0.9393 |
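The task names suggest these scores come from EleutherAI's lm-evaluation-harness; the exact settings (few-shot counts, prompt format) are not stated, so the sketch below is an approximation rather than the evaluation script actually used.

```python
# Approximate reproduction sketch using lm-evaluation-harness (pip install lm_eval).
# Few-shot settings for the table above are not stated, so task defaults apply.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=OpenMOSE/RWKV-Seed-OSS-36B-hxa07A,"
        "trust_remote_code=True,dtype=auto"
    ),
    tasks=["arc_challenge", "arc_easy", "hellaswag", "lambada_openai",
           "piqa", "sciq", "winogrande", "mmlu", "gsm8k"],
    batch_size=4,  # adjust to available GPU memory
)
print(results["results"])
```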
Training Details
| Parameter | Value |
|---|---|
| Hardware | 8× AMD Instinct MI325X |
| Training Duration | 36 hours |
Significance and Future Potential
Linear hybrid models represent a promising approach to dramatically reducing computational costs, especially for long-context scenarios. This architecture achieves:
- 10× or greater reduction in inference costs
- Zero-latency reasoning capabilities
- Improved AI accessibility
Practical Benefits
With limited GPU resources, this model enables:
- Significantly larger inference batch sizes compared to pure Transformer models
- Substantial reduction in deployment costs
- Efficient scaling for production environments
Usage with Hugging Face Transformers
This model requires the flash-linear-attention package:

```bash
pip install flash-linear-attention
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa07A"

# Load the model with custom code enabled (required for the hxa07A architecture)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
I can't remember the name or the youtube URL that people use to link as an example url.
What's the song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Build the chat-formatted prompt and tokenize it
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
Code Repositories
- RADLADS Project Code: The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: https://github.com/recursal/RADLADS
- ARWKV Project Code: The original ARWKV training code can be found at: https://github.com/yynil/RWKVInside
- Specific Training Code (OpenMOSE): The training code for this particular model is available at: https://github.com/OpenMOSE/RWKVInside (Note: this repository is still under development and may contain bugs.)
Research Journey
Over the past several months of researching heterogeneous architecture distillation techniques, we encountered numerous challenges when training models exceeding 14B parameters. However, we have now achieved spike-free training even for models exceeding 30B parameters.
Personal note: I estimate I consumed over 100 liters of coffee during this research! (laughs)
We believe that once this conversion technology is fully established, it will enable us to reduce the operational costs of existing models to one-tenth or less while maintaining their capabilities.
Feedback Welcome
We eagerly welcome your feedback and observations on this model!
License
Apache 2.0
Contact
This model card was generated by RWKV-Seed-OSS-36B-hxa07A.
© 2025 OpenMOSE