Model Card: RWKV-Seed-OSS-36B-hxa07A

PRWKV

Acknowledgments

This project was made possible through computational resources and technical support provided by Recursal.AI, to whom we extend our deepest gratitude. We are particularly grateful to SmerkyG for his invaluable technical assistance and guidance throughout this research.

Model Overview

RWKV-Seed-OSS-36B-hxa07A is a cost-efficient hybrid-architecture model that balances model performance with inference efficiency while significantly reducing inference complexity.

Architecture: hxa07A

The hxa07A architecture is based on the RWKV-7 "Goose" Dynamic State Evolution architecture. This hybrid model consists of 64 layers, with 10 layers strategically configured as NoPE-GQA (No Positional Encoding - Grouped Query Attention) to achieve both strong long-context capabilities and RWKV's characteristic inference efficiency.
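
For illustration, the layer stack can be described by a simple per-layer type list. The counts below (64 total layers, 54 RWKV, 10 NoPE-GQA) come from the specifications in this card; the evenly spaced placement of the attention layers is an assumption made for this sketch only, not the actual hxa07A layout.

# Hypothetical sketch of the hxa07A layer mix: 64 layers total,
# 10 NoPE-GQA layers and 54 RWKV-7 layers. The even spacing of the
# attention layers is an illustrative assumption.
TOTAL_LAYERS = 64
NUM_ATTENTION_LAYERS = 10

attention_positions = {
    round(i * TOTAL_LAYERS / NUM_ATTENTION_LAYERS)
    for i in range(NUM_ATTENTION_LAYERS)
}
layer_types = [
    "nope_gqa" if i in attention_positions else "rwkv7"
    for i in range(TOTAL_LAYERS)
]

assert layer_types.count("nope_gqa") == 10
assert layer_types.count("rwkv7") == 54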

This model was converted from a Transformer model into an RWKV hybrid model by applying the hxa07A architecture, following the RADLADS (Rapid Attention Distillation to Linear Attention Decoders at Scale) methodology proposed by SmerkyG.
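
As a rough illustration of the general idea behind such a conversion (not the exact RADLADS recipe), distillation trains the student's replacement blocks to reproduce the behavior of the frozen teacher. The loss below, combining per-layer hidden-state matching with a KL term on the logits, is a hedged sketch; the function name, weighting, and choice of targets are assumptions for illustration.

import torch.nn.functional as F

def conversion_distillation_loss(student_hidden, teacher_hidden,
                                 student_logits, teacher_logits,
                                 alpha=1.0, beta=1.0):
    """Illustrative distillation objective for attention-to-linear-attention
    conversion: match intermediate hidden states layer by layer and match
    the teacher's output distribution. Not the exact RADLADS procedure."""
    hidden_loss = sum(
        F.mse_loss(s, t.detach())
        for s, t in zip(student_hidden, teacher_hidden)
    ) / len(student_hidden)
    kl_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return alpha * hidden_loss + beta * kl_loss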

Model Specifications

| Specification | Value |
|---|---|
| Base Architecture | RWKV v7-based hxa07A |
| Teacher Model | ByteDance/Seed-OSS-36B |
| Hidden Dimension | 5,120 |
| Attention Dimension | 10,240 |
| RWKV Layers | 54 |
| NoPE-GQA Layers | 10 |
| Total Layers | 64 |
| KV Cache Reduction | 84.4% |
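
The KV cache reduction follows directly from the layer counts above: only the 10 NoPE-GQA layers keep a growing key/value cache, while the 54 RWKV layers carry a fixed-size recurrent state, so roughly 1 − 10/64 ≈ 84.4% of the per-token KV cache is eliminated relative to a 64-layer full-attention stack:

# Only the NoPE-GQA layers need a KV cache that grows with context length;
# the RWKV layers keep a constant-size state.
kv_layers, total_layers = 10, 64
reduction = 1 - kv_layers / total_layers
print(f"{reduction:.1%}")  # 84.4%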

Performance Benchmarks

Long Context Performance

  • Passkey Retrieval: 236k tokens (base model: 500k tokens)
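
For context, passkey retrieval checks whether the model can recover a short random key buried inside long filler text. The snippet below is a minimal sketch of how such a prompt is typically built; it is not the harness used to obtain the number above, and the filler text, key format, and token-count approximation are assumptions.

import random

def build_passkey_prompt(approx_tokens=236_000,
                         filler="The grass is green. The sky is blue. "):
    # Hide a random 5-digit key in the middle of repeated filler text.
    passkey = str(random.randint(10_000, 99_999))
    needle = f"The pass key is {passkey}. Remember it. "
    haystack = filler * (approx_tokens // 10)  # filler is roughly 10 tokens
    mid = len(haystack) // 2
    prompt = (haystack[:mid] + needle + haystack[mid:]
              + "\nWhat is the pass key?")
    return prompt, passkey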

Benchmark Results

| Task | Metric | RWKV-Seed-OSS-36B-hxa07A | Seed-OSS-36B |
|---|---|---|---|
| arc_challenge | acc_norm | 0.6476 | 0.6399 |
| arc_easy | acc_norm | 0.8561 | 0.8363 |
| hellaswag | acc_norm | 0.8172 | 0.8257 |
| lambada_openai | acc | 0.6467 | 0.6447 |
| piqa | acc | 0.8020 | 0.7987 |
| sciq | acc | 0.9780 | 0.9760 |
| winogrande | acc | 0.7293 | 0.7340 |
| mmlu | acc | 0.7960 | 0.8241 |
| gsm8k (2k) | flex-match | 0.8931 | 0.9393 |

Training Details

| Parameter | Value |
|---|---|
| Hardware | 8× AMD Instinct MI325X |
| Training Duration | 36 hours |

Significance and Future Potential

Linear hybrid models represent a promising approach to dramatically reducing computational costs, especially for long-context scenarios. This architecture achieves:

  • 10× or greater reduction in inference costs
  • Reasoning with essentially constant per-token latency, independent of context length
  • Improved AI accessibility

Practical Benefits

With limited GPU resources, this model enables:

  • Significantly higher multi-batch inference throughput compared to pure Transformer models
  • Substantial reduction in deployment costs
  • Efficient scaling for production environments

Usage with Hugging Face Transformers

This model requires the flash-linear-attention package:

pip install flash-linear-attention

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa07A"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
 I can't remember the name or the youtube URL that people use to link as an example url.
 What's song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens so only the new completion remains.
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Code Repositories

Research Journey

Over the past several months of researching heterogeneous architecture distillation techniques, we encountered numerous challenges when training models exceeding 14B parameters. However, we have now achieved spike-free training even for models exceeding 30B parameters.

Personal note: I estimate I consumed over 100 liters of coffee during this research! (laughs)

We believe that once this conversion technology is fully established, it will enable us to reduce the operational costs of existing models to one-tenth or less while maintaining their capabilities.

Feedback Welcome

We eagerly welcome your feedback and observations on this model!

License

Apache 2.0

Contact

https://x.com/_m0se_


This model card was generated by RWKV-Seed-OSS-36B-hxa07A.

© 2025 OpenMOSE

