RWKV-Seed-OSS-36B-hxa079

Acknowledgment

This project received computational resources and technical support from Recursal.AI. I'm deeply grateful for their support!

This is an experimental model that converts most of the base Transformer LLM's attention layers to RWKV linear attention using the RADLADS conversion method.


Model Overview

  • Model Name: RWKV-Seed-OSS-36B-hxa079
  • Architecture: RWKV “hxa079+” hybrid — RWKV-Attention strategically interleaved with NoPE FullAttention
  • Base Model: ByteDance-Seed/Seed-OSS-36B-Instruct
  • Model Revision: alpha
  • Parameters: ~37.1B
  • Context Window (Passkey): 130k

Architecture Details

  • RWKV Layers: Interleaved RWKV blocks based on the hxa079 design

  • Transformer Layers: Placed at strategic depths to enhance long-context performance

  • Hybrid Design:

    • RWKV provides temporal decay and efficient recurrent-style state handling
    • NoPE (No Positional Embedding) FullAttention augments global reasoning without redundant positional encoding (a layer-schedule sketch follows this list)
  • LoRA Ranks (low-rank projection sizes inside each RWKV block):

    • Decay: 448
    • ICLR (in-context learning rate): 192
    • Value Residual Mix: 128
    • Key Residual Mix: 128
    • Gate: 576
  • RoPE Usage: Enabled (use_rope: true), aligning positional encoding with RWKV blocks
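
A rough illustration of the hybrid layout is below. This is a sketch only: the specific full-attention layer indices are assumptions chosen for the example; the authoritative schedule lives in the model's config and custom modeling code.

# Illustrative only: one way an hxa079-style hybrid schedule could be expressed.
# The layer indices below are assumed for the sake of example, not taken from
# this model's actual configuration.
NUM_LAYERS = 64
FULL_ATTENTION_LAYERS = {15, 31, 47, 63}  # hypothetical "strategic depths"

layer_types = [
    "full_attention_nope" if i in FULL_ATTENTION_LAYERS else "rwkv"
    for i in range(NUM_LAYERS)
]
print(f"{layer_types.count('rwkv')} RWKV layers, "
      f"{layer_types.count('full_attention_nope')} NoPE full-attention layers")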


Key Hyperparameters

  • Hidden Size: 5120
  • Intermediate Size: 27,648
  • Head Dimension: 128
  • Attention Heads: 80
  • Key/Value Heads: 8
  • Hidden Layers: 64
  • Max Position Embeddings: 524,288
  • Activation: SiLU
  • Dropout: 0.1 (residual & attention)
  • Bias: Disabled for MLP & Attention Output
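
For reference, a minimal sketch of these values as a Python dict. The field names are assumptions that mirror common Hugging Face config.json conventions, not a copy of the shipped config; the last line shows the implied grouped-query-attention ratio.

# Sketch only: key hyperparameters as a plain dict (field names are assumed).
hparams = {
    "hidden_size": 5120,
    "intermediate_size": 27648,
    "head_dim": 128,
    "num_attention_heads": 80,
    "num_key_value_heads": 8,
    "num_hidden_layers": 64,
    "max_position_embeddings": 524288,
    "hidden_act": "silu",
    "residual_dropout": 0.1,
    "attention_dropout": 0.1,
    "mlp_bias": False,
    "attention_output_bias": False,
}

# 80 query heads sharing 8 KV heads -> 10 query heads per KV group (GQA).
print(hparams["num_attention_heads"] // hparams["num_key_value_heads"],
      "query heads per KV head")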

Evaluation

Performance evaluation is ongoing. The model shows promising results in:

  • Maintaining the base model's capabilities while gaining linear-attention efficiency
  • Significantly improved needle-in-a-haystack performance compared to pure RWKV architectures
  • Competitive performance on standard language modeling benchmarks

Current results (base model scores in parentheses):

  • MMLU: 78.39% (base: 82.41%)
  • GSM8K: 86.88% (base: 93.93%), with a 2048-token generation limit
  • Passkey retrieval: 130k+ tokens (base: 500k); a minimal retrieval check is sketched below
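
The passkey figure requires a long-context retrieval harness to reproduce; the authors' script is not included here. Below is a hypothetical needle-in-a-haystack check in which the filler text, prompt wording, and context length are all illustrative assumptions. It assumes a model and tokenizer loaded as in the Usage section further down.

# Hypothetical passkey (needle-in-a-haystack) check -- not the authors' harness.
# Assumes `model` and `tokenizer` are already loaded as shown in the Usage section.
import random

def passkey_check(model, tokenizer, n_filler: int = 2000) -> bool:
    passkey = random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. " * n_filler
    half = len(filler) // 2
    prompt = (
        filler[:half]
        + f"The pass key is {passkey}. Remember it. "
        + filler[half:]
        + "\nWhat is the pass key? The pass key is"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    # Decode only the continuation produced after the prompt.
    answer = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return str(passkey) in answer

print("passkey retrieved:", passkey_check(model, tokenizer))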

Usage with RWKV-Infer

Usage with Hugging Face Transformers

Requires the flash-linear-attention package:

pip install flash-linear-attention

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079"

# trust_remote_code is required: the hybrid RWKV/attention layers are defined
# in the repository's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
 I can't remember the name or the YouTube URL that people use to link as an example URL.
 What's the song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated continuation remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
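
To stream tokens to stdout as they are produced, the standard transformers TextStreamer can be used; nothing model-specific is assumed here, and it reuses the tokenizer and model_inputs from the example above.

from transformers import TextStreamer

# Prints decoded tokens as they are generated; the prompt itself is skipped.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)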

Code Repositories

Model Card Contact

OpenMOSE - 2025
