---
license: apache-2.0
base_model:
- ByteDance-Seed/Seed-OSS-36B-Instruct
---

# RWKV-Seed-OSS-36B-hxa079

**Acknowledgment**

This project received computational resources and technical support from **Recursal.AI**. I'm deeply grateful for their support!

This is an experimental model that converts most of the Transformer attention layers of the base LLM to RWKV linear attention, using the **RADLADS** conversion method.

---

## Model Overview

* **Model Name:** RWKV-Seed-OSS-36B-hxa079
* **Architecture:** RWKV “hxa079+” hybrid — RWKV-Attention strategically interleaved with NoPE FullAttention
* **Base Model:** ByteDance-Seed/Seed-OSS-36B-Instruct
* **Model Revision:** alpha
* **Parameters:** ~37.1B
* **Context Window (Passkey):** 130k

---

## Architecture Details

* **RWKV Layers:** Interleaved RWKV blocks based on the `hxa079` design
* **Transformer Layers:** Placed at strategic depths to enhance long-context performance
* **Hybrid Design:**

  * RWKV provides temporal decay and efficient recurrent-style state handling
  * NoPE (No Positional Embedding) FullAttention augments global reasoning without redundant positional encoding
* **LoRA Customization:**

  * Rank Decay: 448
  * ICLR: 192
  * Value Residual Mix: 128
  * Key Residual Mix: 128
  * Gate: 576
* **RoPE Usage:** Enabled (`use_rope: true`), aligning positional encoding with RWKV blocks
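
Conceptually, the stack is a per-layer mix of RWKV blocks and NoPE full-attention blocks. The sketch below only illustrates that idea; the helper and the `full_attn_layers` indices are hypothetical placeholders, not the actual hxa079 layout, which is defined by the model's config.

```python
# Illustrative sketch of a hybrid layer schedule. The indices below are
# placeholders chosen for illustration, NOT the real hxa079 layer layout.
def layer_schedule(num_layers: int, full_attn_layers: set) -> list:
    """Return the block type used at each depth of a hybrid stack."""
    return [
        "full_attention_nope" if i in full_attn_layers else "rwkv_hxa079"
        for i in range(num_layers)
    ]

# Example: 64 layers with full attention at a few (hypothetical) depths.
print(layer_schedule(64, full_attn_layers={15, 31, 47, 63}))
```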

---

## Key Hyperparameters

* Hidden Size: 5120
* Intermediate Size: 27,648
* Head Dimension: 128
* Attention Heads: 80
* Key/Value Heads: 8
* Hidden Layers: 64
* Max Position Embeddings: 524,288
* Activation: SiLU
* Dropout: 0.1 (residual & attention)
* Bias: Disabled for MLP & Attention Output
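
As a quick check, these values can be read back from the published config. This is a minimal sketch; the attribute names follow common Transformers conventions and may not all match this model's custom config class, so missing names are printed as "not present".

```python
# Sketch: read key hyperparameters back from the model config.
# Attribute names are standard Transformers conventions (an assumption here);
# this model's custom config class may use different names.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "OpenMOSE/RWKV-Seed-OSS-36B-hxa079", trust_remote_code=True
)
for name in (
    "hidden_size", "intermediate_size", "head_dim", "num_attention_heads",
    "num_key_value_heads", "num_hidden_layers", "max_position_embeddings",
):
    print(name, getattr(config, name, "not present"))
```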

---


## Evaluation

Performance evaluation is ongoing. The model shows promising results in:
-   Maintaining base model capabilities while achieving linear attention efficiency
-   Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
-   Competitive performance on standard language modeling benchmarks
-   MMLU: 78.39% (base model: 82.41%)
-   GSM8K: 86.88% (base model: 93.93%), with a 2048-token generation limit
-   Passkey retrieval: 130k+ context (base model: 500k); a minimal sanity-check sketch follows below
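
The passkey figure can be spot-checked with a simple needle-in-a-haystack probe. The sketch below is only an illustration, not the protocol behind the number above: the filler text, passkey value, repetition count, and prompt wording are all arbitrary choices made for this example.

```python
# Minimal passkey-retrieval probe (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

passkey = "73914"  # arbitrary needle
filler = "The grass is green. The sky is blue. The sun is warm. " * 1000
prompt = (
    f"{filler}\nThe passkey is {passkey}. Remember it.\n{filler}\n"
    "What is the passkey? Answer with the number only."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
answer = tokenizer.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print("retrieved:", answer.strip(), "| expected:", passkey)
```

Increase the filler repetition count to push the probe toward the 130k-token range.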

## Usage with RWKV-Infer
-   **RWKV-Infer** is a Triton-based hybrid RWKV inference engine. Instructions for running hxa079 models are at: [https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F](https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F)


## Usage with Hugging Face Transformers

First, install `flash-linear-attention`:
```bash
pip install flash-linear-attention
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
 I can't remember the name or the youtube URL that people use to link as an example url.
 What's song name?"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

```
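
For interactive use, token-by-token streaming also works through the standard `TextStreamer` helper. The snippet below is a sketch that reuses `model`, `tokenizer`, and `model_inputs` from the example above; streaming behaviour with this custom model code has not been separately verified here.

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the prompt echo.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```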



## Code Repositories

-   **RADLADS Project Code:** The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
-   **ARWKV Project Code:** The original ARWKV training code can be found at: [https://github.com/yynil/RWKVInside](https://github.com/yynil/RWKVInside)
-   **Specific Training Code (OpenMOSE):** The training code for this particular model is available at: [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (Note: this repository is still under development and may contain bugs.)

## Model Card Contact

OpenMOSE - 2025