---
license: apache-2.0
base_model:
- ByteDance-Seed/Seed-OSS-36B-Instruct
---
# RWKV-Seed-OSS-36B-hxa079
**Acknowledgment**
This project received computational resources and technical support from **Recursal.AI**. I'm deeply grateful for their support!
This is an experimental model in which most of the base Transformer LLM's attention layers are converted to RWKV linear attention using the **RADLADS** method.
---
## Model Overview
* **Model Name:** RWKV-Seed-OSS-36B-hxa079
* **Architecture:** RWKV “hxa079+” hybrid — RWKV-Attention strategically interleaved with NoPE FullAttention
* **Base Model:** ByteDance-Seed/Seed-OSS-36B-Instruct
* **Model Revision:** alpha
* **Parameters:** ~37.1B
* **Context Window (Passkey):** 130k
---
## Architecture Details
* **RWKV Layers:** Interleaved RWKV blocks based on the `hxa079` design
* **Transformer Layers:** Placed at strategic depths to enhance long-context performance
* **Hybrid Design** (see the sketch after this list):
* RWKV provides temporal decay and efficient recurrent-style state handling
* NoPE (No Positional Embedding) FullAttention augments global reasoning without redundant positional encoding
* **LoRA Customization:**
* Rank Decay: 448
* ICLR: 192
* Value Residual Mix: 128
* Key Residual Mix: 128
* Gate: 576
* **RoPE Usage:** Enabled (`use_rope: true`), aligning positional encoding with RWKV blocks
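
The sketch below illustrates the hybrid layout described above: a 64-layer stack in which most depths use RWKV blocks and a few chosen depths use NoPE full attention. The class names, the `full_attention_layers` indices, and the builder function are hypothetical placeholders for illustration, not the actual hxa079 modeling code.

```python
import torch.nn as nn

# Hypothetical stand-ins for the real hxa079 modules (illustration only).
class RWKVBlock(nn.Module):
    """Placeholder for an RWKV linear-attention block."""

class NoPEAttentionBlock(nn.Module):
    """Placeholder for a full-attention block without positional embeddings."""

def build_hybrid_stack(num_layers: int, full_attention_layers: set[int]) -> nn.ModuleList:
    """Interleave RWKV blocks with full-attention blocks at chosen depths."""
    return nn.ModuleList(
        [NoPEAttentionBlock() if i in full_attention_layers else RWKVBlock()
         for i in range(num_layers)]
    )

# Example: 64 layers with full attention at every 8th depth (indices are illustrative).
stack = build_hybrid_stack(64, full_attention_layers=set(range(7, 64, 8)))
```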
---
## Key Hyperparameters
* Hidden Size: 5120
* Intermediate Size: 27,648
* Head Dimension: 128
* Attention Heads: 80
* Key/Value Heads: 8
* Hidden Layers: 64
* Max Position Embeddings: 524,288
* Activation: SiLU
* Dropout: 0.1 (residual & attention)
* Bias: Disabled for MLP & Attention Output
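
These values can be cross-checked against the configuration shipped with the checkpoint. A minimal sketch is shown below; the field names follow common Hugging Face conventions and are assumptions, so consult the model's `config.json` for the exact keys.

```python
from transformers import AutoConfig

# Custom modeling code ships with the checkpoint, so trust_remote_code is needed.
config = AutoConfig.from_pretrained(
    "OpenMOSE/RWKV-Seed-OSS-36B-hxa079",
    trust_remote_code=True,
)

# Field names are assumed from common Hugging Face conventions; the exact
# keys in this model's config.json may differ.
print(config.hidden_size)              # expected: 5120
print(config.intermediate_size)        # expected: 27648
print(config.num_attention_heads)      # expected: 80
print(config.num_key_value_heads)      # expected: 8
print(config.num_hidden_layers)        # expected: 64
print(config.max_position_embeddings)  # expected: 524288
```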
---
## Evaluation
Performance evaluation is ongoing. The model shows promising results in:
- Maintaining base model capabilities while achieving linear attention efficiency
- Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
- Competitive performance on standard language modeling benchmarks
- MMLU: 78.39% (base model: 82.41%)
- GSM8K: 86.88% (base model: 93.93%), with a 2,048-token generation limit
- Passkey retrieval: 130k+ tokens (base model: 500k)
## Usage with RWKV-Infer
- **RWKV-Infer** is a Triton-based hybrid RWKV inference engine. Instructions for running hxa079 models with it are available at: [https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F](https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F)
## Usage with Hugging Face Transformers
Install the `flash-linear-attention` package first:
```bash
pip install flash-linear-attention
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079"

# trust_remote_code is required because the hxa079 hybrid architecture
# ships its own modeling code with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """There is a very famous song that I recall by the singer's surname as Astley.
I can't remember the name or the youtube URL that people use to link as an example url.
What's the song name?"""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
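To stream tokens to stdout as they are generated, the standard `TextStreamer` utility from Transformers can be attached to `generate`. This continues the example above (reusing `model`, `tokenizer`, and `model_inputs`) and is generic Transformers functionality, not something specific to this model:

```python
from transformers import TextStreamer

# skip_prompt avoids echoing the chat-template prompt back to stdout;
# skip_special_tokens is forwarded to the decode step.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```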
## Code Repositories
- **RADLADS Project Code:** The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
- **ARWKV Project Code:** The original ARWKV training code can be found at: [https://github.com/yynil/RWKVInside](https://github.com/yynil/RWKVInside)
- **Specific Training Code (OpenMOSE):** The training code for this particular model is available at: [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (Note: this repository is still under development and may contain bugs.)
## Model Card Contact
OpenMOSE - 2025