--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- ByteDance-Seed/Seed-OSS-36B-Instruct |
|
|
--- |
|
|
|
|
|
# RWKV-Seed-OSS-36B-hxa079 |
|
|
|
|
|
**Acknowledgment** |
|
|
|
|
|
This project received computational resources and technical support from **Recursal.AI**. I'm deeply grateful for their support! |
|
|
|
|
|
This is an experimental model that converts most of the base Transformer LLM's attention layers to RWKV linear attention using the **RADLADS** method.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
* **Model Name:** RWKV-Seed-OSS-36B-hxa079 |
|
|
* **Architecture:** RWKV “hxa079+” hybrid — RWKV-Attention strategically interleaved with NoPE FullAttention |
|
|
* **Base Model:** ByteDance-Seed/Seed-OSS-36B-Instruct |
|
|
* **Model Revision:** alpha |
|
|
* **Parameters:** ~37.1B |
|
|
* **Context Window (Passkey):** 130k |
|
|
|
|
|
--- |
|
|
|
|
|
## Architecture Details |
|
|
|
|
|
* **RWKV Layers:** Interleaved RWKV blocks based on the `hxa079` design |
|
|
* **Transformer Layers:** Placed at strategic depths to enhance long-context performance |
|
|
* **Hybrid Design** (see the layer-layout sketch after this list):
|
|
|
|
|
* RWKV provides temporal decay and efficient recurrent-style state handling |
|
|
* NoPE (No Positional Embedding) FullAttention augments global reasoning without redundant positional encoding |
|
|
* **LoRA Customization:** |
|
|
|
|
|
* Rank Decay: 448 |
|
|
* ICLR: 192 |
|
|
* Value Residual Mix: 128 |
|
|
* Key Residual Mix: 128 |
|
|
* Gate: 576 |
|
|
* **RoPE Usage:** Enabled (`use_rope: true`), aligning positional encoding with RWKV blocks |
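To make the interleaving above more concrete, here is a minimal sketch of how a per-layer plan for the 64-layer stack might look. The spacing of full-attention layers, the block-type names, and the helper function are illustrative assumptions, not the actual conversion code.

```python
# Hypothetical sketch of the hybrid layer layout: mostly RWKV blocks,
# with a few NoPE full-attention layers at regular depths.
# The interval of 8 and the type names are assumptions for illustration only.

NUM_LAYERS = 64
FULL_ATTN_INTERVAL = 8  # assumed spacing of NoPE full-attention layers


def build_layer_plan(num_layers: int, interval: int) -> list[str]:
    """Return a per-layer list of block types for the hybrid stack."""
    plan = []
    for idx in range(num_layers):
        if (idx + 1) % interval == 0:
            plan.append("full_attention_nope")  # global attention, no positional embedding
        else:
            plan.append("rwkv_hxa079")          # linear-attention RWKV block
    return plan


if __name__ == "__main__":
    plan = build_layer_plan(NUM_LAYERS, FULL_ATTN_INTERVAL)
    print(plan.count("rwkv_hxa079"), "RWKV layers,",
          plan.count("full_attention_nope"), "full-attention layers")
```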
|
|
|
|
|
--- |
|
|
|
|
|
## Key Hyperparameters |
|
|
|
|
|
* Hidden Size: 5120 |
|
|
* Intermediate Size: 27,648 |
|
|
* Head Dimension: 128 |
|
|
* Attention Heads: 80 |
|
|
* Key/Value Heads: 8 |
|
|
* Hidden Layers: 64 |
|
|
* Max Position Embeddings: 524,288 |
|
|
* Activation: SiLU |
|
|
* Dropout: 0.1 (residual & attention) |
|
|
* Bias: Disabled for MLP & Attention Output |
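For reference, the hyperparameters above can be summarized as a config-style dictionary. The key names below follow common Hugging Face conventions and are assumptions; the model's own `config.json` is authoritative.

```python
# Config-style summary of the hyperparameters listed above.
# Key names are assumptions based on common Hugging Face conventions.

hxa079_config = {
    "hidden_size": 5120,
    "intermediate_size": 27648,
    "head_dim": 128,
    "num_attention_heads": 80,
    "num_key_value_heads": 8,        # grouped-query attention
    "num_hidden_layers": 64,
    "max_position_embeddings": 524288,
    "hidden_act": "silu",
    "residual_dropout": 0.1,         # assumed key name
    "attention_dropout": 0.1,        # assumed key name
    "mlp_bias": False,
    "attention_output_bias": False,  # assumed key name
    "use_rope": True,
}

# Sanity check: query heads must divide evenly across key/value heads (GQA).
assert hxa079_config["num_attention_heads"] % hxa079_config["num_key_value_heads"] == 0
print("query heads per KV head:",
      hxa079_config["num_attention_heads"] // hxa079_config["num_key_value_heads"])
```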
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Performance evaluation is ongoing. The model shows promising results in: |
|
|
- Maintaining base model capabilities while achieving linear attention efficiency |
|
|
- Significantly improved needle-in-haystack task performance compared to pure RWKV architectures |
|
|
- Competitive performance on standard language modeling benchmarks |
|
|
- MMLU: 78.39% (base model: 82.41%)

- GSM8K: 86.88% (base model: 93.93%), with a 2,048-token generation limit

- Passkey retrieval: 130k+ context (base model: 500k); a prompt-construction sketch follows this list
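As a rough illustration of the passkey (needle-in-a-haystack) setting referenced above, the sketch below builds a long filler prompt with a hidden passkey. The filler sentence, passkey format, and target length are assumptions; the actual evaluation procedure may differ.

```python
# Hypothetical passkey prompt builder: bury a short "needle" inside a long
# haystack of filler text, then ask the model to recall it.
# Filler text, passkey format, and target length are assumptions.

import random


def build_passkey_prompt(passkey: str, target_words: int = 90_000) -> str:
    filler = "The grass is green. The sky is blue. The sun is bright. "
    needle = f"The passkey is {passkey}. Remember it. "
    words_per_chunk = len(filler.split())
    n_chunks = target_words // words_per_chunk
    chunks = [filler] * n_chunks
    chunks.insert(n_chunks // 2, needle)  # hide the needle midway through the haystack
    question = "\nWhat is the passkey mentioned in the text above?"
    return "".join(chunks) + question


prompt = build_passkey_prompt(passkey=str(random.randint(10000, 99999)))
print("prompt length:", len(prompt.split()), "words")
# Feed `prompt` through the chat template and `model.generate`
# as shown in the Transformers usage example below.
```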
|
|
|
|
|
## Usage with RWKV-Infer |
|
|
- **RWKV-Infer**, a Triton-based hybrid RWKV inference engine; instructions for running hxa079 models are available at: [https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F](https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F)
|
|
|
|
|
|
|
|
## Usage with Hugging Face Transformers |
|
|
|
|
|
Requires the `flash-linear-attention` package:
|
|
```bash |
|
|
pip install flash-linear-attention |
|
|
``` |
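To confirm the kernel library is importable, you can run a quick check. To the best of my knowledge the package installs under the module name `fla`; adjust if your environment differs.

```python
# Quick sanity check that flash-linear-attention is installed.
# The module name `fla` is an assumption about how the package imports.
from importlib.metadata import version

import fla  # noqa: F401

print("flash-linear-attention", version("flash-linear-attention"))
```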
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
model_name = "OpenMOSE/RWKV-Seed-OSS-36B-hxa079" |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype="auto", |
|
|
device_map="auto", |
|
|
trust_remote_code=True, |
|
|
) |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
|
|
prompt = """There is a very famous song that I recall by the singer's surname as Astley. |
|
|
I can't remember the name or the youtube URL that people use to link as an example url. |
|
|
What's the song name?"""
|
|
messages = [ |
|
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
|
{"role": "user", "content": prompt}, |
|
|
] |
|
|
text = tokenizer.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
model_inputs = tokenizer([text], return_tensors="pt").to(model.device) |
|
|
|
|
|
generated_ids = model.generate(**model_inputs, max_new_tokens=512) |
|
|
generated_ids = [ |
|
|
output_ids[len(input_ids) :] |
|
|
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
|
|
] |
|
|
|
|
|
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
|
|
|
|
|
``` |
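If you prefer to watch tokens as they are produced, Transformers' `TextStreamer` can be attached to `generate`. The snippet below reuses `model`, `tokenizer`, and `model_inputs` from the example above.

```python
# Optional: stream the completion to stdout as it is generated,
# reusing `model`, `tokenizer`, and `model_inputs` from the block above.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```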
|
|
|
|
|
|
|
|
|
|
|
## Code Repositories |
|
|
|
|
|
- **RADLADS Project Code:** The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS) |
|
|
- **ARWKV Project Code:** The original ARWKV training code can be found at: [https://github.com/yynil/RWKVInside](https://github.com/yynil/RWKVInside)
|
|
- **Specific Training Code (OpenMOSE):** The training code for this particular model is available at: [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (Note: this repository is still under development and may contain bugs.) |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
OpenMOSE - 2025 |
|
|
|
|
|
|
|
|
|