Benjamin-eecs and nielsr (HF Staff) committed · Commit 17c08e2 · verified · 1 Parent(s): 2a6b119

feat(improve model card): add pipeline tag, library name, quickstart, and expanded details (#1)


- Improve model card: Add pipeline tag, library name, quickstart, and expanded details (46016f8ae00b98f5ad2b637f77f0d1e4155cd206)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +67 -3
README.md CHANGED
@@ -1,7 +1,9 @@
  ---
- license: apache-2.0
  base_model:
  - Qwen/Qwen3-4B-Base
  ---

  # Spiral-Qwen3-4B
@@ -16,11 +18,65 @@ base_model:

  This model is trained with self-play on multi-games (TicTacToe, Kuhn Poker, Simple Negotiation) using the SPIRAL framework.

- <img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/framework.png" width=100%/>


  ## Citation

  ```latex
  @article{liu2025spiral,
  title={SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning},
@@ -29,4 +85,12 @@ This model is trained with self-play on multi-games (TicTacToe, Kuhn Poker, Simp
  year={2025},
  url={https://arxiv.org/abs/2506.24119}
  }
- ```

  ---
  base_model:
  - Qwen/Qwen3-4B-Base
+ license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

  # Spiral-Qwen3-4B
 

  This model is trained with self-play on multi-games (TicTacToe, Kuhn Poker, Simple Negotiation) using the SPIRAL framework.

+ Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on expert-curated problem-answer pairs and domain-specific reward engineering.
+
+ We introduce SPIRAL, a self-play framework where models learn by playing **multi-turn, zero-sum games against continuously improving versions of themselves**, eliminating the need for human supervision. Through zero-sum self-play, SPIRAL generates an **_infinite curriculum_** of progressively challenging problems as models must constantly adapt to stronger opponents.
+
+ Applying SPIRAL to Qwen3 base models in two-player zero-sum text games, we observe that the agents develop advanced reasoning strategies to win these competitive games. Furthermore, the trained models show substantial gains on a range of math and general reasoning benchmarks. These results suggest that self-play in zero-sum games can naturally induce transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/teaser-1.png" width="100%" /></p>
+ <p align="center"><img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/fig1-1.png" width="100%" /></p>
+
+ ## Architecture
+
+ SPIRAL employs an actor-learner architecture for scalable self-play training. Parallel actors sample trajectories from a diverse set of games using vectorized environments. A single policy $\pi_t$ plays both roles, generating zero-sum, sparse-reward game trajectories. The centralized learner processes these trajectories using Role-conditioned Advantage Estimation (RAE) to compute separate advantages, $A_0(s,a)$ and $A_1(s,a)$, for each role. These are then used for on-policy reinforcement learning updates.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/framework.png" width="90%" /></p>
+
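To make the Role-conditioned Advantage Estimation step described above concrete, here is a minimal, hypothetical sketch (not part of this commit and not the SPIRAL reference implementation): one running baseline is kept per (game, role), and each role's sparse zero-sum return is centered by its own baseline to give the separate advantages $A_0$ and $A_1$ used for the policy update. The class name, the EMA baseline form, and the data layout are illustrative assumptions; see the [GitHub repository](https://github.com/spiral-rl/spiral) for the actual code.

```python
# Illustrative sketch only -- names and the EMA baseline form are assumptions.
from collections import defaultdict

class RoleConditionedAdvantage:
    """One running baseline per (game, role); the advantage is the
    role's episode return minus that role-specific baseline."""

    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha                  # EMA decay for the baselines
        self.baseline = defaultdict(float)  # (game, role) -> running mean return

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        key = (game, role)
        # Update the role-conditioned baseline with an exponential moving average.
        self.baseline[key] = self.alpha * self.baseline[key] + (1.0 - self.alpha) * episode_return
        # Sparse zero-sum reward: every action of this role's episode
        # shares the same advantage A_role = R_role - b_role.
        return episode_return - self.baseline[key]

rae = RoleConditionedAdvantage()
# A TicTacToe self-play episode where the shared policy won as role 0 (+1) and lost as role 1 (-1):
a0 = rae.advantage("TicTacToe", role=0, episode_return=+1.0)  # A_0(s, a)
a1 = rae.advantage("TicTacToe", role=1, episode_return=-1.0)  # A_1(s, a)
print(a0, a1)
```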
+ ## Usage (Quickstart)

+ You can easily load and use this model with the `transformers` library:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ model_id = "spiral-rl/Spiral-Qwen3-4B"
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Example usage for text generation following the Qwen chat template
+ prompt = "What is the capital of France?"
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # Using a simple generation config (adjust as needed)
+ generation_config = GenerationConfig(
+     max_new_tokens=50,
+     temperature=0.7,
+     do_sample=True,
+     top_p=0.9
+ )
+
+ outputs = model.generate(**inputs, generation_config=generation_config)
+ generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(generated_text)
+ # Expected output: "Paris" (or similar)
+ ```
+
+ For more advanced usage, including training and evaluation scripts, please refer to the [GitHub repository](https://github.com/spiral-rl/spiral).
 
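As an alternative to the quickstart above, recent versions of `transformers` can also drive the model through the high-level `pipeline` API with chat-style messages. The snippet below is an untested convenience sketch, not part of this commit; exact behavior depends on your installed version.

```python
from transformers import pipeline
import torch

# Convenience sketch: assumes a transformers version whose text-generation
# pipeline accepts chat-style message lists.
pipe = pipeline(
    "text-generation",
    model="spiral-rl/Spiral-Qwen3-4B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
out = pipe(messages, max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.9)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```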
  ## Citation

+ If you find our work useful for your research, please consider citing:
+
  ```latex
  @article{liu2025spiral,
  title={SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning},

  year={2025},
  url={https://arxiv.org/abs/2506.24119}
  }
+ ```
+
+ ## Acknowledgements
+
+ This work is supported by [PlasticLabs](https://plasticlabs.ai/) and [Sea AI Lab](https://sail.sea.com/), which provided the computing resources.
+ The language games are sampled from [TextArena](https://github.com/LeonGuertler/TextArena), a collection of competitive text-based games for language model evaluation and reinforcement learning.
+ The multi-agent, multi-turn RL training is implemented with 🌾 [Oat](https://github.com/sail-sg/oat), a modular and research-friendly LLM RL framework.
+ We explored PEFT experiments using [UnstableBaselines](https://github.com/LeonGuertler/UnstableBaselines), a lightweight, LoRA-first library for fast prototyping of self-play algorithms on TextArena.
+ The base models are from [Qwen3](https://huggingface.co/Qwen/Qwen3-4B).