GPT-OSS 2048 Strategy Generator (GRPO Fine-tuned)
This is a fine-tuned version of unsloth/gpt-oss-20b, trained with GRPO (Group Relative Policy Optimization) to generate Python strategies for the game 2048.
Gist: https://gist.github.com/bigsnarfdude/d444c1c9e6cf5b7377df22ea97eab10d
Model Description
- Base Model: unsloth/gpt-oss-20b (20B parameters, 4-bit quantized)
- Training Method: GRPO reinforcement learning
- Task: Generate Python functions that play 2048 optimally
- Training Steps: 1000
- Performance: 60% win rate (reaching the 2048 tile) with an average score of 27,138
Training Details
Architecture
- LoRA Configuration:
  - Rank: 4
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Max sequence length: 768 tokens
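With Unsloth, this adapter configuration would be attached roughly as sketched below; `lora_alpha` and `lora_dropout` are assumptions, since the card lists only the rank and target modules.

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model, then attach LoRA adapters matching the
# configuration above. lora_alpha/lora_dropout are illustrative assumptions.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=768,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=4,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=8,      # assumption: not stated in the card
    lora_dropout=0.0,  # assumption
)
```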
GRPO Parameters
- Learning rate: 5e-5
- Batch size: 2
- Weight decay: 0.01
- Temperature: 1.0
- Max training steps: 1000
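In TRL, these hyperparameters map onto `GRPOConfig` roughly as follows; `num_generations` and the length limits are assumptions, since the card only lists the values above.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    temperature=1.0,
    max_steps=1000,
    max_prompt_length=256,      # assumption: not stated in the card
    max_completion_length=512,  # assumption: fits the 768-token budget
    num_generations=2,          # assumption: must divide the effective batch size
)
```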
Reward Functions
The model was trained with three reward components, sketched below:
- `function_works`: generated code executes without errors
- `no_cheating`: no forbidden modules or direct game-state manipulation
- `strategy_succeeds`: achieves high scores and wins games
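TRL's `GRPOTrainer` accepts plain Python callables that return one score per sampled completion. The sketch below shows what the first two components might look like; the bodies, score values, and forbidden-token list are illustrative assumptions, not the author's actual implementation (`strategy_succeeds` would additionally need a 2048 simulator to play out games).

```python
# Hedged sketch of GRPO reward functions in TRL's interface.
# Score values and the forbidden-token list are illustrative assumptions.
FORBIDDEN = ("import os", "import sys", "subprocess", "open(", "exec(")

def function_works(completions, **kwargs):
    """+1 if the completion defines a callable `strategy`, else -1."""
    scores = []
    for code in completions:
        namespace = {}
        try:
            exec(code, namespace)  # in practice, run this in a sandbox
            scores.append(1.0 if callable(namespace.get("strategy")) else -1.0)
        except Exception:
            scores.append(-1.0)
    return scores

def no_cheating(completions, **kwargs):
    """Penalize forbidden modules or direct game-state manipulation."""
    return [-1.0 if any(tok in code for tok in FORBIDDEN) else 1.0
            for code in completions]
```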
Training Performance
| Checkpoint | Win Rate | Avg Score | Max Tile |
|---|---|---|---|
| 100 | 0.0% | 5,674 | 512 |
| 900 | 100.0% | 22,794 | 2048 |
| 1000 | 60.0% | 27,138 | 2048 |
The model shows a steep learning curve: the win rate climbs from 0% at checkpoint 100 to 100% at checkpoint 900, then settles at 60% at checkpoint 1000 while the average score continues to rise.
Usage
Loading the Model
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="vincentoh/gpt-oss-2048-gpro",
    max_seq_length=768,
    dtype=None,
    load_in_4bit=True,
)

# Set up for inference
FastLanguageModel.for_inference(model)
```
Generating Strategies
prompt = """Create a Python function called `strategy` that takes a 2D board as input and returns a move direction ('W', 'A', 'S', or 'D') to play the 2048 game optimally.
Requirements:
- Input: board (list of lists representing the game state)
- Output: single character string ('W' for up, 'A' for left, 'S' for down, 'D' for right)
- Goal: Achieve the 2048 tile with high score
Example usage:
```python
board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
move = strategy(board)
Implement the strategy function:
"""
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7)
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
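Since the model returns free-form text, you will typically want to extract the `strategy` function and sanity-check it before use. A minimal sketch, assuming the completion contains a single `def strategy(...)` block (the regex and validation board are illustrative):

```python
import re

# Pull the first `def strategy(...)` block out of the generated text.
match = re.search(r"def strategy\(.*?(?=\ndef |\Z)", generated_code, re.S)
assert match, "no strategy function found in the model output"

namespace = {}
exec(match.group(0), namespace)  # only execute code you trust or sandbox
strategy = namespace["strategy"]

# Check that it returns a legal move on a sample board.
board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
move = strategy(board)
assert move in ("W", "A", "S", "D"), f"illegal move: {move!r}"
print("validated move:", move)
```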
Example Generated Strategy
Here's an example of a winning strategy generated by this model (60% win rate):
```python
def strategy(board):
    # Find reachable moves
    moves = []
    for i, row in enumerate(board):
        for j, val in enumerate(row):
            if val == 0:
                # Check if we can move a tile into this empty spot
                if i > 0 and board[i-1][j] != 0:
                    moves.append("S")
                if i < len(board)-1 and board[i+1][j] != 0:
                    moves.append("W")
                if j > 0 and board[i][j-1] != 0:
                    moves.append("D")
                if j < len(row)-1 and board[i][j+1] != 0:
                    moves.append("A")
    # Prefer moving towards top-left corner
    if "W" in moves: return "W"
    if "A" in moves: return "A"
    if "D" in moves: return "D"
    if "S" in moves: return "S"
    return "W"
```
Model Card
- Developed by: Vincent Oh
- Model type: Causal Language Model (GptOssForCausalLM)
- Language: English
- License: Apache 2.0
- Finetuned from: unsloth/gpt-oss-20b
Intended Use
This model is designed for:
- Generating 2048 game playing strategies
- Research in reinforcement learning for code generation
- Educational purposes in game AI development
- Benchmarking LLM code generation capabilities
Limitations
- Focused specifically on 2048 game strategies
- Performance may vary on other board sizes (the model was trained on 6x6 boards; the classic game uses 4x4)
- Generated code should be validated before execution
- Requires GPU with 20GB+ VRAM for full model inference
Hardware Requirements
- Recommended: NVIDIA GPU with 20GB+ VRAM (e.g., RTX 4090, A100)
- Minimum: 32GB system RAM for CPU inference (slow)
- Storage: 13GB for full model weights
Citation
If you use this model in your research, please cite:
```bibtex
@misc{gpt-oss-2048-grpo,
  author       = {Vincent Oh},
  title        = {GPT-OSS 2048 Strategy Generator},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/vincentoh/gpt-oss-2048-gpro}},
  note         = {Fine-tuned with GRPO reinforcement learning}
}
```
Training Infrastructure
- GPU: NVIDIA RTX 4070 Ti Super 16GB VRAM
- Training Duration: ~12 hours for 1000 steps
- Framework: Unsloth + TRL + Transformers
Acknowledgments
- Base model: unsloth/gpt-oss-20b
- Training framework: Unsloth
- GRPO implementation: TRL (Transformer Reinforcement Learning)
Contact
For questions or issues, please open an issue on the model repository or contact the author.