# Qwen2.5-3B-2048-Player
## Model Description
This model is a fine-tuned version of Qwen2.5-3B-Instruct that has learned to play the addictive 2048 game through Grouped Relative Policy Optimization (GRPO)!
### What makes it special?
- Smart Strategy: Trained to maximize tile combinations and reach the elusive 2048 tile
- Efficient: Uses 4-bit quantization via Unsloth for faster inference
- Reward-Optimized: Learned through actual gameplay, with rewards based on the max tile value and total board score
- XML Output: Returns moves in a clean XML format: `<move>left</move>`
## Model Details
- Base Model: Qwen/Qwen2.5-3B-Instruct
- Training Framework: Unsloth + OpenPipe
- Training Method: GRPO (Grouped Relative Policy Optimization)
- Quantization: 4-bit (via Unsloth)
- Context Length: 8192 tokens
- License: Apache-2.0
## Quick Start
```python
import torch
from unsloth import FastLanguageModel

# Load the model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="justinj92/Qwen2.5-3B-2048Player",
    max_seq_length=8192,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Game setup
messages = [
    {
        "role": "system",
        "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>"
    },
    {
        "role": "user",
        "content": "2 | 4 | _ | _\n"
                   "_ | 2 | _ | _\n"
                   "_ | _ | _ | _\n"
                   "_ | _ | _ | _"
    }
]

# Generate a move
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=100,
    do_sample=True,  # enable sampling so temperature/top_p take effect
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # <move>left</move>
```
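To act on the model's answer, the XML tag can be pulled out with a simple regex. The helper below is a minimal sketch (the function name `parse_move` is an assumption, not part of this repository); it returns `None` if the output doesn't match the expected format.

```python
import re

def parse_move(response: str):
    """Extract the move from '<move>left</move>'-style output, or None if absent."""
    match = re.search(r"<move>\s*(left|right|up|down)\s*</move>", response, re.IGNORECASE)
    return match.group(1).lower() if match else None

print(parse_move("<move>left</move>"))  # left
```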
## Training Details
### Training Hardware
- Cloud: Azure VM (Standard_NC40ads_H100_v5)
- Region: North Europe
### Training Strategy
The model was trained using GRPO with trajectories collected from self-play:
- Episodes per step: 18 trajectories
- Training steps: 10 iterations
- Learning rate: 3e-5
- Reward function (a minimal sketch follows this list):
  - Logarithmically scaled max tile value (80% weight)
  - Logarithmically scaled total board value (20% weight)
  - 2x bonus for reaching 2048
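The snippet below is a minimal sketch of that weighting, assuming the board is a 4x4 nested list of ints with 0 for empty cells and using log base 2 for the scaling; it is an illustrative reconstruction, not the actual training code.

```python
import math

def reward(board: list[list[int]]) -> float:
    """Illustrative reward: 80% log-scaled max tile + 20% log-scaled board sum,
    doubled once the 2048 tile is reached. Log base 2 is an assumption."""
    tiles = [v for row in board for v in row if v > 0]
    if not tiles:
        return 0.0
    max_tile, total = max(tiles), sum(tiles)
    score = 0.8 * math.log2(max_tile) + 0.2 * math.log2(total)
    if max_tile >= 2048:
        score *= 2.0  # 2x bonus for reaching 2048
    return score
```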
### Board Representation
The game board is represented as a simple text grid:
```
2 | 4 | 8 | 16
32 | 64 | 128 | 256
512 | 1024 | _ | 2
4 | 8 | 16 | 32
```

where `_` represents an empty cell.
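A small helper like the one below can render such a grid from a nested list before building the user prompt. The name `board_to_text` and the 0-for-empty convention are assumptions made for this sketch.

```python
def board_to_text(board: list[list[int]]) -> str:
    """Render a 4x4 board as the text grid shown above; 0 marks an empty cell."""
    return "\n".join(
        " | ".join(str(v) if v else "_" for v in row)
        for row in board
    )

# Example: the opening position used in the Quick Start prompt
print(board_to_text([[2, 4, 0, 0], [0, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]))
```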
## Performance
The model learns to:
- Prioritize corner strategies
- Build monotonic rows/columns
- Avoid getting stuck with unmergeable tiles
- Plan multiple moves ahead
## Technical Specifications
- Memory Requirements: ~4 GB VRAM (4-bit quantized)
- Inference Speed: Fast, thanks to Unsloth optimizations
- Compatible GPUs: Runs on a T4; faster on newer GPUs
## Citation
If you use this model, please consider citing:
```bibtex
@misc{qwen2048player2025,
  title={Qwen2.5-3B-2048-Player: A GRPO-trained 2048 Game Agent},
  author={JustinJ},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/justinj92/Qwen2.5-3B-2048Player}}
}
```
## Acknowledgments
- Thanks to the Qwen team for the amazing base model
- Unsloth for efficient fine-tuning
- OpenPipe for the GRPO implementation
## Contact
Feel free to open an issue on the model repository for questions or feedback!
Happy sliding! May all your tiles merge smoothly!
## Framework versions
- PEFT 0.15.2