# Qwen2.5-3B-2048-Player
## Model Description
This model is a fine-tuned version of Qwen2.5-3B-Instruct that has learned to play the addictive 2048 game through Grouped Relative Policy Optimization (GRPO)!
### What makes it special?
- Smart Strategy: Trained to maximize tile combinations and reach the elusive 2048 tile
- Efficient: Uses 4-bit quantization via Unsloth for faster inference
- Reward-Optimized: Learned through actual gameplay, with rewards based on the max tile value and total board score
- XML Output: Returns moves in a clean XML format: `<move>left</move>`
## Model Details
- Base Model: Qwen/Qwen2.5-3B-Instruct
- Training Framework: Unsloth + OpenPipe
- Training Method: GRPO (Grouped Relative Policy Optimization)
- Quantization: 4-bit (via Unsloth)
- Context Length: 8192 tokens
- License: Apache-2.0
## Quick Start
```python
import torch
from unsloth import FastLanguageModel

# Load the model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="justinj92/Qwen2.5-3B-2048Player",
    max_seq_length=8192,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Game setup
messages = [
    {
        "role": "system",
        "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>"
    },
    {
        "role": "user",
        "content": "2 | 4 | _ | _\n"
                   "_ | 2 | _ | _\n"
                   "_ | _ | _ | _\n"
                   "_ | _ | _ | _"
    }
]

# Generate a move
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=100,
    do_sample=True,  # enable sampling so temperature/top_p take effect
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # <move>left</move>
```
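To act on the model's answer, the XML tag can be pulled out with a simple regex. The helper below is a minimal sketch (the function name `parse_move` is an assumption, not part of this repository); it returns `None` if the output doesn't match the expected format.

```python
import re

def parse_move(response: str):
    """Extract the move from '<move>left</move>'-style output, or None if absent."""
    match = re.search(r"<move>\s*(left|right|up|down)\s*</move>", response, re.IGNORECASE)
    return match.group(1).lower() if match else None

print(parse_move("<move>left</move>"))  # left
```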
## Training Details
### Training Hardware
- Cloud: Azure VM (Standard_NC40ads_H100_v5)
- Region: North Europe
### Training Strategy
The model was trained using GRPO with trajectories collected from self-play:
- Episodes per step: 18 trajectories
- Training steps: 10 iterations
- Learning rate: 3e-5
- Reward function (a minimal sketch follows this list):
  - Logarithmically scaled max tile value (80% weight)
  - Logarithmically scaled total board value (20% weight)
  - 2x bonus for reaching 2048
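The snippet below is a minimal sketch of that weighting, assuming the board is a 4x4 nested list of ints with 0 for empty cells and using log base 2 for the scaling; it is an illustrative reconstruction, not the actual training code.

```python
import math

def reward(board: list[list[int]]) -> float:
    """Illustrative reward: 80% log-scaled max tile + 20% log-scaled board sum,
    doubled once the 2048 tile is reached. Log base 2 is an assumption."""
    tiles = [v for row in board for v in row if v > 0]
    if not tiles:
        return 0.0
    max_tile, total = max(tiles), sum(tiles)
    score = 0.8 * math.log2(max_tile) + 0.2 * math.log2(total)
    if max_tile >= 2048:
        score *= 2.0  # 2x bonus for reaching 2048
    return score
```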
### Board Representation
The game board is represented as a simple text grid:
```
2 | 4 | 8 | 16
32 | 64 | 128 | 256
512 | 1024 | _ | 2
4 | 8 | 16 | 32
```

where `_` represents an empty cell.
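A small helper like the one below can render such a grid from a nested list before building the user prompt. The name `board_to_text` and the 0-for-empty convention are assumptions made for this sketch.

```python
def board_to_text(board: list[list[int]]) -> str:
    """Render a 4x4 board as the text grid shown above; 0 marks an empty cell."""
    return "\n".join(
        " | ".join(str(v) if v else "_" for v in row)
        for row in board
    )

# Example: the opening position used in the Quick Start prompt
print(board_to_text([[2, 4, 0, 0], [0, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]))
```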
## Performance
The model learns to:
- Prioritize corner strategies
- Build monotonic rows/columns
- Avoid getting stuck with unmergeable tiles
- Plan multiple moves ahead
## Technical Specifications
- Memory Requirements: ~4 GB VRAM (4-bit quantized)
- Inference Speed: Fast, thanks to Unsloth optimizations
- Compatible GPUs: Runs on a T4; faster on newer GPUs
## Citation
If you use this model, please consider citing:
```bibtex
@misc{qwen2048player2025,
  title={Qwen2.5-3B-2048-Player: A GRPO-trained 2048 Game Agent},
  author={JustinJ},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/justinj92/Qwen2.5-3B-2048Player}}
}
```
## Acknowledgments
- Thanks to the Qwen team for the amazing base model
- Unsloth for efficient fine-tuning
- OpenPipe for the GRPO implementation
## Contact
Feel free to open an issue on the model repository for questions or feedback!
Happy sliding! May all your tiles merge smoothly!
## Framework versions
- PEFT 0.15.2