🎮 Qwen2.5-3B-2048-Player 🎯

A Qwen2.5-3B model fine-tuned to master the 2048 game using GRPO! 🚀


🌟 Model Description

This model is a fine-tuned version of Qwen2.5-3B-Instruct that has learned to play the addictive 2048 game through Group Relative Policy Optimization (GRPO)!

🎯 What makes it special?

  • 🧠 Smart Strategy: Trained to maximize tile combinations and reach the elusive 2048 tile
  • 🚀 Efficient: Uses 4-bit quantization via Unsloth for faster inference
  • 📈 Reward-Optimized: Learned through actual gameplay with rewards based on max tile value and total board score
  • 🎮 XML Output: Returns moves in clean XML format: <move>left</move>

🏗️ Model Details

  • Base Model: Qwen/Qwen2.5-3B-Instruct
  • Training Framework: Unsloth + OpenPipe
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Quantization: 4-bit (via Unsloth)
  • Context Length: 8192 tokens
  • License: Apache-2.0

🚀 Quick Start

from unsloth import FastLanguageModel

# Load the 4-bit model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="justinj92/Qwen2.5-3B-2048Player",
    max_seq_length=8192,
    dtype=None,  # let Unsloth pick: bfloat16 where supported, otherwise float16 (e.g. T4)
    load_in_4bit=True,
)
# Switch Unsloth to its inference mode
FastLanguageModel.for_inference(model)

# Game setup
messages = [
    {
        "role": "system",
        "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>"
    },
    {
        "role": "user", 
        "content": "2    | 4    | _    | _\n"
                   "_    | 2    | _    | _\n"
                   "_    | _    | _    | _\n"
                   "_    | _    | _    | _"
    }
]

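# Build the prompt with the chat template and move it to the GPU
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to("cuda")

# Sample a move; do_sample=True makes temperature and top_p take effect
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # e.g. <move>left</move>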
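
Because the move is sampled, the reply may occasionally contain extra text or an invalid direction. The helper below is a minimal sketch of how the <move> tag could be extracted and validated before applying it to a game; `extract_move` and `VALID_MOVES` are hypothetical names chosen for illustration and are not part of the model repository.

```python
import re

# The four directions listed in the system prompt
VALID_MOVES = {"left", "right", "up", "down"}

# Tolerates surrounding text, whitespace inside the tag, and mixed case
_MOVE_RE = re.compile(r"<move>\s*(\w+)\s*</move>", re.IGNORECASE)


def extract_move(response):
    """Return the move named in the first <move> tag, or None if absent or invalid."""
    match = _MOVE_RE.search(response)
    if match is None:
        return None
    move = match.group(1).lower()
    return move if move in VALID_MOVES else None
```

A caller could retry generation or fall back to a default move whenever `extract_move(response)` returns None.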

📊 Training Details

Training Hardware

  • Cloud: Azure VM (Standard_NC40ads_H100_v5)
  • Region: North Europe

Training Strategy

The model was trained using GRPO with trajectories collected from self-play:

  • Episodes per step: 18 trajectories
  • Training steps: 10 iterations
  • Learning rate: 3e-5
  • Reward function:
    • Logarithmically scaled max tile value (80% weight)
    • Logarithmically scaled total board value (20% weight)
    • 2x bonus for reaching 2048!
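
The reward description above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the training code: the normalization constants, the reading of "2x bonus" as doubling the weighted reward, and the function names are all hypothetical. The second function shows the group-relative normalization that gives GRPO its name, applied to the rewards of one group of trajectories.

```python
import math
import statistics

WINNING_TILE = 2048  # target tile named in this card
BOARD_CELLS = 16     # 4x4 board


def log_scaled(value, upper):
    """Map a value in [2, upper] to [0, 1] on a log2 scale (assumed normalization)."""
    if value < 2:
        return 0.0
    return (math.log2(value) - 1) / (math.log2(upper) - 1)


def episode_reward(board):
    """Weighted reward for a final board; None marks an empty cell."""
    tiles = [v for row in board for v in row if v is not None]
    if not tiles:
        return 0.0
    max_tile, total = max(tiles), sum(tiles)
    reward = (0.8 * log_scaled(max_tile, WINNING_TILE)
              + 0.2 * log_scaled(total, WINNING_TILE * BOARD_CELLS))
    if max_tile >= WINNING_TILE:
        reward *= 2.0  # assumption: the "2x bonus" doubles the weighted reward
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With 18 trajectories per step, `group_relative_advantages` would receive a list of 18 episode rewards, so each trajectory is scored relative to the others collected in the same step rather than on an absolute scale.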

Board Representation

The game board is represented as a simple text grid:

2    | 4    | 8    | 16
32   | 64   | 128  | 256
512  | 1024 | _    | 2
4    | 8    | 16   | 32

Where _ represents empty cells.
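
As a rough sketch, a board held as a list of rows could be converted to and from this text grid as follows. The padding width is inferred from the example above, and `render_board` and `parse_board` are hypothetical helper names, not functions from the model repository.

```python
def render_board(board):
    """Render rows of ints (None = empty cell) as the text grid used in the prompt."""
    lines = []
    for row in board:
        cells = ["_" if v is None else str(v) for v in row]
        # Pad every cell except the last to width 4 so the separators line up
        padded = [c.ljust(4) for c in cells[:-1]] + [cells[-1]]
        lines.append(" | ".join(padded))
    return "\n".join(lines)


def parse_board(text):
    """Inverse of render_board: turn the text grid back into rows of ints and None."""
    board = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.split("|")]
        board.append([None if c == "_" else int(c) for c in cells])
    return board
```

The output of `render_board` can be passed directly as the user message content in the Quick Start example.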

🎮 Performance

The model learns to:

  • ✅ Prioritize corner strategies
  • ✅ Build monotonic rows/columns
  • ✅ Avoid getting stuck with unmergeable tiles
  • ✅ Plan multiple moves ahead

🛠️ Technical Specifications

  • Memory Requirements: ~4GB VRAM (4-bit quantized)
  • Inference Speed: Fast thanks to Unsloth optimizations
  • Compatible GPUs: Works great on T4, better on newer GPUs
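
For a sense of where the memory figure comes from, the back-of-the-envelope calculation below estimates the size of the weights alone. It assumes roughly 3.09B parameters (the size listed on the model page) and treats every parameter as stored at the stated bit width, which is a simplification: some layers typically stay in 16-bit, and the KV cache, activations, and CUDA context add to the total.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 3.09e9  # assumption: parameter count as listed on the model page

four_bit = weight_memory_gib(N_PARAMS, 4)
bf16 = weight_memory_gib(N_PARAMS, 16)

print(f"4-bit weights: ~{four_bit:.2f} GiB")
print(f"BF16 weights:  ~{bf16:.2f} GiB")
```

Under these assumptions the 4-bit weights come to roughly 1.4 GiB, so most of the ~4GB figure would be headroom for everything other than the weights.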

📝 Citation

If you use this model, please consider citing:

@misc{qwen2048player2025,
  title={Qwen2.5-3B-2048-Player: A GRPO-trained 2048 Game Agent},
  author={JustinJ},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/justinj92/Qwen2.5-3B-2048Player}}
}

🤝 Acknowledgments

📧 Contact

Feel free to open an issue on the model repository for questions or feedback!


Happy sliding! May all your tiles merge smoothly! 🎮✨

Framework versions

  • PEFT 0.15.2

Model size: 3.09B params (Safetensors, BF16)