GPT-OSS 2048 Strategy Generator (GRPO Fine-tuned)

This is a fine-tuned version of unsloth/gpt-oss-20b, trained with GRPO (Group Relative Policy Optimization) to generate Python strategies for the 2048 game.

Gist

https://gist.github.com/bigsnarfdude/d444c1c9e6cf5b7377df22ea97eab10d

Model Description

  • Base Model: unsloth/gpt-oss-20b (20B parameters, 4-bit quantized)
  • Training Method: GRPO reinforcement learning
  • Task: Generate Python functions that play 2048 optimally
  • Training Steps: 1000
  • Performance: 60% win rate (reaching the 2048 tile), with an average score of 27,138

Training Details

Architecture

  • LoRA Configuration:
    • Rank: 4
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Max sequence length: 768 tokens
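
A minimal sketch of this adapter setup using Unsloth's API. The rank and target modules come from this card; lora_alpha, lora_dropout, and the gradient-checkpointing flag are assumptions, as the card does not state them:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=768,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=4,                                   # LoRA rank from this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=8,                          # assumption: not stated in this card
    lora_dropout=0.0,                      # assumption
    use_gradient_checkpointing="unsloth",  # assumption
)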

GRPO Parameters

  • Learning rate: 5e-5
  • Batch size: 2
  • Weight decay: 0.01
  • Temperature: 1.0
  • Max training steps: 1000
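
In TRL, these hyperparameters map onto a GRPOConfig roughly as below. num_generations, max_completion_length, and output_dir are assumptions not stated in this card:

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs",            # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    temperature=1.0,                 # sampling temperature for rollouts
    max_steps=1000,
    num_generations=2,               # assumption: completions sampled per prompt
    max_completion_length=512,       # assumption: must fit the 768-token window
)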

Reward Functions

The model was trained with three reward components:

  1. function_works - Generated code executes without errors
  2. no_cheating - No forbidden modules or direct game state manipulation
  3. strategy_succeeds - Achieves high scores and wins games
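
Under TRL's GRPO API, each component is a callable that scores a batch of completions and returns one float per completion. A hedged sketch of what function_works might look like; only the reward's intent comes from this card, the body and scoring values are illustrative assumptions:

def function_works(completions, **kwargs):
    # Reward 1.0 when the completion defines a callable `strategy`, else 0.0.
    rewards = []
    for completion in completions:
        namespace = {}
        try:
            exec(completion, namespace)  # run in an isolated namespace
            rewards.append(1.0 if callable(namespace.get("strategy")) else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# All three rewards would then be passed together to the trainer, e.g.:
# trainer = GRPOTrainer(model=model, args=training_args, train_dataset=dataset,
#                       reward_funcs=[function_works, no_cheating, strategy_succeeds])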

Training Performance

Checkpoint   Win Rate   Avg Score   Max Tile
100          0.0%       5,674       512
900          100.0%     22,794      2048
1000         60.0%      27,138      2048

The model shows a steep learning curve: the win rate climbs from 0% at checkpoint 100 to 100% at checkpoint 900, then settles at 60% by checkpoint 1000 while the average score continues to rise.

Usage

Loading the Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="vincentoh/gpt-oss-2048-gpro",
    max_seq_length=768,
    dtype=None,
    load_in_4bit=True,
)

# Set up for inference
FastLanguageModel.for_inference(model)

Generating Strategies

prompt = """Create a Python function called `strategy` that takes a 2D board as input and returns a move direction ('W', 'A', 'S', or 'D') to play the 2048 game optimally.

Requirements:
- Input: board (list of lists representing the game state)
- Output: single character string ('W' for up, 'A' for left, 'S' for down, 'D' for right)
- Goal: Achieve the 2048 tile with high score

Example usage:
```python
board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
move = strategy(board)
```

Implement the strategy function:

"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)  # do_sample so temperature takes effect
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
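
The output is free-form text, so the strategy function usually needs to be extracted and sanity-checked before use. A rough sketch of one way to do this; the regex, namespace handling, and checks are illustrative assumptions, not part of the model:

import re

def extract_strategy(generated_code):
    # Grab the first `def strategy` block (up to the next top-level def or EOF).
    match = re.search(r"def strategy\(.*?(?=\ndef |\Z)", generated_code, re.S)
    if match is None:
        return None
    namespace = {}
    try:
        # Caution: this executes model-generated code; use a real sandbox in practice.
        exec(match.group(0), namespace)
    except Exception:
        return None
    fn = namespace.get("strategy")
    return fn if callable(fn) else None

strategy_fn = extract_strategy(generated_code)
if strategy_fn is not None:
    board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
    assert strategy_fn(board) in ("W", "A", "S", "D")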

Example Generated Strategy

Here's an example of a winning strategy generated by this model (60% win rate):

def strategy(board):
    # Find reachable moves
    moves = []
    for i, row in enumerate(board):
        for j, val in enumerate(row):
            if val == 0:
                # Check if we can move a tile into this empty spot
                if i > 0 and board[i-1][j] != 0:
                    moves.append("S")
                if i < len(board)-1 and board[i+1][j] != 0:
                    moves.append("W")
                if j > 0 and board[i][j-1] != 0:
                    moves.append("D")
                if j < len(row)-1 and board[i][j+1] != 0:
                    moves.append("A")

    # Prefer moving towards top-left corner
    if "W" in moves: return "W"
    if "A" in moves: return "A"
    if "D" in moves: return "D"
    if "S" in moves: return "S"
    return "W"

Model Card

  • Developed by: Vincent Oh
  • Model type: Causal Language Model (GptOssForCausalLM)
  • Language: English
  • License: Apache 2.0
  • Finetuned from: unsloth/gpt-oss-20b

Intended Use

This model is designed for:

  • Generating 2048 game playing strategies
  • Research in reinforcement learning for code generation
  • Educational purposes in game AI development
  • Benchmarking LLM code generation capabilities

Limitations

  • Focused specifically on 2048 game strategies
  • Performance may vary on different board sizes (trained on 6x6 boards)
  • Generated code should be validated before execution
  • Requires GPU with 20GB+ VRAM for full model inference

Hardware Requirements

  • Recommended: NVIDIA GPU with 20GB+ VRAM (e.g., RTX 4090, A100)
  • Minimum: 32GB system RAM for CPU inference (slow)
  • Storage: 13GB for full model weights

Citation

If you use this model in your research, please cite:

@misc{gpt-oss-2048-grpo,
  author = {Vincent Oh},
  title = {GPT-OSS 2048 Strategy Generator},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/vincentoh/gpt-oss-2048-gpro}},
  note = {Fine-tuned with GRPO reinforcement learning}
}

Training Infrastructure

  • GPU: NVIDIA RTX 4070 Ti Super 16GB VRAM
  • Training Duration: ~12 hours for 1000 steps
  • Framework: Unsloth + TRL + Transformers

Contact

For questions or issues, please open an issue on the model repository or contact the author.
