GPT-OSS 2048 Strategy Generator (GRPO Fine-tuned)

This is a fine-tuned version of unsloth/gpt-oss-20b, trained with GRPO (Group Relative Policy Optimization) to generate Python strategies for the 2048 game.

Gist

https://gist.github.com/bigsnarfdude/d444c1c9e6cf5b7377df22ea97eab10d

Model Description

  • Base Model: unsloth/gpt-oss-20b (20B parameters, 4-bit quantized)
  • Training Method: GRPO reinforcement learning
  • Task: Generate Python functions that play 2048 optimally
  • Training Steps: 1000
  • Performance: 60% win rate (reaching the 2048 tile), with an average score of 27,138

Training Details

Architecture

  • LoRA Configuration:
    • Rank: 4
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Max sequence length: 768 tokens
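
A minimal sketch of this adapter setup using Unsloth's API. The rank and target modules come from this card; lora_alpha, lora_dropout, and the gradient-checkpointing flag are assumptions, as the card does not state them:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=768,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=4,                                   # LoRA rank from this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=8,                          # assumption: not stated in this card
    lora_dropout=0.0,                      # assumption
    use_gradient_checkpointing="unsloth",  # assumption
)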

GRPO Parameters

  • Learning rate: 5e-5
  • Batch size: 2
  • Weight decay: 0.01
  • Temperature: 1.0
  • Max training steps: 1000
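
In TRL, these hyperparameters map onto a GRPOConfig roughly as below. num_generations, max_completion_length, and output_dir are assumptions not stated in this card:

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs",            # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    temperature=1.0,                 # sampling temperature for rollouts
    max_steps=1000,
    num_generations=2,               # assumption: completions sampled per prompt
    max_completion_length=512,       # assumption: must fit the 768-token window
)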

Reward Functions

The model was trained with three reward components:

  1. function_works - Generated code executes without errors
  2. no_cheating - No forbidden modules or direct game state manipulation
  3. strategy_succeeds - Achieves high scores and wins games
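
Under TRL's GRPO API, each component is a callable that scores a batch of completions and returns one float per completion. A hedged sketch of what function_works might look like; only the reward's intent comes from this card, the body and scoring values are illustrative assumptions:

def function_works(completions, **kwargs):
    # Reward 1.0 when the completion defines a callable `strategy`, else 0.0.
    rewards = []
    for completion in completions:
        namespace = {}
        try:
            exec(completion, namespace)  # run in an isolated namespace
            rewards.append(1.0 if callable(namespace.get("strategy")) else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# All three rewards would then be passed together to the trainer, e.g.:
# trainer = GRPOTrainer(model=model, args=training_args, train_dataset=dataset,
#                       reward_funcs=[function_works, no_cheating, strategy_succeeds])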

Training Performance

Checkpoint   Win Rate   Avg Score   Max Tile
100          0.0%       5,674       512
900          100.0%     22,794      2048
1000         60.0%      27,138      2048

The model shows a steep learning curve: the win rate climbs from 0% at checkpoint 100 to 100% at checkpoint 900, then settles at 60% by checkpoint 1000 while the average score continues to rise.

Usage

Loading the Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="vincentoh/gpt-oss-2048-gpro",
    max_seq_length=768,
    dtype=None,
    load_in_4bit=True,
)

# Set up for inference
FastLanguageModel.for_inference(model)

Generating Strategies

prompt = """Create a Python function called `strategy` that takes a 2D board as input and returns a move direction ('W', 'A', 'S', or 'D') to play the 2048 game optimally.

Requirements:
- Input: board (list of lists representing the game state)
- Output: single character string ('W' for up, 'A' for left, 'S' for down, 'D' for right)
- Goal: Achieve the 2048 tile with high score

Example usage:
```python
board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
move = strategy(board)
```

Implement the strategy function:

"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)  # do_sample so temperature takes effect
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
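
The output is free-form text, so the strategy function usually needs to be extracted and sanity-checked before use. A rough sketch of one way to do this; the regex, namespace handling, and checks are illustrative assumptions, not part of the model:

import re

def extract_strategy(generated_code):
    # Grab the first `def strategy` block (up to the next top-level def or EOF).
    match = re.search(r"def strategy\(.*?(?=\ndef |\Z)", generated_code, re.S)
    if match is None:
        return None
    namespace = {}
    try:
        # Caution: this executes model-generated code; use a real sandbox in practice.
        exec(match.group(0), namespace)
    except Exception:
        return None
    fn = namespace.get("strategy")
    return fn if callable(fn) else None

strategy_fn = extract_strategy(generated_code)
if strategy_fn is not None:
    board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
    assert strategy_fn(board) in ("W", "A", "S", "D")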

Example Generated Strategy

Here's an example of a winning strategy generated by this model (60% win rate):

def strategy(board):
    # Find reachable moves
    moves = []
    for i, row in enumerate(board):
        for j, val in enumerate(row):
            if val == 0:
                # Check if we can move a tile into this empty spot
                if i > 0 and board[i-1][j] != 0:
                    moves.append("S")
                if i < len(board)-1 and board[i+1][j] != 0:
                    moves.append("W")
                if j > 0 and board[i][j-1] != 0:
                    moves.append("D")
                if j < len(row)-1 and board[i][j+1] != 0:
                    moves.append("A")

    # Prefer moving towards top-left corner
    if "W" in moves: return "W"
    if "A" in moves: return "A"
    if "D" in moves: return "D"
    if "S" in moves: return "S"
    return "W"

Model Card

  • Developed by: Vincent Oh
  • Model type: Causal Language Model (GptOssForCausalLM)
  • Language: English
  • License: Apache 2.0
  • Finetuned from: unsloth/gpt-oss-20b

Intended Use

This model is designed for:

  • Generating 2048 game playing strategies
  • Research in reinforcement learning for code generation
  • Educational purposes in game AI development
  • Benchmarking LLM code generation capabilities

Limitations

  • Focused specifically on 2048 game strategies
  • Performance may vary on different board sizes (trained on 6x6 boards)
  • Generated code should be validated before execution
  • Requires GPU with 20GB+ VRAM for full model inference

Hardware Requirements

  • Recommended: NVIDIA GPU with 20GB+ VRAM (e.g., RTX 4090, A100)
  • Minimum: 32GB system RAM for CPU inference (slow)
  • Storage: 13GB for full model weights

Citation

If you use this model in your research, please cite:

@misc{gpt-oss-2048-grpo,
  author = {Vincent Oh},
  title = {GPT-OSS 2048 Strategy Generator},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/vincentoh/gpt-oss-2048-gpro}},
  note = {Fine-tuned with GRPO reinforcement learning}
}

Training Infrastructure

  • GPU: NVIDIA RTX 4070 Ti Super 16GB VRAM
  • Training Duration: ~12 hours for 1000 steps
  • Framework: Unsloth + TRL + Transformers

Contact

For questions or issues, please open an issue on the model repository or contact the author.
