PPO-LunarLander-v2

A Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v2 environment from Gymnasium.

Model Details

Description

This model is a deep reinforcement learning agent trained with the PPO algorithm to land the lunar module in Gymnasium's LunarLander-v2 environment (originally from OpenAI Gym). The agent learns to control the lander's engines to achieve a safe landing while keeping fuel usage low.

  • Algorithm: PPO (Proximal Policy Optimization)
  • Framework: Stable Baselines3
  • Environment: LunarLander-v2
  • Training Timesteps: 1,000,000
  • Input: 8-dimensional state vector (x/y position, x/y velocity, angle, angular velocity, leg contact flags)
  • Output: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine); both spaces can be inspected with the snippet below
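
The observation and action spaces listed above can be checked directly with Gymnasium (a minimal sketch; it assumes gymnasium and its box2d extra are installed, as in the Installation section below):

import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.observation_space)  # expected: a Box space with shape (8,)
print(env.action_space)       # expected: Discrete(4)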

Intended Use

  • Research in Deep Reinforcement Learning
  • Benchmarking RL algorithms
  • Educational purposes (Hugging Face Deep RL Course)
  • Base model for transfer learning in similar environments

Usage

Installation

!pip install stable-baselines3 "gymnasium[box2d]" huggingface_sb3 shimmy

Load and Run the Model

from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
import gymnasium as gym

# Download model
repo_id = "ashaduzzaman/ppo-LunarLander-v2"  # Replace with your repo
filename = "ppo-LunarLander-v2.zip"
checkpoint = load_from_hub(repo_id, filename)

# Load model with compatibility settings
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}
model = PPO.load(checkpoint, custom_objects=custom_objects)

# Evaluate
from stable_baselines3.common.evaluation import evaluate_policy
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")

Training

Hyperparameters

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 16 parallel environments, matching the training configuration below
env = make_vec_env("LunarLander-v2", n_envs=16)

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    learning_rate=0.00025,
    verbose=1,
)

Training Configuration

  • Total Timesteps: 1,000,000 (see the training sketch after this list)
  • Parallel Environments: 16
  • Optimizer: Adam
  • Policy Network: 2 hidden layers (64 units each)
  • Activation: Tanh
  • Training Hardware: NVIDIA Tesla T4 GPU
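
Putting the hyperparameters and configuration above together, the run can be reproduced roughly as follows (a sketch rather than the exact training script; the checkpoint filename is illustrative):

# Train for 1,000,000 timesteps on the 16 parallel environments created above
model.learn(total_timesteps=1_000_000)
model.save("ppo-LunarLander-v2")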

Evaluation

  • Mean Reward: 257.67
  • Std Reward: 24.70
  • Success Rate: 100%
  • Avg Episode Length: 270 steps

Environmental Impact

Carbon Emissions Estimate
Training was run on Google Colab:

  • Hardware Type: NVIDIA T4 GPU
  • Hours Used: 0.5
  • Cloud Provider: Google Cloud
  • Compute Region: us-west1
  • Carbon Emitted: ~0.03 kgCO₂eq

License

MIT License - Free for academic and commercial use. See LICENSE for details.


Leaderboard Submission

The submitted score is the mean reward minus its standard deviation:

result = mean_reward - std_reward = 257.67 - 24.70 = 232.97
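
Using the variables from the evaluation snippet in the Usage section, this score can be computed directly:

result = mean_reward - std_reward
print(f"Leaderboard result: {result:.2f}")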
