PPO-LunarLander-v2

A Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v2 environment from Gymnasium.

Model Details

Description

This model is a deep reinforcement learning agent trained with the PPO algorithm to land the lunar module in Gymnasium's LunarLander-v2 environment (originally from OpenAI Gym). The agent learns to control the lander's engines to achieve a safe landing while keeping fuel usage low.

  • Algorithm: PPO (Proximal Policy Optimization)
  • Framework: Stable Baselines3
  • Environment: LunarLander-v2
  • Training Timesteps: 1,000,000
  • Input: 8-dimensional state vector (x/y position, x/y velocity, angle, angular velocity, leg contact flags)
  • Output: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine); both spaces can be inspected with the snippet below
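
The observation and action spaces listed above can be checked directly with Gymnasium (a minimal sketch; it assumes gymnasium and its box2d extra are installed, as in the Installation section below):

import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.observation_space)  # expected: a Box space with shape (8,)
print(env.action_space)       # expected: Discrete(4)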

Intended Use

  • Research in Deep Reinforcement Learning
  • Benchmarking RL algorithms
  • Educational purposes (Hugging Face Deep RL Course)
  • Base model for transfer learning in similar environments

Usage

Installation

!pip install stable-baselines3 "gymnasium[box2d]" huggingface_sb3 shimmy

Load and Run the Model

from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
import gymnasium as gym

# Download model
repo_id = "ashaduzzaman/ppo-LunarLander-v2"  # Replace with your repo
filename = "ppo-LunarLander-v2.zip"
checkpoint = load_from_hub(repo_id, filename)

# Load model with compatibility settings
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}
model = PPO.load(checkpoint, custom_objects=custom_objects)

# Evaluate
from stable_baselines3.common.evaluation import evaluate_policy
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")

Training

Hyperparameters

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 16 parallel environments, matching the training configuration below
env = make_vec_env("LunarLander-v2", n_envs=16)

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    learning_rate=0.00025,
    verbose=1,
)

Training Configuration

  • Total Timesteps: 1,000,000 (see the training sketch after this list)
  • Parallel Environments: 16
  • Optimizer: Adam
  • Policy Network: 2 hidden layers (64 units each)
  • Activation: Tanh
  • Training Hardware: NVIDIA Tesla T4 GPU
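
Putting the hyperparameters and configuration above together, the run can be reproduced roughly as follows (a sketch rather than the exact training script; the checkpoint filename is illustrative):

# Train for 1,000,000 timesteps on the 16 parallel environments created above
model.learn(total_timesteps=1_000_000)
model.save("ppo-LunarLander-v2")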

Evaluation

  • Mean Reward: 257.67
  • Std Reward: 24.70
  • Success Rate: 100%
  • Avg Episode Length: 270 steps

Environmental Impact

Carbon Emissions Estimate
Training was run on Google Colab:

  • Hardware Type: NVIDIA T4 GPU
  • Hours Used: 0.5
  • Cloud Provider: Google Cloud
  • Compute Region: us-west1
  • Carbon Emitted: ~0.03 kgCO₂eq

License

MIT License - Free for academic and commercial use. See LICENSE for details.


Leaderboard Submission

The submitted score is the mean reward minus its standard deviation:

result = mean_reward - std_reward = 257.67 - 24.70 = 232.97
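
Using the variables from the evaluation snippet in the Usage section, this score can be computed directly:

result = mean_reward - std_reward
print(f"Leaderboard result: {result:.2f}")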
