# PPO-LunarLander-v2
A Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v2 environment from Gymnasium.
## Model Details

### Description

This model is a deep reinforcement learning agent trained with the PPO algorithm to land the lunar module in Gymnasium's LunarLander-v2 environment (originally from OpenAI Gym). The agent learns to fire the lander's engines so that it touches down safely on the landing pad while keeping fuel usage low.
- Algorithm: PPO (Proximal Policy Optimization)
- Framework: Stable Baselines3
- Environment: LunarLander-v2
- Training Timesteps: 1,000,000
- Input: 8-dimensional state vector (x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags)
- Output: 4 discrete actions (do nothing, fire left orientation engine, fire main engine, fire right orientation engine)
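As a quick sanity check of these dimensions, the observation and action spaces can be inspected directly (a minimal sketch; requires the Box2D extra installed as shown in the Installation section below):

```python
import gymnasium as gym

# Inspect the LunarLander-v2 spaces described above
env = gym.make("LunarLander-v2")
print(env.observation_space)  # Box with shape (8,): the 8-dimensional state vector
print(env.action_space)       # Discrete(4): do nothing, left engine, main engine, right engine
env.close()
```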
## Intended Use
- Research in Deep Reinforcement Learning
- Benchmarking RL algorithms
- Educational purposes (Hugging Face Deep RL Course)
- Base model for transfer learning in similar environments
## Usage

### Installation

```bash
pip install stable-baselines3 "gymnasium[box2d]" huggingface_sb3 shimmy
```

The `box2d` extra is required for the LunarLander environments.
### Load and Run the Model

```python
import gymnasium as gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Download the checkpoint from the Hugging Face Hub
repo_id = "ashaduzzaman/ppo-LunarLander-v2"
filename = "ppo-LunarLander-v2.zip"
checkpoint = load_from_hub(repo_id, filename)

# Zeroed schedules keep loading compatible across Stable Baselines3 versions;
# they are not used at inference time.
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}
model = PPO.load(checkpoint, custom_objects=custom_objects)

# Evaluate the policy over 10 episodes
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
```
## Training

### Hyperparameters

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 16 parallel environments, as listed in the training configuration below
env = make_vec_env("LunarLander-v2", n_envs=16)

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    learning_rate=0.00025,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```
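After training, the policy can be written out as the zip checkpoint referenced in the usage section above (a minimal sketch; pushing it to the Hub, for example with huggingface_sb3's `package_to_hub` helper used in the Deep RL Course, is optional):

```python
# Save the trained agent; this produces the ppo-LunarLander-v2.zip file loaded earlier
model.save("ppo-LunarLander-v2")
```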
### Training Configuration
- Total Timesteps: 1,000,000
- Parallel Environments: 16
- Optimizer: Adam
- Policy Network: 2 hidden layers (64 units each)
- Activation: Tanh
- Training Hardware: NVIDIA Tesla T4 GPU
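The network described in this list matches Stable Baselines3's default MlpPolicy for PPO (two 64-unit Tanh layers for both the policy and value heads). Written out explicitly it would look roughly like the sketch below; the `policy_kwargs` values restate those defaults and are not settings taken from the training run:

```python
import torch.nn as nn

# Equivalent explicit architecture: two hidden layers of 64 units with Tanh activation
policy_kwargs = dict(
    net_arch=dict(pi=[64, 64], vf=[64, 64]),
    activation_fn=nn.Tanh,
)
# Would be passed as: PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, ...)
```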
## Evaluation

| Metric | Value |
|---|---|
| Mean Reward | 257.67 |
| Std Reward | 24.70 |
| Success Rate | 100% |
| Avg Episode Length | 270 steps |
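The reward statistics come straight from `evaluate_policy`; per-episode returns can also be requested to derive success rate and episode length figures (a minimal sketch, using the conventional ≥200 reward threshold for a solved LunarLander episode; assumes `model` and `eval_env` from the usage example above):

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

# Collect per-episode returns and lengths instead of only the mean/std
episode_rewards, episode_lengths = evaluate_policy(
    model, eval_env, n_eval_episodes=10, return_episode_rewards=True
)
success_rate = np.mean(np.array(episode_rewards) >= 200.0)  # fraction of successful landings
print(f"Success rate: {success_rate:.0%}")
print(f"Avg episode length: {np.mean(episode_lengths):.0f} steps")
```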
## Environmental Impact

### Carbon Emissions Estimate

Training was done on Google Colab:
- Hardware Type: NVIDIA T4 GPU
- Hours Used: 0.5
- Cloud Provider: Google Cloud
- Compute Region: us-west1
- Carbon Emitted: ~0.03 kgCO₂eq
## Credits

- Developed as part of the Hugging Face Deep RL Course
- Base implementation using Stable Baselines3
- Environment provided by Gymnasium
## License

MIT License. Free for academic and commercial use; see the LICENSE file for details.
## Leaderboard Submission

The Deep RL Course leaderboard score is computed as `result = mean_reward - std_reward = 257.67 - 24.70 = 232.97`.