---
library_name: stable-baselines3
tags:
- LunarLander-v2
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: 265.65 +/- 24.86
      name: mean_reward
      verified: false
license: mit
language:
- en
---

# PPO-LunarLander-v2

A Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v2 environment from Gymnasium.

## Model Details

### Description

This model is a deep reinforcement learning agent trained with the PPO algorithm to land the lunar module safely in Gymnasium's LunarLander-v2 environment (originally from OpenAI Gym). The agent learns to control the lander's engines to achieve a safe landing with minimal fuel usage.

- **Algorithm**: PPO (Proximal Policy Optimization)
- **Framework**: Stable Baselines3
- **Environment**: [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
- **Training Timesteps**: 1,000,000
- **Input**: 8-dimensional state vector (position, velocity, angle, angular velocity, leg contacts)
- **Output**: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine)

## Intended Use

- Research in deep reinforcement learning
- Benchmarking RL algorithms
- Educational purposes (Hugging Face Deep RL Course)
- Base model for transfer learning in similar environments

## Usage

### Installation

```bash
pip install stable-baselines3 gymnasium huggingface_sb3 shimmy
```

### Load and Run the Model

```python
import gymnasium as gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Download the model checkpoint from the Hugging Face Hub
repo_id = "ashaduzzaman/ppo-LunarLander-v2"  # Replace with your repo
filename = "ppo-LunarLander-v2.zip"
checkpoint = load_from_hub(repo_id, filename)

# Load the model with compatibility settings for differing SB3 versions
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}
model = PPO.load(checkpoint, custom_objects=custom_objects)

# Evaluate the agent over 10 episodes
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
```

## Training

### Hyperparameters

```python
PPO(
    policy="MlpPolicy",
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    learning_rate=0.00025,
    verbose=1,
)
```

### Training Configuration

- **Total Timesteps**: 1,000,000
- **Parallel Environments**: 16
- **Optimizer**: Adam
- **Policy Network**: 2 hidden layers (64 units each)
- **Activation**: Tanh
- **Training Hardware**: NVIDIA Tesla T4 GPU

A minimal sketch of a matching training script is included at the end of this card.

## Evaluation

| Metric             | Value     |
|--------------------|-----------|
| Mean Reward        | 257.67    |
| Std Reward         | 24.70     |
| Success Rate       | 100%      |
| Avg Episode Length | 270 steps |

## Environmental Impact

**Carbon Emissions Estimate**

Training was run on Google Colab:

- **Hardware Type**: NVIDIA T4 GPU
- **Hours Used**: 0.5
- **Cloud Provider**: Google Cloud
- **Compute Region**: us-west1
- **Carbon Emitted**: ~0.03 kgCO₂eq

## Credits

- Developed as part of the [Hugging Face Deep RL Course](https://huggingface.co/deep-rl-course)
- Base implementation using [Stable Baselines3](https://stable-baselines3.readthedocs.io/)
- Environment by [Gymnasium](https://gymnasium.farama.org/)

## License

MIT License - free for academic and commercial use. See [LICENSE](https://opensource.org/licenses/MIT) for details.
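## Reproducing Training (Sketch)

The following is a minimal sketch of how a comparable model could be trained, combining the hyperparameters and the 16 parallel environments listed above. It is not the exact script used for this checkpoint; the save path is a placeholder.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 16 parallel copies of the environment, as in the training configuration above
env = make_vec_env("LunarLander-v2", n_envs=16)

# PPO with the hyperparameters reported in this card
model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    learning_rate=0.00025,
    verbose=1,
)

# Train for the 1,000,000 timesteps reported above
model.learn(total_timesteps=1_000_000)

# Save locally; the filename matches the checkpoint on the Hub (placeholder path)
model.save("ppo-LunarLander-v2")
```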
---

**Leaderboard Submission**

`result = mean_reward - std_reward = 257.67 - 24.70 = 232.97`
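The same score can be computed directly from an evaluation run. A minimal sketch, assuming `model` and `eval_env` are already set up as in the usage example above:

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Evaluate over 10 episodes (assumes `model` and `eval_env` exist as above)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)

# The leaderboard penalizes unstable policies by subtracting the std deviation
result = mean_reward - std_reward
print(f"Leaderboard result: {result:.2f}")
```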