---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: '-0.24 +/- 0.13'
      name: mean_reward
      verified: false
---
# A2C Agent for PandaReachDense-v3

## Model Description

This repository contains a trained Advantage Actor-Critic (A2C) reinforcement learning agent for the PandaReachDense-v3 environment from the panda-gym package (simulated with PyBullet). The agent was trained with the stable-baselines3 library to perform a reaching task with a simulated Franka Emika Panda robot arm.
## Model Details
- Algorithm: A2C (Advantage Actor-Critic)
- Environment: PandaReachDense-v3 (panda-gym, PyBullet physics)
- Framework: Stable-Baselines3
- Task Type: Continuous Control
- Action Space: Continuous (7-dimensional joint control)
- Observation Space: Multi-dimensional state representation including joint positions, velocities, and target coordinates
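The exact observation and action spaces can be inspected directly by instantiating the environment. A minimal sketch, assuming panda-gym and gymnasium are installed as described in the Usage section below:

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (import registers the Panda environments)

env = gym.make("PandaReachDense-v3")

# panda-gym exposes a goal-conditioned Dict observation space
# with 'observation', 'achieved_goal', and 'desired_goal' entries
print(env.observation_space)
print(env.action_space)

env.close()
```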
## Environment Overview
PandaReachDense-v3 is a robotic manipulation task where:
- Objective: Control a 7-DOF Franka Panda robotic arm to reach target positions
- Reward Structure: Dense reward based on the distance to the target and on goal achievement (see the sketch after this list)
- Difficulty: Continuous control with high-dimensional action and observation spaces
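To illustrate the dense reward described above: in the dense Reach task the per-step reward is essentially the negative Euclidean distance between the end-effector (achieved goal) and the target (desired goal). A minimal sketch of that idea, not the environment's exact implementation:

```python
import numpy as np

def dense_reach_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray) -> float:
    """Illustrative dense reward: the closer to the target, the closer the reward is to zero."""
    return -float(np.linalg.norm(achieved_goal - desired_goal))

# Example: an end-effector about 24 cm from the target receives a reward of about -0.24
print(dense_reach_reward(np.array([0.10, 0.00, 0.20]), np.array([0.30, 0.10, 0.10])))
```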
## Performance
The trained A2C agent achieves the following performance metrics:
- Mean Reward: -0.24 ± 0.13
- Performance Context: This represents strong performance for this environment, where typical untrained baselines often achieve rewards around -3.5
- Training Stability: The relatively low standard deviation indicates consistent performance across evaluation episodes
## Performance Analysis

The achieved mean reward of -0.24 demonstrates a significant improvement over a random baseline. In PandaReachDense-v3, rewards are negative and approach zero as the agent becomes more proficient at reaching targets, so the improvement from a baseline of approximately -3.5 indicates the agent has successfully learned to:
- Navigate the robotic arm efficiently toward target positions
- Minimize unnecessary movements and energy expenditure
- Achieve consistent reaching behavior across varied target locations
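Metrics of this form can be reproduced with stable-baselines3's built-in evaluation helper. A minimal sketch, assuming `model` and `env` have been loaded as shown in the Usage section below:

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Average the deterministic policy over a number of evaluation episodes
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```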
## Usage

### Installation Requirements

```bash
pip install "stable-baselines3[extra]"
pip install huggingface-sb3
pip install panda-gym  # provides the PandaReachDense-v3 environment (and its gymnasium/pybullet dependencies)
```
### Loading and Using the Model

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (import registers the Panda environments)
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hugging Face Hub, then load it with SB3
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create the environment (render_mode="human" opens the PyBullet viewer; optional)
env = gym.make("PandaReachDense-v3", render_mode="human")

# Run the trained policy
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```
### Advanced Usage: Fine-tuning

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (import registers the Panda environments)
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download and load the pre-trained model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create the environment for fine-tuning
env = gym.make("PandaReachDense-v3")

# Continue training (fine-tuning)
model.set_env(env)
model.learn(total_timesteps=100_000)

# Save the fine-tuned model
model.save("fine_tuned_a2c_panda")
```
### Evaluation Script

```python
import gymnasium as gym
import numpy as np
import panda_gym  # noqa: F401  (import registers the Panda environments)
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub


def evaluate_model(model, env, num_episodes=10):
    """Evaluate the model over multiple episodes and report the mean reward."""
    episode_rewards = []
    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0.0
        done = False
        while not done:
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        episode_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    print("\nEvaluation Results:")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    return episode_rewards


# Load and evaluate the model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)
env = gym.make("PandaReachDense-v3")
rewards = evaluate_model(model, env, num_episodes=20)
env.close()
```
## Training Information

### Hyperparameters

The model was trained using A2C with the following key characteristics (an illustrative training sketch follows the list):
- Policy: Multi-layer perceptron (MLP) for both actor and critic networks
- Environment: PandaReachDense-v3 with dense reward shaping
- Training Framework: Stable-Baselines3
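The exact hyperparameter values are not recorded in this card. As a rough illustration (not the author's exact training script), an A2C agent for this environment is typically constructed with SB3's MultiInputPolicy, since panda-gym exposes dictionary observations:

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (import registers the Panda environments)
from stable_baselines3 import A2C

env = gym.make("PandaReachDense-v3")

# MultiInputPolicy handles the Dict observation space with MLP sub-networks;
# the timestep budget below is illustrative, not the value used for this model
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("a2c-PandaReachDense-v3")
```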
### Training Environment

- Observation Space: Continuous state representation including (a concrete example follows this list):
  - Joint positions and velocities
  - End-effector position
  - Target position
  - Distance to target
- Action Space: 7-dimensional continuous control (joint torques/positions)
- Reward Function: Dense reward based on distance to target with sparse completion bonus
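For a concrete example of this state representation, reset the environment and inspect the returned dictionary (a quick sketch; key names follow panda-gym's goal-conditioned convention):

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (import registers the Panda environments)

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# The observation is a dict of NumPy arrays: robot state plus achieved/desired goal positions
for key, value in obs.items():
    print(key, value.shape)

env.close()
```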
## Limitations and Considerations
- Environment Specificity: Model is specifically trained for PandaReachDense-v3 and may not generalize to other robotic tasks
- Simulation Gap: Trained in simulation; real-world deployment would require domain adaptation
- Deterministic Evaluation: Performance metrics based on deterministic policy evaluation
- Hardware Requirements: Real-time inference requires modest computational resources
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{a2c_panda_reach_2024,
  title={A2C Agent for PandaReachDense-v3},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/a2c-PandaReachDense-v3}}
}
```
## License
This model is distributed under the MIT License. See the repository for full license details.