A2C Agent for PandaReachDense-v3

Model Description

This repository contains a trained Advantage Actor-Critic (A2C) reinforcement learning agent for the PandaReachDense-v3 environment from panda-gym, a PyBullet-based robotics suite built on Gymnasium. The agent was trained with the Stable-Baselines3 library to perform reaching tasks with a simulated Franka Emika Panda robot arm.

Model Details

  • Algorithm: A2C (Advantage Actor-Critic)
  • Environment: PandaReachDense-v3 (panda-gym, PyBullet physics)
  • Framework: Stable-Baselines3
  • Task Type: Continuous Control
  • Action Space: Continuous (3-dimensional end-effector displacement under the default control mode; 7-dimensional under joint control)
  • Observation Space: Dictionary observation containing the end-effector state (position and velocity), the achieved goal, and the desired goal (target position)

Environment Overview

PandaReachDense-v3 is a robotic manipulation task where:

  • Objective: Control a 7-DOF Franka Panda robotic arm to reach target positions
  • Reward Structure: Dense reward based on the distance between the end-effector and the target position
  • Difficulty: Goal-conditioned continuous control
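
As a quick sanity check of the setup (assuming the panda-gym package, which registers the Panda environments on top of PyBullet and Gymnasium), the environment and its spaces can be inspected as follows:

import gymnasium as gym
import panda_gym  # importing panda_gym registers PandaReachDense-v3

env = gym.make("PandaReachDense-v3")

# Dict observation with "observation", "achieved_goal" and "desired_goal" keys
print(env.observation_space)
# Continuous action space (end-effector displacement under the default control mode)
print(env.action_space)

env.close()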

Performance

The trained A2C agent achieves the following performance metrics:

  • Mean Reward: -0.24 ± 0.13
  • Performance Context: This represents strong performance for this environment, where typical untrained baselines often achieve rewards around -3.5
  • Training Stability: The relatively low standard deviation indicates consistent performance across evaluation episodes
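
For reference, metrics in this mean ± standard deviation form are typically computed with the evaluate_policy helper from Stable-Baselines3. A minimal sketch, assuming model and env have already been loaded as shown in the Usage section below:

from stable_baselines3.common.evaluation import evaluate_policy

# Run a fixed number of evaluation episodes with the deterministic policy
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=10, deterministic=True
)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")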

Performance Analysis

The achieved mean reward of -0.24 demonstrates significant improvement over random baselines. In the PandaReachDense-v3 environment, rewards are typically negative and approach zero as the agent becomes more proficient at reaching targets. The substantial improvement from the baseline of approximately -3.5 indicates the agent has successfully learned to:

  • Navigate the robotic arm efficiently toward target positions
  • Minimize unnecessary movements and energy expenditure
  • Achieve consistent reaching behavior across varied target locations

Usage

Installation Requirements

pip install stable-baselines3[extra]
pip install huggingface-sb3
pip install panda-gym

Loading and Using the Model

import gymnasium as gym
import panda_gym  # registers the Panda environments
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub (returns a local file path)
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip"
)

# Load the trained model
model = A2C.load(checkpoint)

# Create the environment (render_mode="human" opens the PyBullet viewer)
env = gym.make("PandaReachDense-v3", render_mode="human")

# Run the agent
obs, info = env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()

Advanced Usage: Fine-tuning

import gymnasium as gym
import panda_gym  # registers the Panda environments
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download and load the pre-trained model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip"
)
model = A2C.load(checkpoint)

# Create environment for fine-tuning
env = gym.make("PandaReachDense-v3")

# Continue training (fine-tuning)
model.set_env(env)
model.learn(total_timesteps=100000)

# Save the fine-tuned model
model.save("fine_tuned_a2c_panda")

Evaluation Script

import gymnasium as gym
import numpy as np
import panda_gym  # registers the Panda environments
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

def evaluate_model(model, env, num_episodes=10):
    """Evaluate the model performance over multiple episodes"""
    episode_rewards = []

    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0.0
        done = False

        while not done:
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += float(reward)
            done = terminated or truncated

        episode_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    print(f"\nEvaluation Results:")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")

    return episode_rewards

# Download and load the model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip"
)
model = A2C.load(checkpoint)

env = gym.make("PandaReachDense-v3")
rewards = evaluate_model(model, env, num_episodes=20)
env.close()

Training Information

Hyperparameters

The model was trained using A2C with the following key characteristics:

  • Policy: MultiInputPolicy (multi-layer perceptron actor and critic over the concatenated dictionary observation)
  • Environment: PandaReachDense-v3 with dense reward shaping
  • Training Framework: Stable-Baselines3
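
The exact hyperparameter values used for this checkpoint are not recorded here. As an illustration only, a comparable agent can be trained from scratch with Stable-Baselines3 defaults along these lines (MultiInputPolicy is required because the environment returns dictionary observations; the timestep budget below is an assumption, not the original training budget):

import gymnasium as gym
import panda_gym  # registers the Panda environments
from stable_baselines3 import A2C

env = gym.make("PandaReachDense-v3")

# MultiInputPolicy = MLP actor/critic over the concatenated Dict observation;
# all hyperparameters are library defaults, not the values used for this checkpoint
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("a2c-PandaReachDense-v3")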

Training Environment

  • Observation Space: Dictionary observation containing:
    • "observation": end-effector position and velocity
    • "achieved_goal": current end-effector position
    • "desired_goal": target position
  • Action Space: Continuous end-effector displacement (3-dimensional under the default control mode)
  • Reward Function: Dense reward based on the distance between the end-effector and the target (see the sketch below)
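
For the dense Reach variant, the per-step reward is essentially the negative Euclidean distance between the achieved goal (current end-effector position) and the desired goal (target position). A minimal sketch of that computation from a single observation:

import numpy as np

def dense_reach_reward(obs: dict) -> float:
    """Approximate per-step reward: negative end-effector-to-target distance."""
    achieved = np.asarray(obs["achieved_goal"])
    desired = np.asarray(obs["desired_goal"])
    return float(-np.linalg.norm(achieved - desired))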

Limitations and Considerations

  • Environment Specificity: Model is specifically trained for PandaReachDense-v3 and may not generalize to other robotic tasks
  • Simulation Gap: Trained in simulation; real-world deployment would require domain adaptation
  • Deterministic Evaluation: Performance metrics based on deterministic policy evaluation
  • Hardware Requirements: Real-time inference requires modest computational resources

Citation

If you use this model in your research, please cite:

@misc{a2c_panda_reach_2024,
  title={A2C Agent for PandaReachDense-v3},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/a2c-PandaReachDense-v3}}
}

License

This model is distributed under the MIT License. See the repository for full license details.
