In reinforcement learning (RL), multiple optimal policies may exist that attain the optimal value function. For example, when learning a walking behavior at a given speed, there are often infinitely many walking styles that achieve the specified speed. However, most existing RL methods are limited to recovering only one of these optimal policies, and which one is obtained is essentially arbitrary. In this study, we propose a method that learns diverse solutions by maximizing state-action-based mutual information. While the left video shows the results in the online RL setting, we also propose an algorithm for offline RL.
T. Osa and T. Harada. Discovering Multiple Solutions from a Single Task in Offline Reinforcement Learning. Proceedings of the International Conference on Machine Learning (ICML), 2024, to appear.
[arXiv] [website]
T. Osa, V. Tangkaratt, and M. Sugiyama. Discovering Diverse Solutions in Deep Reinforcement Learning by Maximizing State-Action-Based Mutual Information. Neural Networks, Vol. 152, pp. 90-104, 2022.
[arXiv] [paper] [website]
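As a rough illustration of the diversity objective, the sketch below shows a mutual-information bonus of the kind described above, with a latent-conditioned policy and a learned discriminator. The network sizes, reward weighting, and training loop are illustrative assumptions and do not reproduce the exact objective in the papers.

```python
# Illustrative sketch: latent-conditioned policy with a state-action MI bonus.
# Sizes and the discriminator architecture are placeholders, not the paper's.
import torch
import torch.nn as nn

state_dim, action_dim, n_latents = 17, 6, 8    # illustrative dimensions

discriminator = nn.Sequential(                 # approximates q(z | s, a)
    nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
    nn.Linear(256, n_latents),
)

def mi_bonus(state, action, z_index):
    """Lower bound on I(Z; S, A): log q(z|s,a) - log p(z), with p(z) uniform."""
    logits = discriminator(torch.cat([state, action], dim=-1))
    log_q = torch.log_softmax(logits, dim=-1)[..., z_index]
    log_p = -torch.log(torch.tensor(float(n_latents)))
    return log_q - log_p

# During training, the reward is augmented as r_task + beta * mi_bonus(s, a, z),
# and the discriminator is trained to predict z from the visited (s, a) pairs.
```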
The objective function used in trajectory optimization is often non-convex and can have an infinite set of local optima. In such cases, there are diverse ways to perform a given task. Although a few methods exist for finding multiple solutions in motion planning, they are limited to generating a finite set of solutions. To address this issue, we developed an optimization method that learns an infinite set of solutions in trajectory optimization. In our framework, diverse solutions are obtained by learning latent representations of the solutions.
T. Osa. Motion Planning by Learning the Solution Manifold in Trajectory Optimization.
The International Journal of Robotics Research (IJRR), Vol. 41, No. 3, pp. 291-311, 2022.
[arXiv] [paper]
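The sketch below illustrates, under simplified assumptions, how a latent-conditioned decoder can represent a continuum of trajectories: every sampled latent is decoded into a trajectory and trained toward low cost. The cost here is a placeholder smoothness term; the actual objective in the paper also encourages the decoded solutions to be diverse.

```python
# Illustrative sketch: a decoder maps a latent z to trajectory parameters so
# that every z decodes to a low-cost trajectory. The cost is a placeholder.
import torch
import torch.nn as nn

latent_dim, n_waypoints, dof = 2, 20, 7        # illustrative sizes

decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, n_waypoints * dof),
)

def trajectory_cost(traj):
    # placeholder: smoothness only; a real cost also encodes the task
    return ((traj[:, 1:] - traj[:, :-1]) ** 2).sum(dim=(1, 2)).mean()

opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
for _ in range(1000):
    z = torch.randn(64, latent_dim)                    # sample latents
    traj = decoder(z).view(-1, n_waypoints, dof)       # decode trajectories
    loss = trajectory_cost(traj)
    opt.zero_grad(); loss.backward(); opt.step()

# After training, sweeping z continuously yields a continuum of trajectories.
```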
Excavation, one of the most frequently performed tasks in construction, often poses dangers to human operators. To reduce these risks and address the problem of workforce shortages, automation of excavation is essential. In this study, we investigate QT-Opt, a variant of Q-learning for continuous action spaces, for learning an excavation task from depth images. Inspired by virtual adversarial training in supervised learning, we propose a regularization method that uses virtual adversarial samples to reduce the overestimation of Q-values in a Q-learning algorithm.
T. Osa and M. Aizawa, Deep Reinforcement Learning with Adversarial Training for Automated Excavation using Depth Images,
IEEE Access, Vol. 10, pp. 4523-4535, 2022.
[paper (open access)]
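The following is a rough sketch of a virtual-adversarial-style regularizer for a Q-function, in the spirit of the regularization described above: a small perturbation that changes the Q-value the most is found with one gradient step, and the Q-function is penalized for its sensitivity to it. The function names and the way the penalty enters the loss are illustrative, not the exact formulation in the paper.

```python
# Illustrative sketch of a virtual-adversarial smoothness penalty on Q(s, a).
# q_net is assumed to be a callable mapping (state, action) batches to Q-values.
import torch

def virtual_adversarial_penalty(q_net, state, action, eps=0.05):
    delta = (torch.randn_like(state) * 1e-3).requires_grad_(True)
    diff = (q_net(state + delta, action) - q_net(state, action)).pow(2).mean()
    grad, = torch.autograd.grad(diff, delta)
    adv = eps * grad / (grad.norm() + 1e-8)        # adversarial direction
    return (q_net(state + adv, action) - q_net(state, action)).pow(2).mean()

# The Q-learning loss is then augmented, e.g. td_loss + lam * penalty.
```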
T. Osa, N. Osajima, M. Aizawa, T. Harada, Learning Adaptive Policies for Autonomous Excavation Under Various Soil Conditions by Adversarial Domain Sampling,
IEEE Robotics and Automation Letters, Vol. 8, No. 9, pp. 5536-5543, 2023.
[pdf] [publisher website]
Existing motion planning methods often have two drawbacks: 1) goal configurations need to be specified by a user, and 2) only a single solution is generated under a given condition. However, it is not trivial for a user to specify the optimal goal configuration in practice. In addition, the objective function used in the trajectory optimization is often non-convex, and it can have multiple solutions that achieve comparable costs. In this study, we propose a framework that determines multiple trajectories that correspond to the different modes of the cost function.
T. Osa. Multimodal Trajectory Optimization for Motion Planning,
The International Journal of Robotics Research (IJRR), Vol. 39, No. 8, pp. 983-1001, 2020.
[arXiv]
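As a toy illustration of this multimodality (not the method in the paper), the snippet below shows a one-dimensional waypoint cost in which an obstacle can be avoided on either side, producing two local optima with comparable cost.

```python
# Toy illustration (not the method in the paper): an obstacle between start and
# goal can be passed on either side, giving two local optima of comparable cost.
import numpy as np

def cost(waypoint_y):
    obstacle = np.exp(-waypoint_y ** 2 / 0.05)     # high cost near the obstacle
    detour = waypoint_y ** 2                       # prefer short detours
    return obstacle + 0.5 * detour

ys = np.linspace(-1.0, 1.0, 2001)
costs = cost(ys)
left = ys[np.argmin(np.where(ys < 0, costs, np.inf))]
right = ys[np.argmin(np.where(ys > 0, costs, np.inf))]
print(left, right, cost(left), cost(right))        # two symmetric modes
```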
Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying a hierarchical policy structure that enhances the performance of RL is challenging. We propose an HRL method that learns a latent variable of a hierarchical policy via mutual information maximization.
T. Osa, V. Tangkaratt, and M. Sugiyama. Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization,
International Conference on Learning Representations (ICLR), 2019.
[arXiv]
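The sketch below conveys the general flavor of this idea under simplifying assumptions: a posterior network assigns states to latent options, and its parameters are trained to maximize a mutual-information term computed on advantage-weighted samples. The weighting scheme and MI estimator here are illustrative and differ from the exact objective in the paper.

```python
# Illustrative sketch: a posterior network assigns states to discrete options,
# trained to maximize an advantage-weighted MI term I(Z; S) = H(Z) - H(Z|S).
# Sizes, the weighting, and the MI estimator are placeholders.
import torch
import torch.nn as nn

state_dim, n_options = 17, 4                        # illustrative sizes
posterior = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                          nn.Linear(128, n_options))

def weighted_mi_loss(states, advantages):
    w = torch.softmax(advantages, dim=0)            # emphasize high-advantage samples
    p_z_given_s = torch.softmax(posterior(states), dim=-1)
    p_z = (w.unsqueeze(-1) * p_z_given_s).sum(dim=0)                 # weighted marginal
    h_z = -(p_z * torch.log(p_z + 1e-8)).sum()                       # H(Z)
    h_z_s = -(w * (p_z_given_s * torch.log(p_z_given_s + 1e-8)).sum(-1)).sum()  # H(Z|S)
    return -(h_z - h_z_s)                           # minimize the negative MI
```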
We developed a motion planning framework that combines the advantages of optimization-based and demonstration-based methods. A distribution of trajectories demonstrated by human experts is used to guide the trajectory optimization process in our framework. The resulting trajectory maintains the demonstrated behaviors, which are essential for performing the task successfully, while adapting the trajectory to avoid obstacles.
T. Osa, A. M. Ghalamzan E., R. Stolkin, R. Lioutikov, J. Peters, and G. Neumann. Guiding Trajectory Optimization by Demonstrated Distributions, IEEE Robotics and Automation Letters (RA-L), Vol. 2, No. 2, pp. 819-826, 2017.
[paper]
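A minimal sketch of how the demonstrated distribution can guide the optimization is shown below: the optimized cost combines a placeholder task term with the negative log-likelihood of the trajectory under a Gaussian fitted to the demonstrations. The concrete cost terms, trajectory representation, and optimizer in the paper differ.

```python
# Illustrative sketch: the trajectory cost combines a placeholder task term with
# the negative log-likelihood under a Gaussian fitted to the demonstrations.
import numpy as np

demos = np.random.randn(30, 50)        # placeholder: 30 demos, 50 waypoints (1-D)
mu = demos.mean(axis=0)
sigma = demos.std(axis=0) + 1e-3

def task_cost(traj):
    return np.exp(-(traj - 0.5) ** 2 / 0.01).sum()   # placeholder obstacle term

def guided_cost(traj, w_demo=0.1):
    nll = 0.5 * (((traj - mu) / sigma) ** 2).sum()   # -log N(traj | mu, sigma^2)
    return task_cost(traj) + w_demo * nll            # demonstrations guide the optimum
```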
We developed a framework for hierarchical reinforcement learning of grasping policies. In our framework, the lower-level hierarchy learns multiple grasp types, and the upper-level hierarchy learns a policy that selects from the learned grasp types according to a point cloud of a new object. Through experiments, we show that our approach autonomously constructs a grasp dataset and learns grasping from it. The experimental results show that our approach learns multiple grasping policies and generalizes the learned grasps by using local point cloud information.
T. Osa, J. Peters, G. Neumann. Experiments with Hierarchical Reinforcement Learning of Multiple Grasping Policies, Proceedings of the International Symposium on Experimental Robotics (ISER), 2016.
[paper]
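The snippet below is a purely schematic sketch of this two-level structure, with placeholder feature extraction and scoring: the upper level scores the learned grasp types for a given point cloud, and the selected lower-level policy produces the grasp.

```python
# Schematic sketch of the two-level structure; all components are placeholders.
import numpy as np

class GraspTypePolicy:
    """Lower level: one policy per grasp type, mapping a point cloud to a grasp."""
    def select_grasp(self, point_cloud):
        return point_cloud.mean(axis=0)              # placeholder "grasp pose"

def upper_level_scores(point_cloud, n_types):
    return np.random.rand(n_types)                   # placeholder learned values

grasp_policies = [GraspTypePolicy() for _ in range(3)]   # e.g., three grasp types

def hierarchical_grasp(point_cloud):
    scores = upper_level_scores(point_cloud, len(grasp_policies))
    best = int(np.argmax(scores))                    # upper level: choose grasp type
    return grasp_policies[best].select_grasp(point_cloud)   # lower level: grasp
```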
This study presents a framework for online trajectory planning and force control by learning from demonstrations. By leveraging demonstrations under various conditions, we model the conditional distribution of the trajectories given the task condition. This scheme enables the trajectories of spatial motion and contact force to be generalized to new conditions in real time. In addition, we propose a force tracking controller that robustly and stably tracks the planned contact-force trajectory by learning the spatial motion and contact force simultaneously.
T. Osa, N. Sugita, and M. Mitsuishi. Online Trajectory Planning and Force Control for Automation of Surgical Tasks, IEEE Transactions on Automation Science and Engineering, 2017.
[paper]
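As a simplified sketch of the conditioning step, the code below fits a joint Gaussian over (task condition, trajectory parameters) and computes the conditional mean for a new condition; the actual framework models both spatial motion and contact force and uses a richer model learned from demonstrations.

```python
# Illustrative sketch: condition a joint Gaussian over (task condition, trajectory
# parameters) on a new condition and use the conditional mean as the plan.
import numpy as np

data = np.random.randn(100, 12)        # placeholder demos: 2-D condition + 10-D params
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

d_c = 2                                # dimensionality of the task condition
mu_c, mu_t = mu[:d_c], mu[d_c:]
S_cc, S_tc = cov[:d_c, :d_c], cov[d_c:, :d_c]

def plan_trajectory(condition):
    """Conditional mean E[trajectory | condition] of the joint Gaussian."""
    return mu_t + S_tc @ np.linalg.solve(S_cc, condition - mu_c)

traj = plan_trajectory(np.array([0.1, -0.2]))   # generalize to a new condition
```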