Manipulation of real-world objects is one of the most important open challenges in machine intelligence.
As humans, we can easily grasp and pick up objects we’ve never seen before, but even the most advanced robotic arms struggle to manipulate objects they weren’t trained to handle.
I’ve been working with the engineering department at San Jose State University (SJSU) to develop these techniques for robotics, along with new manufacturing and design methods for cheaper, easier-to-use medical devices.
Specifically, we are focusing on applying our work to prosthetics, but these techniques can scale across other medical assistive devices that people use on a day-to-day basis.
Problem of Generalization
Reinforcement learning (RL) offers a promising avenue for tackling this problem, but most current work on reinforcement learning masters only individual skills. To achieve the kind of generalization that real-world manipulation demands, we focus on scalable learning with off-policy algorithms and study this question in the context of a specific problem: grasping.
Using V-REP’s motion planning and object grasping scene, I simulated a Jaco arm for manipulation tasks, including dynamic particles, motion planning, and inverse kinematics. The model is trained and run on an algorithm I reproduced that closely follows QT-Opt. Check it out!
This approach was able to achieve a 78% grasp success rate on previously unseen objects and 84% on household objects that are moved during the grasp attempt.
I’ll be diving deeper into how the manipulation part of the algorithm works, with brief highlights on dynamic particles, inverse kinematics, and motion planning!
Q-learning and CNNs
Generalization requires diverse data, but collecting it is difficult with on-policy algorithms, which evaluate and improve the same policy that is being used to select actions: such an agent cannot learn from experience generated by other policies, for example random exploration or older versions of itself.
Q-learning is an off-policy learner: it can learn from actions taken by a different behavior policy (for example, random actions), so no explicit policy is needed during data collection. It seeks to learn the value of the best action to take in the current state, and thereby a policy that maximizes total reward.
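To make the off-policy idea concrete, here is a minimal tabular Q-learning sketch on a toy 1-D grid. The grid, rewards, and hyperparameters are purely illustrative, not part of the grasping setup:

```python
import numpy as np

# Toy 1-D grid: states 0..4, goal at state 4; actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, float(s2 == n_states - 1)  # reward 1 only at the goal

for _ in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy behavior policy; the update is still off-policy
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r = step(s, a)
        # Q-learning update: bootstrap with the max over next actions
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

greedy = Q.argmax(axis=1)[:-1]  # greedy policy should go right everywhere
```

Even though the data is collected with a partly random policy, the learned greedy policy heads straight for the goal.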
This approach specifically is an off-policy framework based on a continuous generalization of Q-learning, where you train only a Q-function and induce a policy implicitly by maximizing that Q-function.
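The QT-Opt paper performs this maximization over continuous actions with the cross-entropy method (CEM) rather than a learned actor network. Below is a minimal sketch of that inner maximization, with a toy quadratic standing in for the real Q-function; the population sizes and iteration counts are illustrative:

```python
import numpy as np

def cem_argmax(q_func, dim, iters=4, pop=64, elite=6, seed=0):
    """Approximately maximize q_func over continuous actions with the
    cross-entropy method: sample, keep the best, refit a Gaussian."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([q_func(a) for a in samples])
        elites = samples[np.argsort(scores)[-elite:]]  # best candidates
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy Q-function with its maximum at a = (0.5, -0.3)
target = np.array([0.5, -0.3])
q = lambda a: -np.sum((a - target) ** 2)
best = cem_argmax(q, dim=2)
```

The implicit policy is then simply "run CEM on the Q-function at the current state and execute the best action found."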
QT-Opt is a reinforcement learning algorithm that allows robots to improve their grasping capability after watching hundreds of thousands of real-world grasping examples. At the core is a CNN which represents the robot’s grasping logic (its Q function).
QT-Opt successfully trains a neural network in a closed-loop system, allowing the robot to learn useful techniques like regrasping in case of a failed grasp, pre-grasp manipulation of objects, grasping in clutter, and handling dynamic objects. This is the original paper for QT-Opt that is referenced throughout this article.
Markov Decision Process (MDP)
This closed-loop vision-based control framework is based on a general formulation of robotic manipulation as a Markov Decision Process (MDP).
At each time step, the policy observes the image from the robot’s camera and chooses a gripper command. The grasping task is defined simply by providing a reward to the learner during data collection.
The state observation includes the robot’s current camera observation, in the form of an RGB image. The state also includes the gripper status, a binary indicator of whether the gripper is open or closed.
The action consists of a vector in Cartesian space indicating the desired change in the gripper position.
A successful grasp results in a reward of 1, and a failed grasp a reward of 0. A grasp is considered successful if the robot holds an object above a certain height at the end of the episode.
Success is determined by using a background subtraction test after dropping the picked object. This can be seen in the following grasp execution and termination condition flow:
The manipulator is controlled by creating a closed loop around visual input and gripper commands. At the end of an episode, grasp success is detected and the sequence of state-action pairs is returned to be fed back into the learning pipeline or stored offline. Depending on the executed policy, the termination action is either learned or decided heuristically.
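The closed loop above can be sketched as a single episode rollout. `FakeEnv`, `State`, and the `policy` below are hypothetical stand-ins for the simulator and the learned Q-function policy, not the real interfaces:

```python
from dataclasses import dataclass

@dataclass
class State:
    image: object         # current RGB camera observation
    gripper_closed: bool  # binary gripper status

def run_episode(env, policy, max_steps=20):
    """Closed-loop grasping episode: observe, act, and collect
    state-action pairs until termination."""
    transitions, s = [], env.reset()
    for _ in range(max_steps):
        a = policy(s)                 # gripper motion command
        s2, terminated = env.step(a)
        transitions.append((s, a))
        s = s2
        if terminated:                # learned or heuristic termination
            break
    reward = 1.0 if env.grasp_success() else 0.0  # sparse binary reward
    return transitions, reward

class FakeEnv:
    """Tiny stand-in environment that 'succeeds' after 3 steps."""
    def reset(self):
        self.t = 0
        return State(image=None, gripper_closed=False)
    def step(self, a):
        self.t += 1
        return State(image=None, gripper_closed=self.t >= 3), self.t >= 3
    def grasp_success(self):
        return self.t >= 3

traj, r = run_episode(FakeEnv(), policy=lambda s: "close" if s.gripper_closed else "move")
```

The returned `(state, action)` pairs plus the terminal reward are exactly what gets pushed back into the learning pipeline or stored offline.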
Q-Function representation as CNN
The Q-function is represented by a large convolutional neural network, where the image is provided as an input into the bottom of the convolutional stack, and the action, gripper status, and distance to the floor are fed into the middle of the stack.
The input image is processed through several convolutional layers. The action and the vector-valued state variables (gripper status and height) are processed by several fully connected layers, then tiled over the width and height dimensions of the convolutional map and added to it. The resulting convolutional map is further processed by a number of convolutional layers. Finally, the output is gated through a sigmoid so that the Q-values are always in the range [0, 1].
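The merge of image features with the action/state vector can be sketched in plain NumPy. The shapes, random weights, and 7-D vector below are illustrative, not the paper’s actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in image features from the conv stack: an H x W x C map
conv_map = rng.standard_normal((8, 8, 64))

# Action + gripper status + height as one vector (7-D here, illustrative)
action_state = rng.standard_normal(7)

# A fully connected layer (random weights here) lifts the vector to C channels
W = rng.standard_normal((7, 64)) * 0.1
fc_out = np.tanh(action_state @ W)         # shape (64,)

# Tile over the spatial dims and add to the conv map; more convs would follow
merged = conv_map + fc_out[None, None, :]  # broadcasts to (8, 8, 64)

# Final sigmoid gate keeps the scalar Q-value in [0, 1]
q_value = 1.0 / (1.0 + np.exp(-merged.mean()))
```

The tiling trick lets every spatial location of the image features "see" the candidate action, so a single forward pass scores one (image, action) pair.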
We had to train it on a large and diverse set of objects. The dataset that I used to train was collected during multiple separate experiments, and each experiment reused the data from the previous one. The policy used was randomized but biased toward reasonable grasps, and later was switched to using the learned QT-Opt policy once it reached a high enough success rate.
Researchers have been working on a distributional enhancement of QT-Opt called Quantile QT-Opt (Q2-Opt). This approach achieved a very high grasping success rate, while also being more sample efficient. To learn more, you can read this blog.
Inverse Kinematics (IK)
Inverse kinematics is how machines calculate exactly how to move their joints to reach a desired target. For grasping, this means computing how the arm must move to reach the grasp position proposed for an object.
Traditionally, this is done through fairly involved math and takes a lot of computational power, although today various packages, especially in ROS, do these calculations automatically. As an example, V-REP (the simulation software I use) has built-in kinematics functionality accessible via the GUI.
The grasping end of a robot arm is the end-effector. A robot configuration is a list of joint positions, restricted by the joint limits and degrees of freedom of the robot model.
Given the desired end-effector pose for grasping an object, IK determines an appropriate joint configuration that moves the end-effector to the target pose.
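For intuition, a planar 2-link arm admits a closed-form IK solution. The sketch below picks the elbow-down branch and verifies it with forward kinematics; real arms like the Jaco have more joints and need numerical solvers:

```python
import math

def two_link_ik(x, y, l1=1.0, l2=1.0):
    """Analytic IK for a planar 2-link arm: joint angles placing the
    end-effector at (x, y). Picks the elbow-down solution."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("target out of reach")
    t2 = math.acos(c2)
    t1 = math.atan2(y, x) - math.atan2(l2 * math.sin(t2), l1 + l2 * math.cos(t2))
    return t1, t2

def forward(t1, t2, l1=1.0, l2=1.0):
    """Forward kinematics, used here to verify the IK solution."""
    x = l1 * math.cos(t1) + l2 * math.cos(t1 + t2)
    y = l1 * math.sin(t1) + l2 * math.sin(t1 + t2)
    return x, y

t1, t2 = two_link_ik(1.2, 0.8)
```

Mirroring the sign of `t2` gives the elbow-up branch, which is one concrete example of IK having multiple solutions.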
Once the robot’s joint angles are calculated using the inverse kinematics, a motion profile can be generated using the Jacobian matrix to move the end-effector from the initial to the target pose. The Jacobian matrix helps define a relationship between the robot’s joint parameters and the end-effector velocities.
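The Jacobian relationship v = J(q) · qdot can be checked numerically on a planar 2-link arm; the angles and joint rates below are arbitrary examples, not taken from the Jaco model:

```python
import numpy as np

def jacobian_2link(t1, t2, l1=1.0, l2=1.0):
    """Jacobian of a planar 2-link arm: maps joint velocities
    to end-effector velocities (v = J @ qdot)."""
    return np.array([
        [-l1 * np.sin(t1) - l2 * np.sin(t1 + t2), -l2 * np.sin(t1 + t2)],
        [ l1 * np.cos(t1) + l2 * np.cos(t1 + t2),  l2 * np.cos(t1 + t2)],
    ])

def fk(t1, t2, l1=1.0, l2=1.0):
    """Forward kinematics of the same arm."""
    return np.array([l1 * np.cos(t1) + l2 * np.cos(t1 + t2),
                     l1 * np.sin(t1) + l2 * np.sin(t1 + t2)])

t = np.array([0.4, 0.9])        # example joint angles
qdot = np.array([0.1, -0.05])   # example joint velocities
v = jacobian_2link(*t) @ qdot   # end-effector velocity from the Jacobian

eps = 1e-6
v_fd = (fk(*(t + eps * qdot)) - fk(*t)) / eps  # finite-difference check
```

The finite-difference velocity agrees with the Jacobian prediction, which is exactly the relationship a motion profile generator relies on.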
Generally, robots have multiple solutions to inverse kinematics. If you’d like to dive deeper into this topic, check out this article I’ve written on how we can apply RL to solve IK problems.
Dynamic Particles & Motion Planning
Motion planning of a robot sounds like a simple task, but it’s one of the harder problems in robotics that researchers are still working on.
A motion planner is an algorithm that automatically plans the route, trajectory or path that the robot will travel to get from Point A to Point B.
In real life, these algorithms speed up the programming process when the robot is in a complex environment (where there are lots of obstacles in its way). Instead of planning every single move ourselves, the motion planner can automatically create one or more good routes for the robot to follow to reach the desired place or goal.
V-REP offers path/motion planning functionality via a plugin wrapping the OMPL library.
This is accomplished with these key steps:
- Define the start and goal states; if the path planning object is a serial manipulator, an end-effector position is often provided instead of an explicit goal state.
- Create a new path planning task and select an algorithm.
- Create the required state space and specify which pairs of entities are not allowed to collide.
- Compute one or several paths, and destroy the path planning task once it is complete.
Often, path planning is used in combination with inverse kinematics.
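Sampling-based planners like those in OMPL follow roughly the steps above. Here is a minimal 2-D RRT sketch with a single circular obstacle; this is an illustration of the idea, not V-REP’s actual plugin API, and the workspace bounds and tuning constants are made up:

```python
import math
import random

def rrt_2d(start, goal, obstacles, step=0.3, iters=4000, seed=0):
    """Minimal 2-D RRT: grow a tree from start by extending toward
    random samples, rejecting points inside circular obstacles."""
    rng = random.Random(seed)
    nodes, parent = [start], {0: None}

    def collides(p):
        return any(math.dist(p, c) < r for c, r in obstacles)

    for _ in range(iters):
        # goal-biased sampling in a 5x5 workspace
        sample = goal if rng.random() < 0.1 else (rng.uniform(0, 5), rng.uniform(0, 5))
        i = min(range(len(nodes)), key=lambda j: math.dist(nodes[j], sample))
        nx, ny = nodes[i]
        d = math.dist(nodes[i], sample)
        if d == 0:
            continue
        # extend a fixed step toward the sample
        new = (nx + step * (sample[0] - nx) / d, ny + step * (sample[1] - ny) / d)
        if collides(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) < step:       # close enough: extract the path
            path, k = [goal], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None

path = rrt_2d((0.5, 0.5), (4.5, 4.5), obstacles=[((2.5, 2.5), 0.8)])
```

A planner like this returns a collision-free sequence of waypoints, which IK then converts into joint configurations for the arm to follow.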
Moving Forward: Testing
So far we have only tested these approaches in a simulation environment, and although they seem promising, our next goal is to test them in real life. We will be working on making adaptations for the prosthetic arm and then running initial tests on it, first using a generative grasping CNN and then the Q-learning framework outlined here.
Our final results including the mechanical design/improvements will be published in a paper very soon. Stay tuned!