Robotic hand that can see for itself

Using deep learning to build a cheaper, vision-guided prosthetic arm.

May 20, 2020

There are about 2 million amputees in the U.S. alone, and that number is expected to nearly double to 3.6 million by 2050. However, current prosthetics aren’t accurate in their grasp and grip control. On cheaper prosthetics, grasp (the ability to interact with real-world objects) must be controlled manually by the user, which requires a lot of training and is still often not very accurate.

Although there are more automated ways to control grasp, these prosthetics can cost $100,000 or more, putting them out of reach for most users.

To solve this problem, I’ve been working alongside mentors from Kindred.ai on building a cheaper, more automated and more accurate grasping program for a robotic prosthetic arm using deep learning.

Robotic prosthetic arm I’ve been building!

Before I get into how I built this, let’s look at how today’s prosthetics work and dig into the problem.

How do Prosthetics Work?

Your brain controls the muscles in your limbs by sending electrical commands down the spinal cord and then through peripheral nerves to the muscles. This information pathway would be blocked if you had a limb amputated. The peripheral nerves would still carry electrical motor command signals generated in the brain, but the signals would meet a dead end at the site of amputation and never reach the amputated muscles.

A prosthetic arm tries to restore this pathway so that the wearer can still do daily tasks (although most prosthetics do this without returning any feeling to the person). Today, when the wearer of a prosthetic arm wants to grab something, there are three main ways in which this signal is sent:

The grip mechanism is controlled mechanically, via a cable attached to the opposite shoulder.
The signal to grip is detected using myoelectric sensors (these read muscle activity from the skin).
The signal is measured by sensors implanted inside the muscle.

1. Body-Powered Mechanical Control

Body-powered prostheses are the most commonly used type of artificial arm. A body-powered prosthetic arm is held in place by suction or by a strap system that rests on your shoulders. A cable and harness system controls the device.

There are two main types of body-powered hand prostheses:

Voluntary Open: opens the hand when applying tension to the cable
Voluntary Close: closes the hand when applying tension to the cable

→ Barriers/Current Problems:

Cost: ~$30,000
Rejection rate 16–58% (often uncomfortable)
Restricts the arm’s range of movement
Difficult to finely control grasp and grip

2. Myoelectric Sensors

Myoelectric prostheses do not require a harness and are controlled by nerves and electrical impulses from the wearer’s residual limb. When we move our muscles, small electrical fields are generated around them, which can be measured by electrodes. Sensors or electrodes in the prosthetic socket detect your muscle contractions and send commands to operate the high-performance, battery-powered prosthetic motors.

They often use two electrode sites to sense muscle contractions from two major muscle groups.
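To make the two-site idea a bit more concrete, here’s a minimal sketch of a threshold-based controller. The function name, the 0.3 threshold and the normalized 0.0 to 1.0 signal range are my own assumptions for illustration; they aren’t taken from any specific commercial device.

```python
def dual_site_command(open_site, close_site, threshold=0.3):
    """Map two normalized EMG amplitudes (0.0 to 1.0) to a hand command.

    Hypothetical rule: a site counts as active once it reaches the
    threshold. If both are active, the stronger one wins (ties go to
    "close"). If neither is active, the hand holds its position.
    """
    open_active = open_site >= threshold
    close_active = close_site >= threshold
    if not open_active and not close_active:
        return "hold"
    if open_active and (not close_active or open_site > close_site):
        return "open"
    return "close"
```

For example, `dual_site_command(0.5, 0.1)` asks the hand to open, while two relaxed muscles (both below the threshold) leave it where it is.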

→ Barriers/Current Problems:

Cost: ~$100,000 (expensive)
Batteries need to be recharged often
Lengthy training period to adapt
Difficult to finely control grasp and grip

3. Implanted Myoelectric Sensors (IMES)

The IMES system is a device used to improve the signal quality and consistency of myoelectric signals for prosthetic control.

The end caps serve as electrodes for picking up EMG activity during muscle contraction.
Reverse telemetry (via a coil around the arm) is used to transfer data from the implanted sensor, and forward telemetry is used to transmit power and configuration settings to the sensors.
The coil and associated electronics are housed within the frame of a prosthesis. A control system that sends data associated with muscle contraction to the motors of the prosthetic joints is housed in a belt-worn, battery-powered device. A cable attaches the control unit to the prosthetic frame.

An IMES is implanted into each targeted muscle that will be used to control a function of the prosthetic arm. Two devices would be needed per degree of freedom (DOF): one device would control the fingers opening and another would control the fingers closing.

→ Barriers/Current Problems:

Total Cost: ~$1,382,120 (very expensive)
Invasive
Difficult to finely control grasp and grip

The Problem

To summarize, these are the main problems with current prosthetic arms:

Long training times
Not adaptive
Expensive
Inaccurate grasp and grip control

What if there was a way to have the full functionality of a prosthetic arm with shorter training times, at a much cheaper price, and with more precise grasp and grip control?

This was my goal with this project. Specifically, I mounted a USB camera onto a 3D-printed prosthetic arm that uses a Convolutional Neural Network (CNN) to detect objects and identify ways for the arm to manipulate them. This makes for a cheaper and more effective alternative to current prosthetic arms. Here’s how it addresses each problem:

Expensive: this problem can be solved by using a 3D-printed arm, which can be made for as little as $50.
Inaccurate grasp and grip control: this problem can be solved by using various CNN methods.
Long training times: this problem can be solved by using imitation learning to reduce the training time needed to design customized prosthetics. (I’ll mention this briefly in today’s article and cover it in more depth in another article.)
Not adaptive: this problem can be solved by using various Reinforcement Learning techniques to work with different people and in different environments. (I’ll mention this briefly in today’s article and cover it in more depth in another article.)

My Process:

Building a Pick-and-Place Arm: I started by building a basic pick-and-place arm that used a camera and was programmed on an Arduino. This helped me build out a mini robot arm and test basic grasping techniques:

Designing and Printing 3D Parts: I used Fusion 360 to manipulate some 3D printed prosthetic arm files and got them printed:

Assembling the Arm: after getting the 3D-printed parts, I spent some time putting together the arm and the forearm and attaching the motors to the fingers:

Training the CNN on real-world objects: after building my CNN model, I trained it on the Cornell Grasping Dataset to identify grasping rectangles in images of real objects and to send instructions to the arm to manipulate the objects accordingly.

Mounting Camera, Sensors and Testing: I’m currently working on attaching depth sensors, mounting the camera and running my model on a GPU to test how it performs on novel objects.

For the rest of the article, I’ll be diving deeper into how I built the CNN model for object detection and grasping:

Object Detection and Grasping

My system consists of the 3D printed prosthetic arm, with an NVIDIA GPU and a USB camera:

The camera sends image frames to the GPU, which identifies the type of object and its grasping rectangles.
This information is then sent to the prosthetic arm.
The arm can then move in a way that lets it best interact with the identified object.

For the robotic arm to interact with the identified object we need a grasping implementation which has the following sub-systems:

Grasp detection sub-system: To detect grasp poses from images of the objects in their image plane coordinates.
Grasp planning sub-system: To map the detected image plane coordinates to the world coordinates.
Control sub-system: To determine the inverse kinematics solution for the grasp pose produced by the previous sub-system.
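To give a feel for what the grasp planning sub-system does, here’s a minimal sketch of mapping a pixel plus its depth reading to a 3D point, using the standard pinhole camera model. The function names, the intrinsics (fx, fy, cx, cy) and the camera-to-world transform are placeholder assumptions; in practice they would come from calibrating the actual camera and arm, and this isn’t necessarily how my final system does it.

```python
import numpy as np

def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in metres to the camera frame.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal
    point. All four are assumed to come from camera calibration.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_world(point_cam, T_world_cam):
    """Apply a 4x4 homogeneous transform (camera frame -> world frame)."""
    p = np.append(point_cam, 1.0)  # homogeneous coordinates
    return (T_world_cam @ p)[:3]

# Hypothetical calibration values, for illustration only.
fx = fy = 600.0
cx, cy = 320.0, 240.0
T_world_cam = np.eye(4)
T_world_cam[0, 3] = 0.1  # camera sits 10 cm along the world x-axis

grasp_cam = pixel_to_camera(320, 240, 0.5, fx, fy, cx, cy)
grasp_world = camera_to_world(grasp_cam, T_world_cam)
```

The resulting world-frame point is what the control sub-system would then feed into its inverse kinematics solver.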

The architecture I used includes a grasp region proposal network for the identification of potential grasp regions. The network then partitions the grasp configuration estimation problem into regression over the bounding box parameters, and classification of the orientation angles, from RGB-D image inputs.

Grasp Configurations

Given corresponding RGB and depth images of a novel object, we want to be able to identify the grasp configurations for potential grasp candidates of an object so that we can manipulate it.

We can use the 5-dimensional grasp rectangle as the grasp representation here. It describes the location, orientation, and opening distance of a parallel gripper prior to closing on an object. The 2D oriented rectangle, shown below, depicts the gripper’s location (x, y), orientation (θ), and opening distance (h). An additional parameter describing the length (w) is added for the bounding box grasp configuration.
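Here’s a minimal sketch of how the five parameters (x, y, θ, w, h) can be turned into the four corners of the rectangle, which is handy for drawing predictions on an image. The function name is hypothetical, and I’m assuming that w runs along the rectangle’s own x-axis and h along its own y-axis, with θ in radians.

```python
import math

def grasp_rectangle_corners(x, y, theta, w, h):
    """Return the four corners of an oriented grasp rectangle.

    (x, y) is the centre and theta is the rotation in radians.
    Assumption: w lies along the rectangle's local x-axis and h along
    its local y-axis before the rotation is applied.
    """
    c, s = math.cos(theta), math.sin(theta)
    half_w, half_h = w / 2.0, h / 2.0
    local = [(-half_w, -half_h), (half_w, -half_h),
             (half_w, half_h), (-half_w, half_h)]
    # Rotate each local offset by theta, then shift it to the centre.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in local]
```

With θ = 0 this reduces to an ordinary axis-aligned box centred on (x, y).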

As seen in (c), each element in the feature map is an anchor and corresponds to multiple candidate grasp proposal bounding boxes.

CNN Model:

I used ResNet-50, a 50-layer residual network, for feature extraction and grasp prediction. This architecture helps handle multiple objects in a scene and performs significantly better than AlexNet (which is otherwise commonly used for this task). These are the key features of this model:

Grasp Proposals: This is the first stage of the deep network and it aims to generate grasp proposals across the whole image. The Grasp Proposal Network works as a mini-network over the feature map.
Grasp Orientation as Classification: Rather than performing regression, we formulate the input/output mapping as a classification task for grasp orientation. This means that if none of the orientation classifiers outputs a score higher than the non-grasp class, the grasp proposal is not used.
Multi-Grasp Detection: This last step identifies candidate grasp configurations. It classifies the predicted region proposals from the previous stage into regions for the grasp configuration parameters. This part also refines the proposal bounding box to a non-oriented grasp bounding box (x, y, w, h).
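To illustrate the orientation-as-classification idea, here’s a minimal sketch. The number of angle bins (18), the bin layout over 0 to 180 degrees and the convention that index 0 holds the non-grasp score are all assumptions I’ve picked for the example, not the exact settings of the network.

```python
def angle_to_class(theta_deg, num_bins=18):
    """Map an angle to one of num_bins equal-width orientation classes.

    Angles are wrapped into [0, 180) because a parallel gripper looks
    the same after a half turn. The bin count is an assumed value.
    """
    bin_width = 180.0 / num_bins
    return int((theta_deg % 180.0) // bin_width)

def pick_orientation(scores):
    """Pick the winning orientation class, or None to reject the proposal.

    Assumed layout: scores[0] is the non-grasp class and scores[1:] are
    the orientation classes. The proposal is dropped unless some
    orientation class scores strictly higher than the non-grasp class.
    """
    no_grasp = scores[0]
    best = max(range(1, len(scores)), key=lambda i: scores[i])
    if scores[best] <= no_grasp:
        return None
    return best - 1  # 0-based index over the orientation bins
```

The rejection branch in `pick_orientation` is the rule described above: a proposal is only kept if at least one orientation beats the non-grasp class.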

Structure of multi-object multi-grasp predictor

To Summarize:

The network takes RG-D inputs and predicts multiple grasp candidates, with orientations and rectangle bounding boxes, for each object in the view.
Blue blocks (on the above image) indicate network layers and gray blocks indicate images and feature maps.
Green blocks show the two loss functions. The grasp proposal network slides across anchors of intermediate feature maps from ResNet-50 with k = 3×3 candidates predicted per anchor.
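To show where the k = 3×3 candidates per anchor could come from, here’s a minimal sketch that builds 9 candidate boxes from 3 scales and 3 aspect ratios, the way region proposal networks commonly do. The specific scales and ratios, and the function name, are assumptions for illustration rather than the values used in my model.

```python
import itertools

def anchor_boxes(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return candidate boxes (cx, cy, w, h) centred on one anchor.

    Each box keeps an area of scale**2 while its width-to-height ratio
    follows `ratios`. The scale and ratio values are assumed.
    """
    boxes = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        boxes.append((cx, cy, w, h))
    return boxes
```

Calling `anchor_boxes(100, 100)` gives 9 boxes for a single anchor, and the network would score and refine each of them.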

Dataset

I used the Cornell Dataset which consists of 885 images of 244 different objects, with several images taken of each object in various orientations or poses. Each distinct image is labelled with multiple ground truth grasps corresponding to possible ways to grab the object.

The Cornell dataset is preprocessed to fit the input format of the ResNet-50 network, which mostly consists of resizing the images (to 227×227) and substituting in the depth channel.
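Here’s a minimal sketch of what that preprocessing step could look like with NumPy. I’m assuming that the depth map replaces the blue channel, that depth is rescaled to the 0 to 255 range, and that a simple nearest-neighbour resize is good enough; the function name is hypothetical and a real pipeline would probably use an image library for the resize.

```python
import numpy as np

def to_rgd(rgb, depth, size=227):
    """Resize an RGB image and depth map, then swap depth into one channel.

    rgb is an HxWx3 uint8 array and depth is an HxW float array.
    Assumptions: nearest-neighbour resizing, depth rescaled to 0-255,
    and depth written into the blue channel (index 2).
    """
    h, w = depth.shape
    rows = np.arange(size) * h // size  # nearest source row per output row
    cols = np.arange(size) * w // size  # nearest source column per output column
    rgb_small = rgb[rows][:, cols].astype(np.float32)
    depth_small = depth[rows][:, cols].astype(np.float32)

    span = depth_small.max() - depth_small.min()
    if span > 0:
        depth_scaled = (depth_small - depth_small.min()) / span * 255.0
    else:
        depth_scaled = np.zeros_like(depth_small)  # flat depth map

    out = rgb_small.copy()
    out[..., 2] = depth_scaled
    return out
```

The output is a 227×227×3 float array that keeps red and green and carries depth in the third channel.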

Example Outputs

(b) shows the top grasp outputs for several objects; (c) shows output grasps (red) and ground-truth grasps (green), illustrating that the system may output grasps for which there is no ground truth; (d) shows multi-grasp output for several objects. The green rectangles are ground truth and the red rectangles represent predicted grasps for each unseen object.

Future Applications: Reinforcement Learning

In machine learning, Reinforcement Learning (RL) is about learning what to do, that is, how to map situations to actions. The goal is to maximize a numerical reward signal, but instead of being told which action to take, the learner must discover which actions yield the maximum reward.

The ultimate goals with this section are to figure out:

How to automatically adapt control systems of prosthetic arms to the needs of individual patients.
How to improve limb control based on patient feedback over time.
Both of these would make the arm more adaptive and personalized.

Using RL, we can get an agent to learn the optimal policy for a sequential decision-making task without complete knowledge of the environment.

The agent first explores the environment by taking actions and then updates its policy according to the reward function to maximize the reward. We can use Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) to train the agent.
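To show the explore, get a reward, update loop in its simplest form, here’s a toy sketch using an epsilon-greedy agent on a handful of discrete actions. This is far simpler than DDPG, TRPO or PPO and isn’t what I’d use on the arm; the function name, the reward values and the idea of treating each action as a grip strategy are all made up for illustration.

```python
import random

def train_bandit(reward_fn, n_actions, episodes=500, epsilon=0.1, seed=0):
    """Learn a value estimate for each action by trial and error.

    With probability epsilon the agent explores a random action,
    otherwise it exploits the action with the best current estimate.
    """
    rng = random.Random(seed)
    values = [0.0] * n_actions
    counts = [0] * n_actions
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        reward = reward_fn(action)
        counts[action] += 1
        # Incremental running mean of the rewards seen for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values

# Hypothetical rewards for three grip strategies (higher is better).
grip_rewards = [0.2, 0.9, 0.5]
learned = train_bandit(lambda a: grip_rewards[a], n_actions=3)
best_grip = max(range(3), key=lambda a: learned[a])
```

Nobody tells the agent which grip is best. It only sees rewards, and its estimates end up favouring the grip that pays off the most.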

We could also use imitation learning, where the learner tries to mimic an expert’s actions in order to achieve the best performance, possibly by implementing the DAgger algorithm.
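The core idea of DAgger is to let the learner act, ask the expert what it would have done in the states the learner actually visited, add those labels to one growing dataset, and retrain. Here’s a heavily simplified toy sketch of that loop on a 1D problem. The expert, the dynamics and the one-parameter linear policy are all hypothetical, and I’ve left out the mixing of expert and learner actions that the full algorithm uses.

```python
def expert_action(x):
    """Hypothetical expert: push the state halfway toward zero."""
    return -0.5 * x

def rollout(gain, x0=1.0, steps=5):
    """Run the learner's policy a = gain * x and return the visited states."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + gain * x  # toy dynamics: next state = state + action
    return states

def dagger(iterations=3):
    """Simplified DAgger loop that fits a single policy gain."""
    data_x, data_a = [], []
    gain = 0.0  # untrained learner: always outputs zero action
    for _ in range(iterations):
        for x in rollout(gain):              # states the learner visits
            data_x.append(x)
            data_a.append(expert_action(x))  # expert labels those states
        # Least-squares fit of a = gain * x on the aggregated dataset.
        numerator = sum(x * a for x, a in zip(data_x, data_a))
        denominator = sum(x * x for x in data_x)
        gain = numerator / denominator
    return gain
```

In this toy setup the learner recovers the expert’s gain of about -0.5, because every label it collects is consistent with the same linear rule.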

I know I just dropped a bunch of algorithms but if you’re interested I’ll be going deeper into this in a future article.

Next Steps:

I’m going to continue training my CNN model.
I’m going to test the model with real-time objects to see how it performs.
Research how we could feasibly apply RL to improve the design, training and adaptability of prosthetic arms.
If you want to stay posted on my progress, feel free to follow me or reach out on Twitter.

That’s it for now ✌