Making drugs more accessible to people in developing countries using Machine Learning

10 min read · Apr 3, 2019

Bridging the gap for developing countries to access new drugs

At first glance, we look at the picture below and feel happy that this kid from a poor family in Africa is getting some sort of medicine that will probably help him. Well, this is not the truth.

Meet the Kongo family from Nigeria. Their kid was diagnosed with malaria and was prescribed medicine to treat it. Where they live, there isn’t much variety of medicine available, and what is available is either very old or expensive. They cannot afford the medicine he needs. Imported brand-name pharmaceuticals are too expensive, so they turned to counterfeit drugs that they felt might be better than nothing. This is not the case, though: counterfeit drugs are not only ineffective but can even be harmful.

This is not just a reality for the Kongo family but for roughly 60% of people in most developing countries.

A Global Problem in Developing Countries

Developing countries, where 70 percent of the world’s population lives, produce only 7 percent of the drugs they consume. Because drugs are so expensive, people in developing countries lack access to new developments in medicine that could treat and cure their diseases.

Price is the biggest barrier to drug accessibility for people in these developing countries. The second is storage and distribution, but these problems are linked to poor infrastructure in general, which will be a topic for one of my later articles.

Currently, 60% of essential medicines and 70% of new drugs are not accessible to people in Africa, South-East Asia, and the Western Pacific. Often only medicines like analgesics are manufactured in developing countries, while remedies for life-threatening diseases like TB or HIV/AIDS are imported, and therefore are much more expensive.

So these things are clear:

Most (pretty much all) of the drugs in developing countries are shipped from elsewhere;
The cost of buying these medicines and storing them in developing countries is way too high.

I know this is true because:

Bringing a new drug to market is estimated to cost major pharma firms more than $4 billion.
The process of drug discovery is very long and expensive and can take 10–15 years.
Fewer than 10% of drug candidates actually make it to market.

It’s crazy to me that people in developing countries miss out on new developments in drugs that would treat them better than existing, older treatments. So what if we found a way to make the process of drug discovery faster, cheaper, and better, and therefore more accessible to people in developing countries?

Faster drug production = Improved Clinical Trials

Clinical trials are a necessary step in drug development and are conducted throughout the world, both in developed and in developing countries. For terminal illnesses like cancer, patients usually only enroll in a drug trial when existing forms of treatments have failed. On top of that, not all patients diagnosed are eligible to participate.

For those who are eligible, participating in a trial is cost- and time-intensive, and the data collection methods also suck.

The process is inefficient for other stakeholders too: it costs healthcare industries, governments and academic research hospitals billions of dollars, further drives up costs, delays life-saving treatments for patients, and in some cases leads to adverse events. Drug trials take nearly a decade on average and can cost billions of dollars. Many trials also fail due to enrollment issues.

The $65B clinical trials market needs a makeover. This is where Machine Learning comes in to speed up the process + make treatments more effective especially for people in developing countries diagnosed with fatal diseases. I’ll be speaking more about clinical trials later in the article.

Machine Learning for Computational Drug Discovery

I applied my skills in Machine Learning to find a way in which we can use algorithms to do this for us quicker, cheaper and more effectively.

Machine learning is 80% data processing and cleaning and 20% algorithms. This is literally the same in chemistry and biology.

Before we get into how this would work, we need to understand how drugs work.

Drugs: “Mechanism of Action”

Our body is made up of various proteins, and many of these proteins are enzymes, meaning that they speed up the rate of a chemical reaction without being changed themselves. Other proteins are parts of signaling pathways that control highly specific cell adaptations and reactions. These proteins ‘receive’ signals from other molecules, so they can also be called receptors. Molecules that activate a receptor are called agonists, while antagonists block it.

Many drug molecules (compounds) are structurally similar to native ligands (the “original” keys). They do bind to the protein, but they fail to induce a reaction and block it by displacing the native ligand. They are like a fake duplicate key: close enough to fit into the hole, but not the right shape to turn the lock.

Molecule (imatinib) bound to protein (spleen tyrosine kinase).

Proteins and molecules are always in motion, always bouncing and wiggling around, so there is no single, clear static binding pose. There are also other small molecules, like vitamins, which are involved in the reaction.

Most current drugs are “small molecules”. They usually consist of roughly 10–100 atoms, compared to thousands of atoms for proteins. Other classes of medicines exist and are being developed, such as biologic drugs or therapeutic antibodies.

Stages of Drug Development

These are the current stages of drug development:

Target selection and validation. Analyzing biological pathways or identifying “druggable” targets.
Hit discovery. Screen millions of library compounds, as many and as diverse as possible, to uncover novel activities.
Hit to lead. Because (virtual) high-throughput screening, or (V)HTS, produces huge amounts of data with low accuracy, any potential hit has to be confirmed, ideally using multiple independent types of assays.
Lead optimization. Affinity is only one of several factors that decide whether a compound can become a practically viable medicine. Other factors are pharmacodynamics (the biochemical and physical effects of the drug on the living organism) and pharmacokinetics (how the living organism acts on the drug). ADMET (absorption, distribution, metabolism, excretion, and toxicity) covers the main things you’re looking for.
Pre-clinical development. Animal testing and other methods are used here.
Clinical trials. 3 phases with larger groups of patients, to ensure safety and effectiveness.

Where is the place of AI in this process?

I think it is likely that very soon we will be able to go almost directly from a computer to patients with AI-driven drug discovery pipelines.

The initial stage of identifying lead molecules involves live experiments in the lab, which are still very slow and expensive, because we want to find lead molecules as accurately as we can. Even if the goal is to treat cancer, there is no hope of checking the endless variety of small molecules in the lab.

72 million is just the size of a specific database, the total number of small molecules is estimated to be between 10⁶⁰ and 10²⁰⁰, and synthesizing and testing a single new molecule in the lab may cost thousands or tens of thousands of dollars. Obviously, the early guessing stage is really, really important but is very ineffective right now.
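To get a feel for these numbers, here is a quick back-of-the-envelope calculation. The database size and the lower estimate of chemical space come from the paragraph above; the price of $1,000 per molecule is my assumption, taken from the low end of the "thousands or tens of thousands of dollars" range, and the variable names are only illustrative.

```python
# Back-of-the-envelope estimate. All figures come from the text above
# except COST_PER_MOLECULE_USD, which is an assumed low-end value.
DATABASE_SIZE = 72_000_000       # molecules in the database mentioned above
CHEMICAL_SPACE_LOW = 10**60      # lower estimate of possible small molecules
COST_PER_MOLECULE_USD = 1_000    # assumption: low end of "thousands of dollars"

# Cost of synthesizing and testing every molecule in the database once.
database_cost = DATABASE_SIZE * COST_PER_MOLECULE_USD

# Fraction of the (lower-bound) chemical space that the database covers.
coverage = DATABASE_SIZE / CHEMICAL_SPACE_LOW

print(f"Testing the whole database: ${database_cost:,}")
print(f"Share of chemical space covered: {coverage:.1e}")
```

Even at the assumed low-end price, testing this one database exhaustively would cost about $72 billion, and it would still cover a vanishingly small share of the molecules that could exist.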

We can use machine learning models to try and choose the molecules that are most likely to have the right properties. We are basically generating a molecule from scratch, and not just some molecule, but a promising candidate for a drug. We can use Generative Adversarial Networks (GANs) to do this.

Generative adversarial networks (GANs)

GANs are a class of neural networks that aim to learn to generate objects from a certain class, e.g., images of human faces or bedroom interiors. To perform generation, GANs have two parts that are in competition with each other:

the generator, which is trying to generate new objects that are supposed to pass for “true” data points;
the discriminator, which is trying to distinguish between real data points and the ones produced by the generator.

Basically, the discriminator learns to spot the generator’s fake images, while the generator learns to fool the discriminator.
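The competition between the two parts can be written down as two loss functions. The snippet below is a minimal sketch, not a full GAN: it only computes the losses for some made-up discriminator outputs, and it assumes the commonly used binary cross-entropy form with the "non-saturating" generator loss. The function names and numbers are my own illustrations.

```python
import math

def binary_cross_entropy(prediction, target):
    """Loss for one probability `prediction` against a 0/1 `target`."""
    eps = 1e-12  # keeps log() away from zero
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants D(real) close to 1 and D(fake) close to 0."""
    real_term = sum(binary_cross_entropy(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(binary_cross_entropy(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants the discriminator to output 1 on its fakes."""
    return sum(binary_cross_entropy(p, 1) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
d_on_real = [0.9, 0.8, 0.95]
confident = [0.1, 0.2, 0.05]   # the discriminator easily spots the fakes
fooled = [0.6, 0.7, 0.55]      # the generator has improved

print(discriminator_loss(d_on_real, confident), generator_loss(confident))
print(discriminator_loss(d_on_real, fooled), generator_loss(fooled))
```

When the fakes get better, the generator's loss goes down and the discriminator's loss goes up, which is exactly the tug-of-war described above. In real training, both networks would be updated in turn to reduce their own loss.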

This is what the general scheme looks like:

This might seem a bit confusing at first because GANs are usually trained on continuous data such as images, and atomic structure is not continuous at all. Still, GANs can work for generating molecules as well. Let’s find out how.

Adversarial Autoencoders

Kadurin et al. have presented an architecture for generating lead molecules based on a variation of the GAN idea called Adversarial Autoencoders (AAE). In AAE, the idea is to learn to generate objects from their latent representations. Autoencoders are neural architectures that take an object as input… and try to return the same object as output. Sounds easy, but the catch is that the input must pass through a middle layer that learns a latent representation (that is, a set of features that encode the input in such a way that subsequent layers can decode the object back).

This works either by making the middle layer smaller (lower-dimensional) than the input and output, or by using special regularization techniques. In either case, it’s impossible to simply copy the input through all layers, so the autoencoder has to extract the really important stuff.
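Here is a small sketch of the bottleneck idea in plain Python. It is an untrained, purely linear autoencoder with random weights, so the reconstruction will be poor; the point is only to show the shapes. The layer sizes (8 inputs, 2 latent values) and the helper names are my own choices for illustration.

```python
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows, n_in columns); untrained."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def apply_layer(weights, vector):
    """Plain matrix-vector product: one linear layer without bias."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

INPUT_DIM, LATENT_DIM = 8, 2   # illustrative sizes; the bottleneck is 2 numbers

rng = random.Random(0)
encoder = make_layer(INPUT_DIM, LATENT_DIM, rng)
decoder = make_layer(LATENT_DIM, INPUT_DIM, rng)

x = [1, 0, 0, 1, 1, 0, 1, 0]                    # toy input object
latent = apply_layer(encoder, x)                # compressed representation
reconstruction = apply_layer(decoder, latent)   # attempt to rebuild the input

# Training (not shown) would adjust the weights to minimize this error.
error = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / INPUT_DIM
print(len(x), len(latent), len(reconstruction), error)
```

Because everything has to squeeze through two numbers, the network cannot just copy the eight inputs; training forces those two numbers to carry whatever matters most for rebuilding the input.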

So what did Kadurin et al. do? They took a conditional adversarial autoencoder and trained it to generate fingerprints of molecules, with desired properties serving as conditions. In their model, adversarial autoencoders (AAE) were trained on a data set of fingerprints for molecules that had been known to be effective against a certain target. The resulting model was able to capture the underlying patterns in fingerprint structure. It was then used to propose new structures that could correspond to other effective molecules, and generated fingerprints were matched against a library of known molecules to select the most relevant molecular structures.

The architecture looks just like the autoencoder above, but with two important additions in the middle:

on top, there is a discriminator that tries to distinguish the distribution of latent representations from some standard distribution;
on the bottom, there is a condition that in this case encodes desired properties of the molecule; we train on the molecules with known properties, and the problem is then to generate molecules with desired combinations of properties.
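Once such a model is trained, generating a new fingerprint roughly means: sample a latent code, attach the condition, and decode. The snippet below is only a sketch of that wiring, not the model from the paper. The decoder is an untrained single layer with random weights, the "standard distribution" is assumed to be a standard normal, and the sizes and names (LATENT_DIM, CONDITION_DIM, generate_fingerprint) are hypothetical. The 166 output bits match the number of MACCS keys.

```python
import math
import random

LATENT_DIM = 4          # illustrative size of the latent code
CONDITION_DIM = 2       # hypothetical: two desired-property values
FINGERPRINT_BITS = 166  # number of MACCS keys

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def make_decoder(rng):
    """Untrained stand-in for the decoder: one linear layer."""
    n_in = LATENT_DIM + CONDITION_DIM
    return [[rng.gauss(0, 1) for _ in range(n_in)]
            for _ in range(FINGERPRINT_BITS)]

def generate_fingerprint(decoder, condition, rng):
    # 1. Sample a latent code from the standard distribution that the
    #    discriminator pushes the encoder's outputs towards (assumed normal).
    z = [rng.gauss(0, 1) for _ in range(LATENT_DIM)]
    # 2. Attach the condition describing the desired properties.
    decoder_input = z + list(condition)
    # 3. Decode into one probability per fingerprint bit, then threshold.
    probs = [sigmoid(sum(w * v for w, v in zip(row, decoder_input)))
             for row in decoder]
    return [1 if p > 0.5 else 0 for p in probs]

rng = random.Random(0)
decoder = make_decoder(rng)
fingerprint = generate_fingerprint(decoder, condition=[1.0, 0.0], rng=rng)
print(len(fingerprint), sum(fingerprint))
```

The discriminator matters here because it is what makes step 1 legitimate: if the latent codes of real molecules look like samples from the standard distribution, then fresh samples from that distribution should decode into plausible fingerprints.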

So how do we generate discrete structures like molecules? You can use a standard representation of a molecule called a MACCS fingerprint: a set of binary characteristics of the molecule, such as “how many oxygens it has” or “does it have a ring of size 4”.

Basically, the model translates the desired properties of a molecule into more “low-level” properties of the molecular structure, encoded as a MACCS fingerprint. Then a simple screening of the database can find molecules with the fingerprints most similar to the generated ones.
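The screening step can be sketched in a few lines. I am assuming Tanimoto similarity here, which is a common way to compare binary fingerprints, but the text above does not say which measure was actually used. The 8-bit fingerprints and molecule names are made up to keep the example readable; real MACCS fingerprints have 166 keys.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit lists."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

def screen(generated_fp, library, top_k=2):
    """Return the top_k library entries most similar to generated_fp."""
    scored = [(tanimoto(generated_fp, fp), name) for name, fp in library.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy 8-bit fingerprints with made-up names.
library = {
    "molecule_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "molecule_b": [0, 0, 1, 1, 0, 1, 0, 1],
    "molecule_c": [1, 1, 0, 1, 1, 1, 0, 0],
}
generated = [1, 1, 0, 0, 1, 0, 0, 0]

for score, name in screen(generated, library):
    print(f"{name}: {score:.2f}")
```

In this toy library, molecule_a shares three of its four set bits with the generated fingerprint and comes out on top, so it would be the first candidate to look at.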

Encoding Structure of a Molecule

There are many different ways to encode the structure of a molecule. Here is one common way:

Voxel-Based Representations: A way of representing solids (with applications to representing density functions). Usually, a 3-D cubical array, with each element holding one (or more) data value (boolean, real).

To do this, we first apply principal component analysis (PCA) to extract the primary axes of the molecule. We then translate the molecule to the origin and orient it along the extracted directions. Finally, we discretize the 3D space into a regular grid with an element size of 0.5 Å, which ensures that no two atoms fall into the same voxel (3D cell).
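The steps above can be sketched with NumPy. This is a minimal version under my own assumptions: PCA is done through an eigendecomposition of the covariance matrix, the atom coordinates are made up, and the function returns integer grid indices per atom rather than a filled 3D array. The name voxelize is hypothetical.

```python
import numpy as np

def voxelize(coords, voxel_size=0.5):
    """Map atom coordinates (in angstroms) to integer voxel indices."""
    coords = np.asarray(coords, dtype=float)
    # Translate the molecule so its centroid sits at the origin.
    centered = coords - coords.mean(axis=0)
    # PCA: eigenvectors of the covariance matrix give the primary axes.
    covariance = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]   # largest variance first
    axes = eigenvectors[:, order]
    # Orient the molecule along the extracted directions.
    aligned = centered @ axes
    # Discretize space into a regular grid of cubic cells.
    return np.floor(aligned / voxel_size).astype(int)

# Hypothetical coordinates for a small 4-atom fragment, in angstroms.
atoms = [
    (0.0, 0.0, 0.0),
    (1.5, 0.0, 0.0),
    (2.2, 1.3, 0.0),
    (3.7, 1.3, 0.4),
]
voxels = voxelize(atoms)
print(voxels)
```

Each row of the result is the grid cell of one atom. Since the atoms in this example are all more than a voxel diagonal apart, every atom lands in its own cell.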

We then select molecules that contain only these nine atom types, so that each voxel can be described by a 9-dimensional vector. The resulting representation for a sample molecule is shown where different colors show different atoms:

Faster, cheaper, better Drug discovery

It is clear to me that with GANs and other Machine Learning methods we can help eliminate a large part of the drug discovery process, which is also the part that costs the most. By making this process faster and cheaper, we will be able to help families like the Kongo family have better access to medicine. I will be honest: this will not remove all barriers, but it will go a long way toward ensuring that families and countries can start affording new medicine.

What will it take to make significant progress in drug discovery? In my view, the most important thing needed is high-quality data. Moving forward, I am confident we can overcome this issue and start to create really cheap and readily available medicine. This is just a stepping stone.

I’m Alishba Imran.

I am a Blockchain, VR and Machine Learning developer interested in medicine and healthcare. If you want to stay up to date with my progress, feel free to follow me on LinkedIn and Medium! If you enjoyed reading this article, please press the 👏 button and share!