Hospitals are losing $41.3B due to this problem

Here’s how I solved it

Alishba Imran
DataDrivenInvestor


One of the biggest problems in healthcare is the rising cost of patient readmission. This is when a hospital that recently discharged a patient is penalized for having that patient return to a hospital within 30 days. These penalties have cost hospitals as much as $528 million in a single year.

American hospitals alone have spent over $41 billion on diabetic patients who were readmitted within 30 days of discharge 🛏️. Being able to determine the factors that lead to higher readmission in such patients, and to predict which patients will be readmitted, can help hospitals save millions of dollars while improving quality of care.

I’m a 16-year-old Machine Learning innovator 🤓 who recently got very passionate about systemic problems in our healthcare system. After speaking with top data scientists and researchers at hospitals like UHN and SickKids, I realized a lot of the problem is that we aren’t fully leveraging EHR data to make better decisions.

I’ve spent some time exploring different machine learning algorithms (such as the gradient boosting classifier) and platforms such as SMART on FHIR, along with risk scores, data visualizations, and preventative measures, to solve this problem. I’ve built something that predicts readmission with 92% accuracy, and I’ll walk through how I built it below.

Breaking down the problem

By identifying patients at risk for these conditions and pinpointing which patients are the most likely to end up back in the hospital after discharge, we can take preventative measures beforehand.

The first part of the problem is identifying what factors are the strongest predictors of hospital readmission in diabetic patients. There are a lot of factors to consider, and not all of them are available for every patient:

Examples of 79 factors to consider

Here we run into a data problem: we often aren’t collecting the right data specific to a patient. There is no interactive way for us to engage with patients one-on-one after they touch base with the hospital and collect real-time data over time. This problem can be significantly reduced by improving our patient portals.

Our current patient portals aren’t sufficient.

Patient portals today have limited capabilities and often don’t provide the patient with much value. Most offer only basic capabilities; patients can typically view:

  • General medical and health information
  • Laboratory and test results
  • Appointment details
  • Electronic means of communicating with your physician

However, most patients still feel that they don’t get many useful health updates from these apps, and physicians can’t rely on them to make informed decisions. If we constantly collect data through the patient portal and provide patients with risk scores and preventative measures, we can improve the experience and 10x the value we get from this platform. This is where Electronic Health Records (EHRs) come into play.

Why EHR?

More than a third of hospitals in the United States are operating at a loss and many of them are unable to serve all patients.

Patients treated in hospitals with advanced EHRs cost, on average, $731, or ~10%, less than patients admitted to hospitals without advanced EHRs, after controlling for patient and hospital characteristics.

By adopting EHRs globally, our health system could save as much as $78 billion a year through drivers such as reducing medical errors, eliminating redundant testing, and promoting preventive care by opening access to information.

But we still haven’t tapped into the full potential of these systems to be a gateway for solving important problems like patient readmissions. Electronic health records still have the potential to make health care more predictive, preventive, and precise. These are the top three shortcomings of EHR systems today:

  • Making these systems more versatile for different types of functions + better integration into workflow (lab reports, patient data, risk scores)
  • Synthesizing data in a way that adds more value to the patient and the physician (predictive models, data visualizations)
  • Using artificial intelligence to synthesize patient records, combine them with the medical literature, and provide insights at the point of care

With these problems in mind, I decided to develop Everva as a way to solve them by integrating risk scores, data visualizations, and preventative measures.

Patient Portal: Everva

Dashboard for Everva

Everva is an EHR web app that integrates with existing EHR systems to help patients interpret their health data and to help physicians better follow up on patients' conditions after discharge.

I’ve built this application using the SMART on FHIR platform to create an interface that provides interactive visualizations of clinically validated risk scores and longitudinal data derived from a patient’s clinical measurements.

These visualizations allow patients to investigate the relationships between key clinical measurements and risk over time.

Patients get the opportunity to actively improve their health based on an increased understanding of the longitudinal information available in EHRs, and to begin a dialogue with their healthcare providers. Healthcare providers, in turn, can continuously measure the risks a patient faces and decide how best to treat them, so that readmissions become less likely or at least better accounted for.

The applications used are:

The prediction part of my web app uses machine learning to forecast hospital readmissions, and here’s how that works:

Machine learning to predict hospital readmissions

The next part of the solution is using machine learning to predict hospital readmission in diabetic inpatients with an accuracy of 92%.

In the US, the CMS Hospital Readmissions Reduction Program collects penalties on Medicare payments each fiscal year. Hospitals with excess readmissions pay a penalty ranging from 0.5–3% of those payments.

There is a huge difference in postindex hospitalization costs among three groups of patients: $73,252 for patients readmitted within 30 days and $62,053 for those readmitted beyond 30 days, versus $5,719 for patients not readmitted.

Hospital readmission costs are much higher than initial admission costs for about two-thirds of common diagnoses. On average, a readmission costs $7,400. Most of this cost is treatment for the disease (it costs more to fix once the patient comes back, since the case becomes more complex). Then, depending on how many readmissions a hospital has on a statistical basis compared to other hospitals, it may be charged a penalty percentage on its Medicare payments.

It costs hospitals around $7,400 per readmission, but only $1,000–$2,000 per patient to keep them out of the hospital, depending on the condition and care needed.

If we reduced readmissions by 92% for 1M patients, we could save roughly $6.7B ($7,400 × 920K − $1,000 × 80K).
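The back-of-envelope figure above can be checked directly; the snippet below just evaluates the formula as written in the text:

```python
# Back-of-envelope check of the savings estimate, using the
# article's formula: $7,400 saved per avoided readmission for
# 920K patients, minus $1,000 for the remaining 80K patients.
avoided = 920_000
remaining = 80_000
savings = 7_400 * avoided - 1_000 * remaining
print(f"${savings:,}")  # $6,728,000,000, roughly $6.7B
```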

According to Ostling et al., patients with diabetes are almost twice as likely to be hospitalized as the general population. This is why I wanted to focus on predicting hospital readmission for patients with diabetes.

The general steps I followed to build this model were:

  • selecting a dataset
  • feature engineering
  • building training/validation/test samples
  • model selection (I explored several models here, each with its own benefits: logistic regression, decision tree, random forest, and gradient boosting classifier)
  • model evaluation

Overview of the process

Dataset

I used the publicly available dataset from the UCI repository (link), containing de-identified diabetes patient encounter data from 130 US hospitals: 101,766 observations over 10 years. The dataset has over 50 features, including patient characteristics, conditions, tests, and 23 medications.

Breakdown of the data

Preprocessing and feature engineering

Before we can get to actual modeling, some wrangling with the data is almost always needed. I applied three types of methods here:

  1. Cleaning tasks such as dropping bad data, dealing with missing values.
  2. Modification of existing features e.g. standardization, log transforms etc.
  3. Creation or derivation of new features, usually from existing ones.
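The three method types above can be sketched in pandas. The mini-DataFrame below is a hypothetical stand-in, assuming (as in the UCI file) that missing values are coded as "?":

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a slice of the UCI diabetes data,
# where missing values are coded as "?".
df = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "time_in_hospital": [3, 8, 1],
})

# 1. Cleaning: recode "?" as NaN, then fill or drop.
df = df.replace("?", np.nan)
df["race"] = df["race"].fillna("Unknown")

# 2. Modification: log-transform a skewed numeric column.
df["log_time_in_hospital"] = np.log1p(df["time_in_hospital"])

# 3. Derivation: a new flag built from an existing column.
df["long_stay"] = (df["time_in_hospital"] > 7).astype(int)
```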

We will break down this section into numerical features, categorical features and extra features.

Numerical Features

These features do not need any modification. The numerical columns we will use are shown below:
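For reference, the eight numerical columns in the UCI dataset look like this (treat the exact names as an assumption based on the public CSV):

```python
# Numerical columns kept as-is for the model (names as they
# appear in the public UCI CSV; treat them as an assumption).
cols_num = [
    "time_in_hospital", "num_lab_procedures", "num_procedures",
    "num_medications", "number_outpatient", "number_emergency",
    "number_inpatient", "number_diagnoses",
]
```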

Categorical Features

The next type of features we want to create are categorical variables. Categorical variables are non-numeric data such as race and gender. I use one-hot encoding to turn this non-numerical data into numerical variables.

The first set of categorical data we will deal with are these columns:

In one-hot encoding, you create a new column for each unique value in that column. Then the value of the column is 1 if the sample has that unique value or 0 otherwise.

The get_dummies function does not encode numerical data, so we have to convert the numerical data from the 3 ID columns into strings; then it will work properly.

Now we are ready to make all of our categorical features:
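A minimal sketch of this step with pandas (the column names here are illustrative, not the full feature set):

```python
import pandas as pd

df = pd.DataFrame({
    "race": ["Caucasian", "AfricanAmerican"],
    "admission_type_id": [1, 3],  # numeric ID code, must become a string
})

# get_dummies skips numeric columns, so cast the ID codes to str first.
df["admission_type_id"] = df["admission_type_id"].astype(str)
df_cat = pd.get_dummies(df, columns=["race", "admission_type_id"])

print(sorted(df_cat.columns))
# ['admission_type_id_1', 'admission_type_id_3',
#  'race_AfricanAmerican', 'race_Caucasian']
```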

Feature Engineering: Summary

Through this process we created 143 features for the machine learning model. The break-down of the features is:

  • 8 numerical features
  • 133 categorical features
  • 2 extra features

Training/Validation/Test Samples

So far we have explored our data and created features from the categorical data. It is now time to split our data, so that we can measure how well the model does on unseen data.

I split the data so that it was 70% train, 15% validation, and 15% test.
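One way to get a 70/15/15 split with scikit-learn is to split twice, stratifying on the label so prevalence stays consistent (synthetic data stands in for the real feature matrix here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(1000, 5)              # stand-in feature matrix
y = rng.randint(0, 2, size=1000)   # stand-in readmission labels

# First carve off 30%, then split that half-and-half into
# validation and test; stratify keeps prevalence consistent.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest)

print(len(X_train), len(X_valid), len(X_test))  # 700 150 150
```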

We can check what percent of our groups are hospitalized within 30 days.

The prevalence is about the same for each group.

Model Selection

Over 80% of the time in a project like this is actually spent cleaning and preparing data. With the data ready, I trained a few machine learning models and used a few techniques to optimize them.

I compared and contrasted: Logistic regression, Decision tree, Random forest, and Gradient boosting.

Logistic Regression

Logistic regression is a traditional machine learning model. It fits a linear function of the features, which is then passed through a sigmoid function to calculate the probability of the positive class. One advantage of logistic regression is that the model is interpretable. We can use logistic regression with the following code from scikit-learn:
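A minimal version of that call (on synthetic data, since the engineered features aren't reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the engineered readmission features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# One coefficient per feature is what makes the model interpretable.
print(clf.coef_.shape)  # (1, 20)
```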

Decision tree

The simplest tree-based method is known as a decision tree. You keep asking whether a sample has a specific variable greater than some threshold, and split the samples accordingly. The final prediction is the fraction of positive samples in the final leaf of the tree. One advantage of tree-based methods is that they make no assumptions about the structure of the data and can pick up non-linear effects given sufficient tree depth. We can use decision trees with the following code:
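A sketch of that call, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Capping max_depth limits how many threshold questions the tree
# can ask before settling on a leaf.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)
```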

Random forest

One disadvantage of decision trees is that they tend to overfit very easily by memorizing the training data. Random forests were created to reduce this overfitting. In random forest models, multiple trees are created and their results aggregated. To use random forests, we can use the following code:
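A minimal sketch of that call:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 100 trees, each trained on a bootstrap sample of the data;
# their votes are aggregated to reduce overfitting.
forest = RandomForestClassifier(n_estimators=100, max_depth=6,
                                random_state=42)
forest.fit(X, y)
```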

Gradient boosting classifier

Another approach to improving decision trees is a technique called boosting. In this method, you create a sequence of shallow trees, each trying to correct the errors of the previously trained trees (via gradient descent on the loss). To use the gradient boosting classifier, we can use the following code:
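A minimal sketch of that call:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Shallow trees (max_depth=3) are added one at a time, each fit
# to the errors of the ensemble built so far.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbc.fit(X, y)
```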

Model Selection: Feature Importance

We can use feature importance for models such as Logistic Regression or Random Forests.

We can get the feature importance from logistic regression using the following:
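One way to do this is to read the fitted coefficients off the model and rank features by them (the feature names below are placeholders, not the real engineered features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
feature_names = [f"feat_{i}" for i in range(10)]  # placeholder names

clf = LogisticRegression(max_iter=1000).fit(X, y)
coefs = clf.coef_[0]

# The most positive coefficients push toward readmission;
# the most negative push away from it.
order = np.argsort(coefs)
top_positive = [feature_names[i] for i in order[::-1][:3]]
top_negative = [feature_names[i] for i in order[:3]]
```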

We can look at the top 50 positive and top 50 negative coefficients to get some insight.

Positive Feature Importance Score
Negative Feature Importance Score

Model Selection: Hyperparameter Tuning

The next thing to investigate is hyperparameter tuning. Hyperparameters are essentially the design decisions you made when you set up the machine learning model. For example, what is the maximum depth for your random forest? Each of these hyperparameters can be optimized to improve the model.
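For example, a small grid search over two gradient-boosting hyperparameters (the grid values here are illustrative; a real search would cover more values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each combination is scored with 3-fold cross-validated AUC.
param_grid = {"max_depth": [2, 3], "n_estimators": [50, 100]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
```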

The gradient boosting classifier is the best on the validation set.

Model Evaluation

Now that we have selected our best model (the optimized gradient boosting classifier), let’s evaluate its performance on the test set.
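The evaluation boils down to scoring held-out predictions; on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

# Fit on the training split, score on the held-out test split.
model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"accuracy={acc:.3f}, AUC={auc:.3f}")
```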

Next Steps

My personal next steps are to keep training the model and improving the accuracy even more. I want to dive deeper and better understand the challenges with data at hospitals.

I’ve also been in talks with data scientists and researchers at top institutions like UHN and SickKids about possibly working with them to deploy a similar model.

Hi. I’m Alishba.

I’m a 16-year-old Machine Learning and Blockchain developer and researcher.

If you’re working in data science at a hospital or healthcare research institute, I would really appreciate chatting:

Email: alishbai734@gmail.com

Twitter: https://twitter.com/alishbaimran_

Linkedin: https://www.linkedin.com/in/alishba-imran-/
