Predictive Modeling in Healthcare Part II: Setting up Spark for Jupyter Notebook
February 3, 2019
In the current series of blog posts, we are building a predictive model to predict which inpatients are at risk of dying. In the previous post we downloaded the publicly available Texas inpatient dataset using wget and imported it into a Jupyter notebook using the pandas read_csv() function.
If you downloaded the Texas inpatient dataset using pandas, you may have noticed that it took a couple of minutes for the data to load. One shortcoming of pandas is that by default it does not run in parallel. (There is a newer implementation that parallelizes certain pandas operations using vectorization; it requires installing PyArrow, however, and has been shown to still be slower than applications that run fully in parallel.)
To be able to work with huge datasets that contain hundreds of thousands of rows and hundreds of columns (as healthcare datasets often contain), it is beneficial to move past single-core processing and enter the world of multi-core processing.
In this post, let’s set up Spark to work in our Jupyter notebook. We will then re-import the Texas inpatient data using Spark and benchmark its performance against the pandas import.
For reference, this code ran on a Windows PC with 6GB of RAM; the data file itself is under 1GB.
Installing Spark for Python
If you are using Anaconda Navigator, the easiest approach is to click the “Environments” tab in Navigator and search for pyspark. Once it is found, click the “Apply” button. It should install in a few minutes, and Jupyter will be automatically configured to work with Spark.
If you are using Linux/MacOS and prefer to install PySpark via command line, type the following:
$ pip install pyspark
Note that this method will require additional steps to configure Spark to work with Jupyter.
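One common way to perform that extra configuration is to point the PySpark driver at Jupyter through environment variables. This is a sketch, assuming a bash-style shell; the exact startup file varies by system:

```shell
# Tell PySpark to use the Jupyter notebook server as its driver.
# Add these lines to ~/.bashrc (or ~/.zshrc) and re-source the file.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

# Afterwards, running `pyspark` from the terminal launches Jupyter
# with Spark already available inside the notebook.
```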
If you are using Windows and prefer to work via the command line, you’re on your own.
Verification of Installation
To verify that the installation worked, open a new Jupyter notebook session and try executing the following:
from pyspark.sql import SparkSession, SQLContext
The code should run with no error messages.
Importing the Texas Inpatient Dataset using Spark and Benchmarking the Import
In this post we improved upon the pandas import of the previous post: we re-imported the Texas inpatient data using Spark and used Python's timeit module to confirm that the data import via Spark is orders of magnitude faster than the pandas import.
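The benchmark pattern itself can be sketched with the timeit module. The loader below is a placeholder; in the notebook you would wrap the actual pandas read_csv() call and the Spark read in the same way and compare the two timings:

```python
import timeit

# timeit.timeit calls the function repeatedly and returns the total
# elapsed time in seconds.
def load_data():
    # Stand-in for e.g. pd.read_csv(...) or spark.read.csv(...)
    return [row.split(",") for row in "1,a\n2,b\n3,c".splitlines()]

elapsed = timeit.timeit(load_data, number=100)
print(f"100 runs took {elapsed:.4f} seconds")
```

One caveat: Spark reads data lazily, so a fair comparison should force the work inside the timed function with an action such as df.count().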
Let’s go forward using Spark (since it is so much faster!). In the next post, we will use Spark SQL to describe our data and we will discuss predictive modeling in the face of class imbalance.
Texas Hospital Inpatient Discharge Public Use Data File, Q1, 2012. Texas Department of State Health Services, Austin, Texas. Date Accessed: 2/3/2019.