Predictive Modeling in Healthcare Part II: Setting up Spark for Jupyter Notebook

In this series of blog posts, we are building a model to predict which inpatients are at risk of dying. In the previous post, we downloaded the publicly available Texas inpatient dataset using wget and imported it into a Jupyter notebook with the pandas read_csv() function.

If you imported the Texas inpatient dataset using pandas, you may have noticed that it took a couple of minutes for the data to load. One shortcoming of pandas is that, by default, it does not run in parallel (there are newer extensions that vectorize or parallelize certain pandas operations, but these require installing PyArrow and have still been shown to be slower than fully parallel frameworks).

To work with huge datasets containing hundreds of thousands of rows and hundreds of columns (as healthcare datasets often do), it is beneficial to move past single-core processing and enter the world of multi-core processing.

In this post, let’s set up Spark to work in our Jupyter notebook. We will then re-import the Texas inpatient data using Spark and benchmark its performance compared to importing via pandas.

Requirements

  • A working installation of Jupyter notebook (Version 5.5.0 was used for this post)
  • A working installation of Anaconda Navigator 3 (Version 1.8.7 was used for this post. This is not an absolute requirement; feel free to install Jupyter and PySpark using alternate means.)
  • JDK >= 1.8.0 installed (a requirement for PySpark)
  • The code in this post was run on a Windows PC with 6 GB of RAM; the data file itself is under 1 GB.

Installing Spark for Python

If you are using Anaconda Navigator, the easiest approach is to open the “Environments” tab, search for pyspark, select the package, and click “Apply”. It should install in a few minutes, and Jupyter will be automatically configured to work with Spark.

If you are using Linux/macOS and prefer to install PySpark via the command line, type the following:

    $ pip install pyspark

Note that this method requires additional steps to configure Spark to work with Jupyter; one common approach is sketched below.
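
One such approach (an assumption on my part, not the only option) is the findspark helper package, which locates your Spark installation and makes it importable from any notebook:

    $ pip install findspark

Then, at the top of a new notebook, before any PySpark imports:

    # Locate the Spark installation and add it to sys.path.
    import findspark
    findspark.init()

    # This import should now succeed.
    import pyspark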

If you are using Windows and prefer to work via the command line, you’re on your own.

Verification of Installation

To verify that the installation worked, open a new Jupyter notebook session and try executing the following:

    from pyspark.sql import SparkSession, SQLContext

The code should run with no error messages.
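
As an optional further check, you can start a local SparkSession and print its version. This is a minimal sketch; the application name is arbitrary:

    from pyspark.sql import SparkSession

    # Create (or retrieve) a SparkSession that uses all local cores.
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("verify-pyspark") \
        .getOrCreate()

    # Prints the installed Spark version.
    print(spark.version)

    # Shut the session down when finished.
    spark.stop()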

Importing the Texas Inpatient Dataset using Spark and Benchmarking the Import
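
Below is a minimal sketch of the import and the benchmark. The file name PUDF_base1_1q2012_tab.txt is a placeholder; substitute the name of the file you downloaded in the previous post, and adjust sep if your copy is not tab-delimited. We build one SparkSession, wrap each import in a function, and time both with timeit:

    import timeit

    import pandas as pd
    from pyspark.sql import SparkSession

    # A local SparkSession that uses all available cores.
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("tx-inpatient-import") \
        .getOrCreate()

    # Placeholder file name; substitute your own.
    DATA_FILE = "PUDF_base1_1q2012_tab.txt"

    def load_with_pandas():
        # Single-threaded read of the whole file.
        return pd.read_csv(DATA_FILE, sep="\t", low_memory=False)

    def load_with_spark():
        # Parallel read; inferSchema=True forces a real pass over the
        # data, so the timing is not just measuring lazy bookkeeping.
        return spark.read.csv(DATA_FILE, sep="\t",
                              header=True, inferSchema=True)

    # number=1 because each load is expensive; repeat for stabler numbers.
    pandas_seconds = timeit.timeit(load_with_pandas, number=1)
    spark_seconds = timeit.timeit(load_with_spark, number=1)

    print(f"pandas: {pandas_seconds:.1f} s, Spark: {spark_seconds:.1f} s")

Exact timings will vary with your hardware, but the Spark read should come in dramatically faster than the pandas one.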

Conclusion

In this post we have improved upon the pandas import from the previous post:

  • We have set up PySpark to work with Jupyter notebook;
  • We have imported the same Texas inpatient dataset using Spark rather than pandas;
  • We have used Python’s timeit module to confirm that the data import via Spark is orders of magnitude faster than the pandas import.

Let’s go forward using Spark (since it is so much faster!). In the next post, we will use Spark SQL to describe our data and we will discuss predictive modeling in the face of class imbalance.

References

Texas Hospital Inpatient Discharge Public Use Data File, Q1 2012. Texas Department of State Health Services, Austin, Texas. Accessed: February 3, 2019.
