Predictive Modeling in Healthcare Part I: Downloading the Texas Public Use Inpatient Data
February 1, 2019
In this series of blog posts, I will describe how to build a model that predicts which patients admitted to the hospital are at risk of dying.
In this first post, let’s download our healthcare dataset: the Texas public use data files.
One of the obstacles for predictive modeling in healthcare is that good healthcare data is hard to find. Fortunately, a little-known fact is that the Texas government has made millions of inpatient records public. These records have been de-identified and are available at the following website. Downloading the data first requires accepting the Data Use Agreement.
wget (see instructions in next section)
The first and second items can be bypassed if you manually download and unzip the file yourself.
The code in this post was run on a Windows PC with 6 GB of RAM. The data file itself is under 1 GB.
Because working as a professional data scientist often requires downloading data and tools automatically, let’s do that. We will use the wget program to download the data; it is available for Windows, Mac, and Linux.
If you are using Windows, first navigate to the above link to download wget. Take note of the installation directory. When wget has finished installing, open the Windows command prompt and add the installation directory to your path:
> set path=%PATH%;"C:\Program Files (x86)\GnuWin32\bin"
To confirm that the command prompt can now access wget, type this line:
> wget -h
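The same check can be made from Python, which is handy if you later script the whole workflow. This is a minimal sketch using the standard library's shutil.which; the helper name find_tool is my own and is only for illustration.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on the PATH, or None."""
    return shutil.which(name)


# Report whether wget is visible to the current process.
wget_path = find_tool("wget")
if wget_path is None:
    print("wget was not found on the PATH")
else:
    print("wget found at", wget_path)
```

Note that a path set with the set command only lasts for that command prompt session, so run Python from the same window.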
Downloading the Texas inpatient dataset
Now that you have wget working, let’s use it to download the first tab-delimited data file for Quarter 1 of 2012. At the command prompt, type the following:
> wget --no-check-certificate --content-disposition
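If you would rather stay in Python, the download can also be scripted with the standard library. The sketch below is an alternative to wget, not a copy of what it does: the function name download_file and the DATA_URL placeholder are my own, and you would substitute the address of the data file yourself.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Stream the resource at `url` to the local path `dest` and return it."""
    dest = Path(dest)
    # Copy in chunks so the whole file never has to fit in memory at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (DATA_URL is a placeholder for the data file's address):
# download_file(DATA_URL, "PUDF_base1_1q2012_tab.zip")
```

Two differences from the wget command are worth knowing. urlopen verifies TLS certificates by default, whereas --no-check-certificate skips that check, and the output file name is passed in explicitly rather than taken from the server's response as --content-disposition does.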
Now you must unzip the file. To do so at the command line with the jar tool, you must have the JDK installed and on your path. Type the following:
> jar xf PUDF_base1_1q2012_tab.zip
If you use the dir command in the same directory, you should now see that a file named “PUDF_base1_1q2012_tab.txt” exists.
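If you do not have the JDK installed, Python's standard zipfile module can do the same extraction. Here is a minimal sketch; the helper name unzip is my own, and it assumes the archive sits in the current working directory.

```python
import zipfile


def unzip(archive, target_dir="."):
    """Extract every member of `archive` into `target_dir`; return member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Example call, assuming the download step above has completed:
# print(unzip("PUDF_base1_1q2012_tab.zip"))
```

The returned list of member names is a quick way to confirm that the text file came out of the archive.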
Importing the data into your Jupyter notebook session
After opening Jupyter and starting a new notebook, type the following:
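One way to do the import is sketched below using only the standard library's csv module. It rests on a few assumptions of mine: the unzipped file is in the notebook's working directory, its first line is a header row, and it can be read as UTF-8. The function name read_tab_file and the max_rows parameter are likewise my own.

```python
import csv


def read_tab_file(path, max_rows=None, encoding="utf-8"):
    """Read a tab-delimited file with a header row; return (header, rows)."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)  # assumed: the first line holds column names
        rows = []
        for i, row in enumerate(reader):
            # Stop early when only a preview of the file is wanted.
            if max_rows is not None and i >= max_rows:
                break
            rows.append(row)
    return header, rows


# Example call: preview the first five records of the downloaded file.
# header, rows = read_tab_file("PUDF_base1_1q2012_tab.txt", max_rows=5)
# print(len(header), "columns")
```

Reading a few rows first is a cheap way to check the delimiter and column names before loading everything. In a notebook, pandas.read_csv with sep="\t" is a common alternative that returns a DataFrame instead of lists.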
In this post we have downloaded freely available inpatient clinical data and imported it into a Jupyter session.
In the next post we will describe the dataset in more detail and use some Python commands to explore the dataset.
Texas Hospital Inpatient Discharge Public Use Data File, Q1, 2012. Texas Department of State Health Services, Austin, Texas. Date Accessed: 2/1/2019.