Predictive Modeling in Healthcare Part I: Downloading the Texas Public Use Inpatient Data

In this series of blog posts, I will describe how to build a model that predicts which patients admitted to the hospital are at risk of dying.

In this first post, let’s download our healthcare dataset: the Texas public use data files.

One obstacle to predictive modeling in healthcare is that good healthcare data is hard to find. Fortunately, though it is not widely known, the Texas government has made millions of inpatient records public. These records have been de-identified and are available from the Texas Department of State Health Services website. Downloading the data first requires accepting the Data Use Agreement.

Requirements

  • Working installation of wget (see instructions in next section)
  • Working installation of JDK (to unzip the file)
  • Working installation of Jupyter notebook

    The first two items can be skipped if you download and unzip the file manually.

    The code in this post was run on a Windows PC with 6 GB of RAM. The data file itself is under 1 GB.

    Installing wget

    Professional data scientists often need to download data and tools automatically, so let’s practice that here. We will use the wget program to download the data; it is available for Windows, Mac, and Linux.

    If you are using Windows, first download and install wget, taking note of the installation directory. When the installation finishes, open the Windows command prompt and add the installation directory to your path (this change lasts only for the current command prompt session):

    > set path=%PATH%;"C:\Program Files (x86)\GnuWin32\bin"

    To confirm that the command prompt can now find wget, type this line:

    > wget -h
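
The same check can be done from Python, which is handy once you are working inside a notebook. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_tool` is hypothetical and not part of the original post.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


# Report whether the command-line tools used in this post are reachable.
for tool in ("wget", "jar"):
    location = find_tool(tool)
    print(f"{tool}: {location or 'not found on PATH'}")
```

If a tool is reported as not found, the path was probably not updated in the session that launched Python.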

    Downloading the Texas inpatient dataset

    Now that you have wget working, let’s use it to download the first tab-delimited data file for Quarter 1 of 2012. At the command prompt, type the following on a single line:

    > wget --no-check-certificate --content-disposition https://www.dshs.texas.gov/thcic/hospitals/Data/PUDF_base1_1q2012_tab/
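
For readers who would rather stay in Python, the download can be sketched with the standard library alone. The function name `download_file` is hypothetical. Like wget's `--content-disposition` flag, it prefers the file name sent by the server and otherwise falls back to the last segment of the URL. Unlike the wget command above, this sketch leaves certificate verification on, so it may fail against a server whose certificate cannot be validated.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def download_file(url, dest_dir="."):
    """Download url into dest_dir and return the path of the saved file."""
    with urllib.request.urlopen(url) as response:
        # Prefer the name from the Content-Disposition header, if any;
        # otherwise use the last segment of the URL path.
        name = response.headers.get_filename()
        if not name:
            name = Path(urlsplit(url).path.rstrip("/")).name or "download"
        # Keep only the final component so the file stays inside dest_dir.
        target = Path(dest_dir) / Path(name).name
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)  # stream to disk in chunks
    return target


# Example call (not executed here), using the URL from the wget command:
# download_file("https://www.dshs.texas.gov/thcic/hospitals/Data/PUDF_base1_1q2012_tab/")
```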

    Now you must unzip the file. To do so at the command line, you must have the JDK installed and on your path, because the jar tool ships with it. Type the following:

    > jar xf PUDF_base1_1q2012_tab.zip

    If you run the dir command in the same directory, you should now see a file named “PUDF_base1_1q2012_tab.txt”.
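
If the JDK is not installed, the archive can also be extracted with Python's built-in `zipfile` module. The sketch below assumes the zip file sits in the current working directory; `unzip_archive` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def unzip_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir; return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Only attempt the extraction if the downloaded archive is actually present.
archive_path = Path("PUDF_base1_1q2012_tab.zip")
if archive_path.exists():
    print(unzip_archive(archive_path))
else:
    print(f"{archive_path} not found in {Path.cwd()}")
```

The returned list of member names plays the same role as the dir check: it shows which files were written.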

    Importing the data into your Jupyter notebook session

    After opening Jupyter and starting a new notebook, type the following:
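
A minimal sketch of this step follows, and it is not necessarily the code the post originally used. It assumes the unzipped file sits in the notebook's working directory, is tab-delimited, and starts with a header row. The `latin-1` encoding is also an assumption, chosen because it never raises a decoding error. The function name `read_pudf_rows` is hypothetical. Only the first few rows are read, which keeps memory use low on a modest machine.

```python
import csv
from itertools import islice
from pathlib import Path


def read_pudf_rows(path, limit=5, encoding="latin-1"):
    """Return the column names and the first `limit` rows of a tab-delimited file."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(islice(reader, limit))  # each row is a dict keyed by column
        return reader.fieldnames, rows


# Preview the data file if it is present in the working directory.
data_file = Path("PUDF_base1_1q2012_tab.txt")
if data_file.exists():
    columns, rows = read_pudf_rows(data_file)
    print(len(columns), "columns")
    print(rows[0])
else:
    print(f"{data_file} not found in {Path.cwd()}")
```

If pandas is installed, `pandas.read_csv(path, sep="\t")` is a common way to load the whole file into a DataFrame instead.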

    Conclusion

    In this post we have downloaded freely available inpatient clinical data and imported it into a Jupyter session.

    In the next post we will describe the dataset in more detail and use some Python commands to explore the dataset.

    References

    Texas Hospital Inpatient Discharge Public Use Data File, Q1, 2012. Texas Department of State Health Services, Austin, Texas. Date Accessed: 2/1/2019.
