Application of deep learning for natural language processing of clinical notes

Final Project: Learning to Diagnose using Clinical Notes

Video Link:


  • Jinmiao Huang
  • Cesar Osorio
  • Luke Wicent Sy

Environment Setup (local)

  1. conda env create -f environment.yml
  2. Install Spark, or download the Spark binary from here
  3. pip install
    • The above command should install Toree. If it fails, refer to the GitHub link.
    • Note that Toree is not included in environment.yml because including it there did not work for me.
  4. jupyter toree install --user --spark_home=/spark-2.1.0-bin-hadoop2.7 --interpreters=PySpark
  5. Extract the following files to the directory "code/data":
    • DIAGNOSES_ICD.csv (from bd4h piazza)
    • NOTEEVENTS-2.csv (cleaned version of NOTEEVENTS.csv) Luke's Google Drive
    • D_ICD_DIAGNOSES.csv (from bd4h piazza)
    • model_word2vec_v2_*dim.txt Cesar's Google Drive
    • bio_nlp_vec/PubMed-shuffle-win-*.txt Download here (you will need to convert the .bin files to .txt; I used gensim to do this)
    • model_doc2vec_v2_*dim_final.csv Cesar's Google Drive
  6. To run the data preprocessing, the data statistics, and the other notebook (.ipynb) steps, start the Jupyter notebook. Don't forget to set the kernel to "Toree Pyspark".
    • jupyter notebook
  7. To run the deep learning experiments, follow the corresponding guide below.
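The .bin to .txt conversion mentioned in step 5 was done with gensim. As a dependency-free illustration of what that conversion involves, here is a minimal sketch that reads the word2vec C binary layout (a "vocab_size dim" header line, then for each word its bytes, a space, and dim little-endian float32 values) and writes the text layout. The function name `convert_word2vec_bin_to_txt` is hypothetical and not part of this repository, and the sketch assumes the downloaded files follow that standard layout.

```python
import struct


def convert_word2vec_bin_to_txt(bin_path, txt_path):
    """Convert a word2vec C-format binary file to the plain-text format.

    Illustrative helper, not part of this repository. Returns (vocab_size, dim).
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        # Header line: "<vocab_size> <dim>\n"
        vocab_size, dim = (int(x) for x in src.readline().split())
        dst.write(f"{vocab_size} {dim}\n")
        for _ in range(vocab_size):
            # Read the word byte by byte until the separating space.
            chars = bytearray()
            while True:
                ch = src.read(1)
                if ch == b" ":
                    break
                if ch == b"":
                    raise EOFError("unexpected end of file while reading a word")
                if ch != b"\n":  # some writers emit a newline after each vector
                    chars.extend(ch)
            # The vector is stored as dim little-endian 32-bit floats.
            vector = struct.unpack(f"<{dim}f", src.read(4 * dim))
            word = chars.decode("utf-8", errors="ignore")
            dst.write(word + " " + " ".join(f"{v:.6f}" for v in vector) + "\n")
    return vocab_size, dim
```

With gensim installed, loading the file through `KeyedVectors.load_word2vec_format(..., binary=True)` and writing it back with `save_word2vec_format(..., binary=False)` should achieve the same result with less code.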

Environment Setup (azure)

  1. Setup Docker w/ GPU following this guide
  2. Using Azure's portal, select the VM's firewall (in my case, it showed as "azure01-firewall" under "all resources"), then allow ports 22 (ssh) and 8888 (jupyter) for both inbound and outbound traffic.
  3. You can ssh into the VM in one of the following ways:
    • docker-machine ssh azure01
    • ssh docker-user@public_ip_addr
  4. Spark can be installed by following the instructions in "Environment Setup (local)", but note that this will not be as powerful as HDInsight. I recommend taking advantage of the VM's large memory by setting the Spark memory to a higher value (/conf/spark-defaults.conf)
  5. If you have a Jupyter notebook running in this VM, you can access it at http://public_ip_addr:8888/
  6. To enable the GPUs for deep learning, follow the instructions on the TensorFlow website link
    • you can check the GPUs' status by "nvidia-smi"
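The memory recommendation in step 4 can be sketched as follows. This small helper only renders lines in the "key value" layout that spark-defaults.conf uses. `spark.driver.memory` and `spark.executor.memory` are standard Spark properties, but the helper name `render_spark_defaults` and the "16g" value are assumptions for illustration; pick a value that fits the VM you provisioned.

```python
def render_spark_defaults(memory="8g"):
    """Return spark-defaults.conf lines that raise Spark's memory limits.

    Illustrative helper, not part of this repository; the default value
    is a placeholder.
    """
    settings = {
        "spark.driver.memory": memory,    # in local mode this is usually the one that matters
        "spark.executor.memory": memory,
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Print the lines so they can be pasted into conf/spark-defaults.conf.
    print(render_spark_defaults("16g"), end="")
```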

Folder Structure

  • code: all source code
    • data: you should download and extract the data mentioned in "Environment Setup (local)"
  • report: LaTeX source for this study's paper

General Pipeline

  1. (optional) clean NOTEEVENTS.csv using PostgreSQL. NOTEEVENTS.csv was imported by modifying the MIMIC-III GitHub code and cleaned with the command "select regexp_replace(field, E'[\n\r]+', ' ', 'g' )". the cleaned version (NOTEEVENTS-2.csv) can be downloaded from the Google Drive mentioned in "Environment Setup (local)"
  2. run preprocess.ipynb to produce DATA_HADM and DATA_HADM_CLEANED.
  3. run describe_icd9code.ipynb and describe_icd9category.ipynb to produce the descriptive statistics.
  4. run feature_extraction_seq.ipynb and feature_extraction_nonseq.ipynb to produce the input features for the machine learning and deep learning classifiers.
  5. run to get the results for Logistic Regression and Random Forest.
  6. run and to get the results for Feed-Forward Neural Network.
  7. run and to get the results for Conv1D, RNN, LSTM and GRU (refer to 'help' or the guide below on training and testing for Keras Deep Learning Models)
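The optional cleaning in step 1 replaces every run of newline and carriage-return characters with a single space. For readers without a PostgreSQL setup, a minimal Python sketch of the same substitution is shown here, assuming the note text is already loaded as a string. The function name `collapse_line_breaks` is hypothetical and not part of this repository.

```python
import re

# Same character class as the PostgreSQL pattern E'[\n\r]+'.
_LINE_BREAKS = re.compile(r"[\n\r]+")


def collapse_line_breaks(text):
    """Replace each run of newlines/carriage returns with one space."""
    return _LINE_BREAKS.sub(" ", text)


if __name__ == "__main__":
    note = "Admission Date:\r\n\r\nChief Complaint:\nchest pain"
    print(collapse_line_breaks(note))
```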

Training and Testing for Feed Forward Neural Network

  • Prerequisites: Keras + TensorFlow, or Keras + Theano

  • models are specified in

  • run nn_baseline_preprocessing to prepare the data for training and testing.

  • Training:

    • You can run training with default arguments: python
    • Or run training script with customized input arguments: python --epoch 10 --batch_size 128 --model_name nn_model_1 --pre_train False
    • Please refer to parse_args() function in for the full list of the input arguments
  • Testing:

    • Test model with default model and data file: python
    • Please refer to parse_args() function in for the full list of the input arguments
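To make the command-line flags above concrete, here is a minimal sketch of how a parse_args() function could accept them with argparse. It is not the repository's actual implementation: only the four flag names come from the example command, while the defaults and the `str2bool` helper are assumptions. The helper is there because argparse's `type=bool` would treat the string "False" as true.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings (hypothetical helper)."""
    text = str(value).strip().lower()
    if text in ("true", "t", "yes", "1"):
        return True
    if text in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    """Illustrative parser; the defaults below are assumed, not taken from the repo."""
    parser = argparse.ArgumentParser(description="Train a feed-forward model (sketch)")
    parser.add_argument("--epoch", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--model_name", type=str, default="nn_model_1")
    parser.add_argument("--pre_train", type=str2bool, default=False)
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args())
```

Passing `argv=None` makes argparse read the real command line, while passing an explicit list keeps the function easy to exercise from a notebook.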

Training and Testing for Recurrent and Convolution Neural Network

  • Similar to Feed Forward Neural Network, users can run the training and testing with the default settings in and All the model architectures are specified in

