Final Project: Learning to Diagnose using Clinical Notes

Video Link: https://youtu.be/Kv79W96W6w0

Members:

  • Jinmiao Huang
  • Cesar Osorio
  • Luke Wicent Sy

Environment Setup (local)

  1. conda env create -f environment.yml
  2. Install Spark or download a Spark binary from here
  3. pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
    • The above command should install Toree. If it fails, refer to the GitHub link.
    • Note that Toree is not included in environment.yml because installing it that way did not work reliably.
  4. jupyter toree install --user --spark_home=/spark-2.1.0-bin-hadoop2.7 --interpreters=PySpark
  5. Extract the following files to the directory "code/data":
    • DIAGNOSES_ICD.csv (from bd4h piazza)
    • NOTEEVENTS-2.csv (cleaned version of NOTEEVENTS.csv) Luke's Google Drive
    • D_ICD_DIAGNOSES.csv (from bd4h piazza)
    • model_word2vec_v2_*dim.txt Cesar's Google Drive
    • bio_nlp_vec/PubMed-shuffle-win-*.txt Download here (you will need to convert the .bin files to .txt; gensim can do this, see the sketch after this list)
    • model_doc2vec_v2_*dim_final.csv Cesar's Google Drive
  6. To run the data preprocessing, data statistics, and other notebook-based steps, start the Jupyter notebook. Don't forget to set the kernel to "Toree Pyspark".
    • jupyter notebook
  7. To run the deep learning experiments, follow the corresponding guide below.
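
A minimal sketch of the .bin-to-.txt conversion mentioned in step 5, assuming gensim 1.0 or later; the file names below are examples, use whichever .bin files you downloaded:

```python
# Convert a binary word2vec file to the plain-text format the notebooks expect.
# Paths are examples; adjust them to where you downloaded the .bin files.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    'PubMed-shuffle-win-2.bin', binary=True)
vectors.save_word2vec_format(
    'code/data/bio_nlp_vec/PubMed-shuffle-win-2.txt', binary=False)
```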

Environment Setup (azure)

  1. Set up Docker with GPU support by following this guide
  2. In the Azure portal, select the VM's firewall (in my case it appeared as "azure01-firewall" under "all resources"), then allow port 22 (SSH) and 8888 (Jupyter) for both inbound and outbound traffic.
  3. You can SSH into the VM through one of the following:
    • docker-machine ssh azure01
    • ssh docker-user@public_ip_addr
  4. Spark can be installed by following the instructions in "Environment Setup (local)", but note that this will not be as powerful as HDInsight. I recommend taking advantage of the VM's large memory by raising the Spark memory settings in conf/spark-defaults.conf under the Spark home (see the example after this list).
  5. If a Jupyter notebook is running on this VM, you can access it at http://public_ip_addr:8888/
  6. To enable the GPUs for deep learning, follow the instructions on the TensorFlow website link
    • You can check the GPUs' status with "nvidia-smi"
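
As an example of step 4, the relevant lines in conf/spark-defaults.conf look like the following; the values are placeholders, size them to your VM's RAM:

```
# conf/spark-defaults.conf -- example values only
spark.driver.memory    32g
spark.executor.memory  32g
```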

Folder Structure

  • code: all source code
    • data: place the data files mentioned in "Environment Setup (local)" here
  • report: LaTeX source for this study's paper

General Pipeline

  1. (optional) Clean NOTEEVENTS.csv using PostgreSQL: import NOTEEVENTS.csv by adapting the MIMIC-III GitHub scripts, then collapse embedded newlines with "select regexp_replace(field, E'[\n\r]+', ' ', 'g')". The cleaned version (NOTEEVENTS-2.csv) can be downloaded from the Google Drive mentioned in "Environment Setup (local)". A plain-Python sketch of the same cleanup follows this list.
  2. Run preprocess.ipynb to produce DATA_HADM and DATA_HADM_CLEANED.
  3. Run describe_icd9code.ipynb and describe_icd9category.ipynb to produce the descriptive statistics.
  4. Run feature_extraction_seq.ipynb and feature_extraction_nonseq.ipynb to produce the input features for the machine learning and deep learning classifiers.
  5. Run ml_baseline.py to get the results for Logistic Regression and Random Forest.
  6. Run nn_baseline_train.py and nn_baseline_test.py to get the results for the Feed-Forward Neural Network.
  7. Run wordseq_train.py and wordseq_test.py to get the results for Conv1D, RNN, LSTM, and GRU (refer to the scripts' help output or the guides below on training and testing the Keras deep learning models).
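
For reference, a minimal Python sketch of the newline cleanup from step 1, as an alternative to PostgreSQL; it mirrors regexp_replace(field, E'[\n\r]+', ' ', 'g') across every field:

```python
# Collapse embedded newlines in every CSV field to a single space.
# The csv module parses quoted multi-line fields correctly, which is
# exactly what this cleanup step relies on.
import csv
import re
import sys

csv.field_size_limit(sys.maxsize)  # clinical notes can be very long

with open('NOTEEVENTS.csv', newline='') as src, \
        open('NOTEEVENTS-2.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(re.sub(r'[\n\r]+', ' ', field) for field in row)
```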

Training and Testing for the Feed-Forward Neural Network

  • Prerequisite: Keras + TensorFlow, or Keras + Theano

  • Models are specified in nn_baseline_models.py

  • Run nn_baseline_preprocessing to prepare the data for training and testing.

  • Training:

    • Run training with the default arguments: python nn_baseline_train.py
    • Or run the training script with customized input arguments: python nn_baseline_train.py --epoch 10 --batch_size 128 --model_name nn_model_1 --pre_train False
    • Please refer to the parse_args() function in nn_baseline_train.py for the full list of the input arguments
  • Testing:

    • Test with the default model and data file: python tfidf_test.py
    • Please refer to the parse_args() function in nn_baseline_train.py for the full list of the input arguments (an illustrative sketch of such a parse_args() follows)
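
For orientation, a sketch of what a parse_args() with the flags used above might look like; this mirrors only the example flags shown in this section and is not the repo's actual code:

```python
# Illustrative argparse helper mirroring the flags shown above
# (--epoch, --batch_size, --model_name, --pre_train). The real parse_args()
# in nn_baseline_train.py may differ; the defaults here are assumptions.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='Train the feed-forward baseline.')
    parser.add_argument('--epoch', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--model_name', type=str, default='nn_model_1')
    parser.add_argument('--pre_train', type=str, default='False')
    return parser.parse_args()

if __name__ == '__main__':
    print(parse_args())
```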

Training and Testing for Recurrent and Convolutional Neural Networks

  • Similar to the Feed-Forward Neural Network, users can run the training and testing with the default settings in wordseq_train.py and wordseq_test.py. All the model architectures are specified in wordseq_models.py (a minimal illustration follows).
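
For orientation, a minimal sketch of the kind of sequence model wordseq_models.py defines, assuming the Keras Sequential API; every dimension below is a placeholder, not the architecture used in this study:

```python
# Illustrative word-sequence classifier in the style of the Conv1D/RNN/LSTM/GRU
# models referenced above. All sizes are placeholders.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 20000  # placeholder vocabulary size
EMBED_DIM = 100     # placeholder embedding dimension
MAX_LEN = 500       # placeholder note length in tokens
NUM_LABELS = 10     # placeholder number of ICD-9 labels

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    LSTM(128),                                # swap in GRU or Conv1D variants here
    Dense(NUM_LABELS, activation='sigmoid'),  # multi-label output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```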
