This repository contains code from the paper "Natural Language Generation for Electronic Health Records":
- Keras code for the NRC model.
- Training and testing scripts for the model.
- Example scripts for preprocessing EHR data to be used in the model.
- Install the necessary Python modules (listed below)
- Use `preprocessing/sparisfy.py` to convert the discrete variables in your EHRs to sparse format
- Use `preprocessing/words_to_integers.py` to convert your free text field to integers
- Train the autoencoder on the sparse records with `ae_training.py`
- Train the NRC model with `caption_training.py`
- Generate text with `caption_testing.py`
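
To give a feel for what the two preprocessing steps produce, here is a minimal, standard-library-only sketch. It is not the code in the preprocessing scripts: the toy records, the column names, and every function name below are hypothetical, and the real scripts may use different encodings and file formats.

```python
from collections import Counter

# Toy records; the column names and values are made up for illustration.
records = [
    {"age_group": "18-44", "sex": "F", "text": "pt c/o chest pain"},
    {"age_group": "65+", "sex": "M", "text": "chest pain and sob"},
    {"age_group": "18-44", "sex": "M", "text": "fall at home"},
]
discrete_columns = ["age_group", "sex"]


def build_feature_index(rows, columns):
    """Assign one column index to every (variable, value) pair seen in the data."""
    pairs = sorted({(col, row[col]) for row in rows for col in columns})
    return {pair: i for i, pair in enumerate(pairs)}


def to_sparse_rows(rows, columns, feature_index):
    """Represent each record as the sorted list of its nonzero column indices."""
    return [
        sorted(feature_index[(col, row[col])] for col in columns)
        for row in rows
    ]


def build_vocabulary(texts, min_count=1):
    """Map tokens to integers; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(token for text in texts for token in text.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    # Most frequent tokens get the smallest ids; ties are broken alphabetically.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode_text(text, vocab):
    """Replace each whitespace-separated token with its integer id."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]


feature_index = build_feature_index(records, discrete_columns)
sparse_rows = to_sparse_rows(records, discrete_columns, feature_index)
vocab = build_vocabulary(r["text"] for r in records)
encoded = [encode_text(r["text"], vocab) for r in records]
```

Each entry of `sparse_rows` lists the active one-hot columns for a record, and each entry of `encoded` is the integer sequence for its free-text field. For real data, Pandas and scikit-learn offer equivalent and faster tools.
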
- Python 3.x
- Keras with the TensorFlow backend
- Pandas, NumPy, h5py, and scikit-learn
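
Before running the scripts, you may want to confirm that the modules above are importable. The snippet below is a small convenience sketch, not part of the repository; `find_missing` is a hypothetical helper, and the list uses the usual import names for these packages (for example, `sklearn` for scikit-learn).

```python
import importlib.util

# Usual import names for the required packages (scikit-learn imports as "sklearn").
REQUIRED_MODULES = ["tensorflow", "keras", "pandas", "numpy", "h5py", "sklearn"]


def find_missing(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```
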
The default hyperparameters worked well for the data used in our paper, but they might not for yours, so feel free to experiment! Also, we recommend a GPU for training the captioning model. We used a single NVIDIA Titan X for our experiments, and training with ~2 million records took around 6 hours.