This repository holds code for automatic regression modelling, specifically ordinal logistic regression.
The aim is to create explanatory models with significant variables as defined by the p-values and confidence intervals of their regression coefficients.
This algorithm was presented at the 36th Congress of the European Society for Radiotherapy and Oncology.
Citation: Christophides D, Appelt AL, Lilley J, Sebag-Montefiore D. PO-0853: A method for automatic selection of parameters in NTCP modelling. Radiother. Oncol. 2017;123:S463-S464. doi:10.1016/S0167-8140(17)31290-2
To run the code you will need both Python and R installed.
The code was developed using Python 2.7.13 (64bit) with the following modules installed:
pandas 0.19.2
scikit-learn 0.17.1
numpy 1.11.3
rpy2 2.7.8
matplotlib 2.0.0
tqdm 4.11.2
statsmodels 0.6.1
scipy 0.18.1
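To compare an existing environment against the versions listed above, a short check along the following lines can help. This is only a sketch: the function name `check_versions` is hypothetical and not part of this repository, and it assumes each module exposes a `__version__` attribute (scikit-learn is imported under the name `sklearn`).

```python
import importlib

# Versions listed in this README, keyed by import name
# (scikit-learn is imported as "sklearn").
EXPECTED = {
    "pandas": "0.19.2",
    "sklearn": "0.17.1",
    "numpy": "1.11.3",
    "rpy2": "2.7.8",
    "matplotlib": "2.0.0",
    "tqdm": "4.11.2",
    "statsmodels": "0.6.1",
    "scipy": "0.18.1",
}


def check_versions(expected=EXPECTED):
    """Return {module: (expected, installed)} for each mismatch.

    `installed` is None when the module cannot be imported or does not
    expose a __version__ attribute.
    """
    problems = {}
    for name, wanted in sorted(expected.items()):
        try:
            module = importlib.import_module(name)
            found = getattr(module, "__version__", None)
        except Exception:
            # Treat any import failure as "not installed".
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems
```

Calling `check_versions()` returns an empty dictionary when every module matches the list above. Newer versions may well work, but they were not the ones used during development.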
R is also required:
R 3.3.1 was used in this project,
with the 'VGAM' and 'HandTill2001' packages installed
I recommend installing rpy2 from
Select the rpy2-2.7.8-cp27-none-win_amd64.whl file to install.
You will also need to add the following environment variables to your user account.
For example (change the values to match your configuration):
Variable: R_HOME Values: C:\Program Files\R\R-3.3.1
Variable: R_USER Values: C:\Users\username\Anaconda2\Lib\site-packages\rpy2
If you do not have admin access to your PC, type "Edit environment variables for your account" into the Windows Start menu. This lets you change the variables for your user account without admin privileges.
Alternatively, you can set the paths temporarily from within Python:
import os
os.environ['R_HOME'] = 'C:/xxxxxxxx/R/R-3.2.2'
os.environ['R_USER'] = 'C:/xxxxxxxxxxxxxx/Anaconda2/Lib/site-packages/rpy2'
You might need to place these lines at the top of every .py file that calls rpy2, before rpy2 is imported.
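To avoid repeating the assignments in every file, they can be wrapped in a small helper. The sketch below is an illustration only: `configure_r_paths` is a hypothetical name that does not exist in this repository, the paths are the placeholder values from the example above, and it assumes rpy2 reads these variables when it is first imported.

```python
import os


def configure_r_paths(r_home, r_user, environ=os.environ):
    """Set R_HOME and R_USER unless they are already defined.

    Existing values are kept, so a system-wide configuration takes
    precedence over the fallback paths passed in here.
    """
    environ.setdefault("R_HOME", r_home)
    environ.setdefault("R_USER", r_user)
    return environ["R_HOME"], environ["R_USER"]


# Placeholder paths; change them to match your configuration.
configure_r_paths(
    "C:/Program Files/R/R-3.3.1",
    "C:/Users/username/Anaconda2/Lib/site-packages/rpy2",
)

# Import rpy2 only after the paths have been set, for example:
# import rpy2.robjects as robjects
```

Using `setdefault` means the helper does nothing when the variables are already set through the Windows dialog, so the same script works with either approach.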
Flowchart of the algorithm with some default values:
The main routine is run as follows:
best_model, series_models, series_vars = MainAlgorithm.main_ga(
    df=data, trgt=target,
    n_boot=N_GA_BOOT, n_pop=N_GA_POP,
    ratio_retain=RETAIN, ratio_mut=MUTATE,
    min_metric=OPT_METRIC,
    min_gener=MIN_GEN, tol_gen=TOLE_N_GEN, max_gener=MAX_GEN,
    mult_thrd=MULTI_THREAD, vif_value=VIF_VALUE)
The outputs are:
best_model: a Python list with the best model parameters found by the algorithm
series_models: a pandas Series with all the models generated, ordered by percentage of selection
series_vars: a pandas Series with all the variables selected by any of the models, ordered by percentage of selection
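To make "percentage of selection" concrete, the sketch below tallies how often each variable appears across a set of models. It is a simplified illustration using only the standard library, not the repository's implementation: the function `selection_percentages` and the toy models are hypothetical, and it assumes the percentage is the share of models that contain a given variable.

```python
from collections import Counter


def selection_percentages(models):
    """Return [(variable, percent)] sorted from most to least selected.

    `models` is a list of models, each given as a list of variable names.
    The percentage is the share of models that contain the variable.
    Ties are broken alphabetically.
    """
    # Count each variable at most once per model.
    counts = Counter(var for model in models for var in set(model))
    total = float(len(models))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(var, 100.0 * n / total) for var, n in ranked]


# Hypothetical models; the variable names are illustrative only.
toy_models = [
    ["alcohol", "sulphates"],
    ["alcohol", "pH"],
    ["alcohol", "sulphates", "pH"],
    ["alcohol"],
]
print(selection_percentages(toy_models))
# [('alcohol', 100.0), ('pH', 50.0), ('sulphates', 50.0)]
```

A variable that appears in nearly every model is a more robust candidate than one that is selected only occasionally, which is why the ranked output is useful alongside the single best model.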
It is recommended to go through the code and its comments to understand how the algorithm is implemented.
The ordinal response dataset on wine quality was downloaded from and tested for the example shown in ''
As you can see, the automatically generated model contains only significant variables (p-value < 0.05).
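For readers unfamiliar with how a coefficient is judged significant, the sketch below applies a two-sided Wald test to a single coefficient and its standard error. This is an assumption made for illustration: the function names are hypothetical, and the p-values and confidence intervals used by the code in this repository may be computed differently.

```python
import math


def wald_summary(coef, std_err, z_crit=1.959964):
    """Two-sided Wald p-value and 95% confidence interval for a coefficient."""
    z = coef / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (coef - z_crit * std_err, coef + z_crit * std_err)
    return p_value, ci


def is_significant(coef, std_err, alpha=0.05):
    """True if p < alpha and the 95% confidence interval excludes zero."""
    p_value, (low, high) = wald_summary(coef, std_err)
    return p_value < alpha and not (low <= 0.0 <= high)


# Hypothetical coefficient of 1.0 with a standard error of 0.5 (z = 2).
p, ci = wald_summary(1.0, 0.5)
print(round(p, 4), is_significant(1.0, 0.5))
```

With the default settings the two criteria agree, because a 95% interval excludes zero exactly when the two-sided p-value is below 0.05.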
A list showing how often each variable was selected during the algorithm run, as a percentage, was also produced:
The models are ranked by percentage of selection in the same way:
Contact Damianos Christophides at