Choose the best algorithm by cross-validation (k-fold) accuracy. For the actual test, the relative class weights were also passed as an argument to the classifier, under the assumption that the dataset reflects the true class distribution of unseen instances.
RandomForest on the White Wine data set (90% training / 10% test):
- False Positive Rate: 0.2469
- False Negative Rate: 0.1098
- Accuracy: 0.8449
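Passing class weights to the classifier, as described above, might be sketched as follows. This is illustrative only: it uses toy data in place of the wine CSV, and `class_weight="balanced"` (weights derived from observed class frequencies) stands in for an explicit weight dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the wine data: 11 features, imbalanced binary label.
rng = np.random.RandomState(0)
X = rng.rand(500, 11)
y = (X[:, 0] + 0.3 * rng.rand(500) > 0.8).astype(int)

# 90% training / 10% test, stratified to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

# class_weight="balanced" weights classes by inverse frequency; an
# explicit {label: weight} dict reflecting the assumed distribution
# also works.
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```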
- explore.py: how the data was analyzed
- evaluate.py: how the algorithms were evaluated
- train.py: a solution implementing the results of the first two
- WEKA 3.7.1 - used as a sanity check, to rapidly iterate through ideas in a GUI
- WinPython 2.7.6 - basis of the Python files provided; includes the following packages used:
- pandas
- sklearn
- numpy
- matplotlib
- Convert to CSV: replace all ";" with ",".
- On line 2729, replace ",," with "," (an extra empty field).
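The two cleanup steps above can be sketched as a small helper (the function name and flag are illustrative, not from the original scripts):

```python
def to_csv_line(line, fix_double_comma=False):
    """Convert one ';'-separated record to CSV form.

    fix_double_comma handles the single record with an extra
    empty field by collapsing ",," to ",".
    """
    line = line.replace(";", ",")
    if fix_double_comma:
        line = line.replace(",,", ",")
    return line
```

For example, `to_csv_line("7.0;0.27;0.36")` yields `"7.0,0.27,0.36"`.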
- Identify instances with attribute values that may skew learning algorithms (explore.py) - https://onlinecourses.science.psu.edu/stat857/node/223
- Outliers: visualize the data with histograms and box plots.
- Correlation: Pearson and Spearman - values close to |1| suggest a feature should be removed - http://www3.nd.edu/~mclark19/learn/ML.pdf
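The correlation check above might look like the following sketch, using a toy frame in place of the wine attributes (column names and the injected dependence are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine attributes; in the real explore.py these
# would come from the converted CSV.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "fixed_acidity": rng.normal(7, 1, 200),
    "density": rng.normal(0.99, 0.003, 200),
})
# Deliberately near-collinear column to show what |1|-ish looks like.
df["residual_sugar"] = 50 * (1.0 - df["density"]) + rng.normal(0, 0.05, 200)

# Pearson (linear) and Spearman (rank) correlation matrices;
# entries near |1| flag redundant features for removal.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Visual outlier checks, with matplotlib available:
# df.hist(); df.boxplot()
```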
- Identify possible candidate algorithms and evaluate (evaluate.py), on each dataset variant:
  1. As-is
  2. Feature removal
  3. Outlier removal
- Record accuracy & runtime.
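The evaluation loop above might be sketched as follows. The candidate models and toy data are placeholders for whatever evaluate.py actually compares across the three dataset variants:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Toy data standing in for one dataset variant.
rng = np.random.RandomState(0)
X = rng.rand(300, 11)
y = (X[:, 0] > 0.5).astype(int)

# Record mean 10-fold cross-validation accuracy and wall-clock
# runtime for each candidate algorithm.
results = {}
for name, model in [
    ("RandomForest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("NaiveBayes", GaussianNB()),
]:
    start = time.time()
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    results[name] = (scores.mean(), time.time() - start)
```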