This repo hosts a variety of examples based on Apache Spark MLlib.
A vanilla decision tree example.
How to get a stratified sample so the test and train datasets are sampled across possible values.
How to index and encode categorical features.
How to handle multiple categorical and continuous features on a real-life data set. Uses the Census Income data set.
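The stratified sampling and index/encode ideas listed above can be sketched without Spark. The following is a minimal plain-Python illustration, not the code from this repo: the function names (stratified_split, build_index, one_hot) and the toy rows are assumptions made up for the example. In Spark itself, DataFrame.stat.sampleBy, StringIndexer and OneHotEncoder cover similar ground.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_of, test_fraction=0.3, seed=42):
    """Split rows into (train, test), sampling each label group separately
    so both parts see every label in roughly the original proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]              # leave the caller's data untouched
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


def build_index(values):
    """Map each category to an integer, most frequent first
    (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(value, index):
    """Encode one category as a full one-hot vector (no dropped column)."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec


# Toy rows of (workclass, income), made up for illustration only.
rows = (
    [("Private", "<=50K")] * 6
    + [("Self-emp", "<=50K")] * 2
    + [("Private", ">50K")]
    + [("Self-emp", ">50K")]
)

train, test = stratified_split(rows, label_of=lambda r: r[1],
                               test_fraction=0.5, seed=1)
workclass_index = build_index(r[0] for r in rows)
encoded = [one_hot(r[0], workclass_index) for r in train]
```

Splitting per label group, rather than sampling the whole data set at once, is what keeps a rare label from vanishing out of the test or train part.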
The first line of the adult.test file has been removed so that the file loads into Spark.
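If you start from a fresh copy of adult.test, the same cleanup can be reproduced with a few lines of Python. This is only a sketch: the helper name drop_first_line and the output file name in the usage comment are hypothetical, not part of this repo.

```python
def drop_first_line(src, dst):
    """Copy src to dst, skipping the first line of src."""
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        next(fin, None)          # discard the first line; safe on an empty file
        for line in fin:
            fout.write(line)


# Example usage (hypothetical output name):
# drop_first_line("adult.test", "adult.test.cleaned")
```

Writing to a separate file keeps the original download intact in case it is needed again.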
Census Income data set citation: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.