HW 3

Description

Unsupervised Discretization:

Write code that takes a table column of N numbers, sorts it, and breaks it into bins of size approximately sqrt(N). Note that these breaks have to satisfy the following sanity rules:

  • no range contains too few numbers (fewer than sqrt(N));
  • each range differs from the next one by some epsilon value (0.2 * standard deviation of that column);
  • the span of the range (hi - lo) is greater than that epsilon;
  • the lo value of one range is greater than the hi value of the previous range.
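The rules above can be sketched as a single pass over the sorted column. This is a minimal illustration, not the code in ./src/: the names `unsupervised_ranges`, `stdev`, and `cohen` are hypothetical, each range is kept as a plain list of numbers, and the rule "differs from the next one by epsilon" is assumed to be covered by requiring every range to span more than epsilon.

```python
import math

def stdev(nums):
    """Sample standard deviation; 0.0 when there are fewer than two numbers."""
    n = len(nums)
    if n < 2:
        return 0.0
    mean = sum(nums) / float(n)
    return math.sqrt(sum((x - mean) ** 2 for x in nums) / (n - 1))

def unsupervised_ranges(values, cohen=0.2):
    """Split a column into ranges (lists of numbers) that obey the sanity rules."""
    nums = sorted(values)
    n = len(nums)
    if n == 0:
        return []
    enough = int(math.sqrt(n))       # minimum count per range
    epsilon = cohen * stdev(nums)    # minimum span per range
    ranges, current = [], [nums[0]]
    for i in range(1, n):
        x = nums[i]
        remaining = n - i            # numbers not yet placed, including x
        if (len(current) >= enough                    # range is not too small
                and current[-1] - current[0] > epsilon  # span exceeds epsilon
                and x > current[-1]                     # next lo > this hi
                and remaining >= enough):               # leftovers can fill a range
            ranges.append(current)
            current = [x]
        else:
            current.append(x)
    ranges.append(current)
    # If the final range is too narrow, fold it into its predecessor.
    if len(ranges) > 1 and ranges[-1][-1] - ranges[-1][0] <= epsilon:
        last = ranges.pop()
        ranges[-1].extend(last)
    return ranges
```

For example, `unsupervised_ranges(range(16))` yields four ranges of four numbers each, since sqrt(16) = 4 and every run of four consecutive integers spans more than epsilon.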

Supervised Discretization:

Write code that reflects over the ranges found by the unsupervised discretizer. Combine adjacent ranges when the dependent variable does not change across them. Specifically, sort the ranges and do a recursive descent of the ranges. At each level of the recursion, break the ranges at the point that minimizes the expected value of the standard deviation of the dependent variable.
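The recursive descent can be sketched as follows. This is an illustrative version under stated assumptions, not the code in SuperRange.py: each range is assumed to be a list of (x, y) pairs, the names `supervised_ranges` and `margin` are hypothetical, and `margin` is an added assumption that requires a cut to reduce the expected standard deviation by a small factor before it is accepted.

```python
import math

def stdev(nums):
    """Sample standard deviation; 0.0 when there are fewer than two numbers."""
    n = len(nums)
    if n < 2:
        return 0.0
    mean = sum(nums) / float(n)
    return math.sqrt(sum((v - mean) ** 2 for v in nums) / (n - 1))

def supervised_ranges(ranges, margin=1.05):
    """Merge adjacent ranges whose y values do not differ.

    `ranges` is a list of ranges sorted by x; each range is a list of
    (x, y) pairs. Returns a new list of (possibly merged) ranges.
    """
    def ys(rs):
        return [y for r in rs for (_, y) in r]

    def merge(rs):
        return [pair for r in rs for pair in r]

    def descend(rs):
        all_y = ys(rs)
        n = float(len(all_y))
        whole = stdev(all_y)
        best, cut = None, None
        # Try every boundary between ranges; keep the one with the
        # lowest expected (size-weighted) standard deviation of y.
        for i in range(1, len(rs)):
            left, right = ys(rs[:i]), ys(rs[i:])
            expected = (len(left) / n) * stdev(left) + (len(right) / n) * stdev(right)
            if best is None or expected < best:
                best, cut = expected, i
        # No useful cut: y is effectively unchanged, so combine the ranges.
        if cut is None or best * margin >= whole:
            return [merge(rs)]
        return descend(rs[:cut]) + descend(rs[cut:])

    return descend(ranges) if ranges else []
```

With four input ranges whose y values are 0 in the first two and 10 in the last two, the sketch cuts once in the middle and returns two merged ranges.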

Source Files

All source files are in the ./src/ directory.

Discretize.py - Main Class for testing
Random.py - Class for Random number generator
Range.py - Class used for Unsupervised discretization
RangeManager.py - Class for Unsupervised discretization
Sample.py - Class that samples data for ranges
SuperRange.py - Class for Supervised discretization

Setup

Code has been tested on Python 2.7.12
It uses these libraries: (os, sys, argparse, abc, numpy, math, random, copy)

Usage

To run the code:

python src/Discretize.py

The code uses some source files from HW1 and HW2; their paths are included wherever required.
To ensure that it always runs, please run it from this (/HW3/) directory, using the command above.

Input Data

The program generates random data on each run and then discretizes it.
It generates 50 data points using the Random class.
Each of those numbers is mapped to a y value, which is calculated using the klass function defined in Discretize.py. It assigns one of three values (plus a random component) based on the x value.
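As a rough illustration of this kind of data, the sketch below maps each x to one of three levels plus a little noise. The thresholds (0.2, 0.6), the levels (0.2, 0.6, 0.9), the `noise` size, and the `make_data` helper are all assumptions made for the example; the actual values live in Discretize.py, and this sketch uses Python's standard `random` module rather than the repository's Random class.

```python
import random

def klass(x, rng, noise=0.02):
    """Hypothetical three-level step function of x, plus a small random term."""
    if x < 0.2:
        base = 0.2
    elif x < 0.6:
        base = 0.6
    else:
        base = 0.9
    return base + noise * rng.random()

def make_data(n=50, seed=1):
    """Return n (x, y) pairs with x uniform in [0, 1) and y = klass(x)."""
    rng = random.Random(seed)   # seeded only so the example is repeatable
    xs = [rng.random() for _ in range(n)]
    return [(x, klass(x, rng)) for x in xs]
```

Because y is nearly constant within each of the three x intervals, a supervised discretizer run on such data would be expected to keep few ranges.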

Output

The first part of the output prints the discretized ranges created using just the x values.
The second part of the output prints the supervised discretization ranges, which are calculated using the y values as well.
There are typically fewer supervised ranges than unsupervised ranges.

References

More details about the instructions can be found at Homeworks