Synthetic data set generation tools for machine learning experiments. See also pycleandata.
- generate.py: the main data generation script
- output/: empty directory for generated data sets
All configuration options are currently defined and documented within generate.py (the intention is to move them to external configuration files in a future version). Once these are set, run:
$ python3 generate.py
There is also a Makefile with two targets: data, which runs generate.py as above, and clean, which removes generated data sets from output/. Use clean with care, since it deletes generated data.
Under output/, each generated data set has its own directory, named according to its configuration. For example, the fields of a data set named 2_10_1000_r_0.5_004 are, in order:
- number of clusters
- number of features
- number of samples
- cardinality (uniform or random)
- within-cluster standard deviation
- index, i.e. a counter, since multiple data sets can be generated for each configuration
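The naming convention above can be read back into its fields with a few lines of Python. This is a minimal sketch, not part of generate.py: the `DatasetConfig` type and `parse_dataset_name` function are hypothetical names, and the only cardinality code taken from the text is the `r` in the example name.

```python
from typing import NamedTuple


class DatasetConfig(NamedTuple):
    """Fields encoded in a data set directory name (hypothetical helper type)."""
    n_clusters: int
    n_features: int
    n_samples: int
    cardinality: str  # "r" in the example name; other codes are not confirmed here
    cluster_std: float
    index: int


def parse_dataset_name(name: str) -> DatasetConfig:
    """Split a name such as '2_10_1000_r_0.5_004' into its six fields."""
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(
            f"expected 6 underscore-separated fields, got {len(parts)}: {name!r}"
        )
    k, features, samples, cardinality, std, index = parts
    return DatasetConfig(
        int(k), int(features), int(samples), cardinality, float(std), int(index)
    )


config = parse_dataset_name("2_10_1000_r_0.5_004")
print(config)
```

Returning a named tuple keeps the field order identical to the order in the directory name, which makes it easy to check against the list above.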
For manageability, generated data sets are grouped into subdirectories by number of clusters, i.e. the current value when iterating over OPTS_K.
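As an illustration of this layout, the sketch below builds the path for one data set. It rests on two assumptions that generate.py may not share: the grouping subdirectory is named after the number of clusters alone, and the index is zero-padded to three digits (inferred from the example name 2_10_1000_r_0.5_004). The function name `dataset_dir` is hypothetical.

```python
from pathlib import Path


def dataset_dir(output_root, k, n_features, n_samples, cardinality, std, index):
    """Return the directory for one data set (hypothetical helper).

    Assumes the grouping subdirectory is simply str(k) and that the
    index is zero-padded to three digits, as in the example name.
    """
    name = f"{k}_{n_features}_{n_samples}_{cardinality}_{std}_{index:03d}"
    return Path(output_root) / str(k) / name


path = dataset_dir("output", 2, 10, 1000, "r", 0.5, 4)
print(path.as_posix())
```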
Each data set folder contains:
- data.csv: the data set itself
- labels.csv: the class labels of the data points
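The two files can be read back together for use in an experiment. The following is a sketch using only the Python standard library, under assumptions about the file format that the text does not confirm: both files are comma-separated with no header row, data.csv holds one sample per row, and labels.csv holds one integer label per row in the same order. The function name `load_dataset` is hypothetical.

```python
import csv
from pathlib import Path


def load_dataset(folder):
    """Return (rows, labels) read from data.csv and labels.csv in `folder`.

    Assumes comma-separated files without a header row, one sample per
    row in data.csv and one integer label per row in labels.csv.
    """
    folder = Path(folder)
    with open(folder / "data.csv", newline="") as f:
        rows = [[float(value) for value in row] for row in csv.reader(f) if row]
    with open(folder / "labels.csv", newline="") as f:
        # int(float(...)) also accepts labels written as "1.0"
        labels = [int(float(row[0])) for row in csv.reader(f) if row]
    if len(rows) != len(labels):
        raise ValueError(
            f"{len(rows)} samples but {len(labels)} labels in {folder}"
        )
    return rows, labels
```

Calling `load_dataset` on a data set folder returns the samples as a list of float lists and the labels as a list of ints. The length check guards against pairing a data file with the wrong label file.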
The contents of output/ are excluded from version control by a .gitignore file, as users are not expected to commit generated data to this project deliberately.
Requirements:
- Python 3
- scikit-learn >= 0.20
- numpy
Planned improvements:
- the ability to run from separate configuration files, e.g. YAML
- more flexible normalisation, e.g. pluggable normalisation strategies