Create a virtualenv with support for globally installed Python packages, then install Python requirements:
virtualenv v
. v/bin/activate
./compile-pip
Next, install our fork of opencv with Python support:
git submodule update --init
./build_opencv.sh
(Be sure to run this with the virtualenv activated.)
Now you're set up and ready to collect samples, train, and test models!
To go from sensor data collected by the Motion app to a built model, just run the all command with the appropriate flags:
$ python pipeline.py all -p android --no-split -f
$ python pipeline.py all -p ios --no-split -f
Models are then present in ./models/android and ./models/ios.
You can also do each step individually; for example, to build a new Android model:
$ python pipeline.py updateSamples -p android --no-split -f
$ python pipeline.py updateFeatures -p android --no-split -f
$ python pipeline.py train -p android --no-split -f
Each step is explained in more detail below.
Data obtained from the "Motion" app is called "classification data" and is stored in JSON files in the data directory. For dumb reasons, they're called data/classification_data.<integer>.jsonl:
$ ls -l data/classification_data.*.jsonl
-rw-rw-r-- 1 evan evan 221565 Mar 15 13:41 data/classification_data.14.jsonl
-rw-rw-r-- 1 evan evan 1053042 Mar 15 13:41 data/classification_data.15.jsonl
-rw-rw-r-- 1 evan evan 4846198 Mar 15 13:41 data/classification_data.16.jsonl
...
These are raw traces produced by the Motion app. The updateSamples command processes these files, producing for each a derivative called, for example, [...].16.jsonl.ciSamples.pickle. This file contains an array of ContinuousIntervalSample objects. Each object contains:
- a list of accelerometer readings at sufficient frequency with no gaps,
- some speed data aligned to the same time axis, and
- some metadata about the sample.
Each sample object is limited to a maximum length of MAX_INTERVAL_LENGTH, 60 seconds at this writing.
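As a rough illustration of the constraints above (no gaps, bounded length), the sketch below splits timestamped readings into gap-free runs capped at a maximum duration. The function name, the (timestamp, value) tuple layout, and the max_gap threshold are assumptions made for illustration; only the 60-second cap comes from the text, and the real updateSamples logic may differ.

```python
MAX_INTERVAL_LENGTH = 60.0  # seconds, per the text above


def split_into_intervals(readings, max_gap=0.1, max_length=MAX_INTERVAL_LENGTH):
    """Split (timestamp, value) readings into gap-free runs of bounded duration.

    `max_gap` (seconds) is a hypothetical threshold for what counts as a gap.
    """
    intervals = []
    current = []
    for reading in readings:
        t = reading[0]
        if current:
            gap = t - current[-1][0]
            duration = t - current[0][0]
            # Start a new interval on a gap or when the cap would be exceeded.
            if gap > max_gap or duration > max_length:
                intervals.append(current)
                current = []
        current.append(reading)
    if current:
        intervals.append(current)
    return intervals
```

Each returned run plays the role of one sample's accelerometer list in this sketch; speed data and metadata are left out.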
The default behavior of updateSamples is to only process new data (based on file modification times). If the -f option ("force") is specified, each input file is processed without exception.
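The modification-time check can be pictured roughly as follows. This is a minimal sketch, not the pipeline's actual code: the helper name needs_update and its signature are hypothetical, and it assumes a derivative is stale when it is missing or older than its source.

```python
import os


def needs_update(source_path, derived_path, force=False):
    """Return True if the derived file should be (re)built from the source."""
    if force:
        # Equivalent in spirit to passing -f: always reprocess.
        return True
    if not os.path.exists(derived_path):
        return True
    # Rebuild when the source was modified after the derivative was written.
    return os.path.getmtime(source_path) > os.path.getmtime(derived_path)
```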
In the updateFeatures command, each ContinuousIntervalSample is transformed to a LabeledFeatureSet. We use a rolling window and the feature creation function prepareFeaturesFromSignal provided by our C++ library.
The feature creation function varies slightly between platforms (due to alternate FFT implementations), so this step needs to be run once for each platform.
The resulting object is saved to a location called, for example, [...].ciSamples.pickle.fsets.android.pickle.
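To make the rolling-window idea concrete, here is a small pure-Python sketch. The step size, the (label, features) pair layout, and the placeholder feature function are all assumptions; in the real pipeline the features come from prepareFeaturesFromSignal in the C++ library, which is not reproduced here.

```python
def rolling_windows(signal, window_size, step):
    """Yield successive fixed-size windows over a sequence."""
    for start in range(0, len(signal) - window_size + 1, step):
        yield signal[start:start + window_size]


def placeholder_features(window):
    """Stand-in for the C++ feature function: mean and peak-to-peak range."""
    return [sum(window) / len(window), max(window) - min(window)]


def features_for_sample(signal, label, window_size=64, step=32):
    """Build one (label, features) pair per window of the signal."""
    return [(label, placeholder_features(w))
            for w in rolling_windows(signal, window_size, step)]
```

A signal shorter than one window yields no feature sets in this sketch.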
The default behavior of updateFeatures is to only process new data (based on file modification times). If the -f option ("force") is specified, each input file is processed without exception.
Two parameters are used for feature computation:
- --sample-count: Number of samples required; must be a power of 2 for FFT features. Default: 64
- --sampling-rate-hz: Frequency of samples, in samples per second. Default: 21
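Together, these two values fix how much time one feature window covers. The helpers below are illustrative only (their names are not from the pipeline); they check the power-of-2 requirement and compute the window duration, which with the defaults is 64 / 21, or roughly 3.05 seconds.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the single-set-bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def window_duration_seconds(sample_count=64, sampling_rate_hz=21):
    """Seconds of signal covered by one feature window."""
    if not is_power_of_two(sample_count):
        raise ValueError("sample_count must be a power of 2 for FFT features")
    return sample_count / sampling_rate_hz
```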
If a config file is specified with -c <config.json>, values from that config file will be used instead of the command-line options.
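One plausible way to get that precedence is to parse the command line first and then overlay the config values. The sketch below assumes the config is a flat JSON object whose keys match argparse's destination names (for example sample_count); the real config schema is not shown in this document, so treat the key names and the load_settings helper as hypothetical.

```python
import argparse
import json


def load_settings(argv, config_text=None):
    """Parse CLI options, then let config-file values override them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--sample-count", type=int, default=64)
    parser.add_argument("--sampling-rate-hz", type=int, default=21)
    settings = vars(parser.parse_args(argv))
    if config_text is not None:
        # Values from the config file replace the command-line values.
        settings.update(json.loads(config_text))
    return settings
```

For example, load_settings(["--sample-count", "128"], '{"sample_count": 32}') ends up with a sample_count of 32, because the config wins.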
After modifying C++ code, you just need to rebuild the shared library and then re-run updateFeatures -f:
make
python pipeline.py updateFeatures -f
The training step uses labeled features to train a random forest.
Destination: Set the output directory and set default configuration parameters by specifying --config <.../config.json>. Model build
Splitting: If you are testing a forest, you can split the feature sets into train & test chunks. This is the default behavior; the splitting is random by default but you can control the split with the -s --seed option. To disable splitting for a production model, use --no-split.
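A seeded split can be sketched as below. The 20% test fraction and the function name are assumptions (the document does not state the actual ratio); the point is only that a fixed seed, which may be any string, reproduces the same split.

```python
import random


def split_train_test(feature_sets, test_fraction=0.2, seed=None):
    """Shuffle with a seeded RNG and cut into (train, test) lists."""
    # A string seed works; None seeds from system entropy (non-reproducible).
    rng = random.Random(seed)
    shuffled = list(feature_sets)
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]
```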
Using a subset of data: You can exclude certain types of motion by supplying a comma-separated list of labels to the --exclude-labels option. For example, use --exclude-labels 1,9 to exclude running and my dumb tram data. The default excludes label 9.
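Label exclusion amounts to parsing the comma-separated list and filtering on it. In this sketch the helper names and the (label, features) pair layout are hypothetical, and labels are assumed to be integers as in the 1,9 example.

```python
def parse_labels(value):
    """Parse a comma-separated label list such as "1,9" into a set of ints."""
    return {int(part) for part in value.split(",") if part.strip()}


def filter_by_label(feature_sets, excluded):
    """Keep only (label, features) pairs whose label is not excluded."""
    return [fs for fs in feature_sets if fs[0] not in excluded]
```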
Don't use TSD samples: TSD samples are promising but can produce bad models. I recommend specifying --exclude-crowd-data until we get better at pre-processing TSDs.
Fast iteration: Sometimes you want the training to finish faster at the expense of prediction accuracy. Use --sample-fraction to specify a value between 0 and 1 to train on a random subset of the feature sets. Use -s --seed to keep the subset the same between runs, if that matters to you.
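Training on a random fraction can be sketched like this. The function name and the choice to keep at least one item are assumptions, not the pipeline's documented behavior; the seed makes the chosen subset repeatable.

```python
import random


def sample_subset(feature_sets, fraction, seed=None):
    """Pick a random subset holding roughly `fraction` of the feature sets."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    items = list(feature_sets)
    if not items:
        return []
    count = max(1, int(len(items) * fraction))
    # Sampling without replacement; the same seed gives the same subset.
    return random.Random(seed).sample(items, count)
```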
- --train-sample-count-multiple: Minimum number of samples in a decision node, expressed as a fraction of total available samples. Default: 0.0005
- --train-active-var-count: Number of variables that can be used for a decision node. The value 0 means "square root of the number of features." Default: 0
- --train-max-tree-count: Maximum number of trees in the final model. The primary termination criterion. Default: 10
- --train-epsilon: Another termination criterion. Default: 0.0001
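To show how the first two parameters translate into concrete numbers, here is a hedged sketch. The helper names are hypothetical, and the rounding choices (round for the sample count, floor for the square root, a floor of 1 for both) are assumptions; the training library may resolve these values differently.

```python
import math


def effective_min_sample_count(total_samples, multiple=0.0005):
    """Minimum samples per decision node, from a fraction of the total."""
    return max(1, round(total_samples * multiple))


def effective_active_var_count(feature_count, active_var_count=0):
    """Variables considered per node; 0 means sqrt(number of features)."""
    if active_var_count == 0:
        return max(1, int(math.sqrt(feature_count)))
    return active_var_count
```

With 200,000 samples and the default multiple, each decision node would need at least 100 samples in this sketch.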
- -p --platform: Name of target platform, ios or android. Affects updateFeatures, train, and test.
- -f --force: Force updating derivatives.
- --exclude-labels <1,2,3>: Exclude specified classes of motion from training. Used only in the train command.
- --no-split: Disable the default train/test split.
- -c --config: Load configuration parameters from the specified file. Overrides command-line parameters.
- -s --seed <value>: Seed for the random number generator used in splitting or in --sample-fraction. Any string value is accepted.
- --sample-fraction: Train on a fraction of available samples.
- --sample-count: Number of readings used in feature creation. Affects updateFeatures and test.
- --sampling-rate-hz: Frequency of accelerometer readings. Training data will be resampled to this frequency.
- --use-threads: Parallelize with threads. Not generally recommended; the default uses processes, which is faster for most things.
- -P --production: Build to the production model location.
- --train-sample-count-multiple: See train.
- --train-active-var-count: See train.
- --train-max-tree-count: See train.
- --train-epsilon: See train.
- --exclude-crowd-data: Disable TSDs during training. Recommended.