Allie

Allie is a framework for building machine learning models from audio, text, image, video, or .CSV files.

Here are some things that Allie can do:

annotate audio, text, image, or video files (via default annotation scripts)
featurize files and export data in .CSV format (via audio, text, image, video, or csv featurizers)
transform features (via scikit-learn preprocessing techniques)
create visualizations from featurized datasets (via yellowbrick, scikit-learn, and matplotlib libraries)
train machine learning models (via tpot, hyperopt, scsr, devol, keras, ludwig, and 15 other training scripts)
make predictions from machine learning models (with all models trained in ./models directory)
prepare compressed machine learning models for deployment (including repositories with readmes)

You can read more about Allie in the wiki documentation.

active things to finish before a live launch [ongoing list]

ongoing

add error handling into all of Allie's featurizations + error array into feature array itself ("error" form of column on features)
solve bug relating to regression problems in the visualize.py script (this does not work for regression)
solve regression problem loading machine learning models and making predictions (from spreadsheets)
add in notion of 'saving' datasets in the ./datasets directory in d3m format, .JSON format, and .CSV file format (and upload these into the cloud on S3 or an FTP server)
create time_split type of setting (for audio and video files) in annotation
create live version of annotation script for audio, text, image, and video files (and add-in default_audio_annotators, default_text_annotators, default_image_annotators, and/or default_video_annotators into settings.json)
clean up datasets folder --> cleaning dir / augmentation dir (these can change to main directory tree), change labeling directory to annotation in main directory
create single-file annotation mode (instead of folders)
create single-file prediction mode (instead of folders)
create single-file featurization mode (instead of folders)
add single-file cleaning mode (instead of folders)
add single-file augmentation mode (instead of folders)
tie new datasets with SurveyLex product / CLI interface with downloads
{class: {value: value}} prediction / only allow for csv files for training (get regression model prediction working)
add in default_augmenters / get live into Allie
add in default_cleaners / get live into Allie
create docker containers for production for any arbitrary data type / specify to AWS, GCP, or Azure deployment (in marketplaces) / Flask with Auth0 integration for custom APIs (submit file --> get back model results)
make sure Allie passes all tests on linux, etc. / contextualize tests around default settings
add new test cases into Allie / make tests work with new framework
enhance visualizers with audio (RMS power/25 samples), text (freqdist plot), image, video, and csv-specific analyses
documentation of the repository / jupyter notebooks with examples in research paper
Create nice CLI interface for all of Allie's functionality using OptionParser()
fix bugs associated with model loading with different model types (may require different configurations)
add in statsmodels and MLpy dimensionality reduction techniques and modeling techniques
add in augmentation policies into visualizer to show which augmentation methods work to increase AUC / MSE
add in cleaning policies into visualizer to show which cleaning methods work to increase AUC / MSE
add in both cleaning and augmentation policies (in combinatoric fashion) to show which combinations work best for AUC / MSE
use combinatoric policies to select optimal model from configurations (clean, augmentation, preprocessing techniques, etc.); train_combinatorics.py (new script idea)
add in new ASR: https://github.com/rolczynski/Automatic-Speech-Recognition

recently completed (version 1.0.0 release)

added audio_features/loudness_features.py using pyloudnorm (in dB)
cleaned up audio_features/sa_feature array to be a simpler # of lines (and made a fixed length-array)
fixed bug in loading AutoGluon models for making predictions with the load.py script in the ./models/ directory (and loading model_type variable generally)
add in ['zscore','isolationforest'] to remove outliers (https://stackoverflow.com/questions/51390196/how-to-calculate-cooks-distance-dffits-using-python-statsmodel) - remove_outliers == True / False.
added a sample validation script in the ./models directory to quickly assess how well machine learning models generalize to new datasets
added Figlet for cool text renderings / messages when loading modeling scripts (http://www.figlet.org/)
bug fix - minor bug fix in visualize.py script; fixed loading broken .JSON files during featurization (broke the visualization script during model training)
bug fix - edited transforms such that they are named by the common name and not by all the classes trained, as if you have >30 classes this will cause the transform to fail at saving / loading
added option in modeling script to create csv files (if create_csv == True, then creates .CSV files during training) - note the reason for this is for very large files it can take a long time to create them, so model training sessions can be sped up by setting create_csv == False.
added annotate.py script to annotate files (beta version) - need to add to .JSON schema (in labels, for regression)
come up with the ability to train regression models by a class and value
add in single model prediction mode in ./load.py script (-audio (sampletype) -c_autokeras (folder) -directory)
add in all model loaders from the model trainers
fixed cvopt and autokaggle training script bugs
added in the ability to quickly visualize ML models trained in a spreadsheet with the model2csv.py script
bug fix - minor bug fixes associated with transcription during featurization for audio, image, video, and .CSV files
add notion of "tabular" data instead of .CSV to tie to audio, video, and image data (e.g. for loading datasets) - as laid out in the d3m-schema - did this in the featurize_csv script where .CSV files can contain audio, text, image, video, numerical, and categorical data.
test and validate model compression works for all training scripts / can load compressed models and make predictions (w/ production)
finish up model trainers and clean them up with standard metrics for accuracy
add in version to Allie (to assess deprecation issues into the future)
add in deepspeech functionality to transcription for open source (and other open source audio transcribers)
add in transcriber settings as a list ['pocketsphinx', 'deepspeech', 'google', 'aws'], etc.
added in transcribers as lists (can be adapted into future)
created a version 2 trainer for machine learning models (as part of Allie release 1.0.0)

getting started

MacOS

First, clone the repository:

git clone --recurse-submodules -j8 git@github.com:jim-schwoebel/allie.git
cd allie

Set up virtual environment (to ensure consistent operating mode across operating systems).

python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate

Now install required dependencies and perform unit tests to make sure everything works:

python3 setup.py

Now you can run some unit tests:

cd tests
python3 test.py

Note that the unit tests above take roughly 30-40 minutes to complete. They make sure that you can featurize, model, and load model files (to make predictions) via your default featurizers and modeling techniques. It may be best to go grab lunch or coffee while waiting. :-)

Windows 10

recommended installation (Docker)

You can run Allie in a Docker container fairly easily (10-11GB container run on top of Linux/Ubuntu):

git clone --recurse-submodules -j8 git@github.com:jim-schwoebel/allie.git
cd allie 
docker build -t allie_image .
docker run -it --entrypoint=/bin/bash allie_image
cd ..

You will then have access to the Docker container and can use Allie's folder structure. You can then run the tests:

cd tests
python3 test.py

Note you can quickly download datasets from AWS buckets and train machine learning models from there.

alternative

Note that many Python libraries are incompatible with Windows, so I encourage you to instead run Allie in a Docker container with Ubuntu or on Windows Subsystem for Linux.

If you still want to try to use Allie with Windows, you can do so below.

First, install various dependencies:

Download Microsoft Visual C++ (https://www.visualstudio.com/thank-you-downloading-visual-studio/?sku=BuildTools&rel=15).
Download SWIG, compile it locally, and add it to your PATH environment variable (http://www.swig.org/download.html).
Follow instructions to setup Tensorflow on Windows.

Now clone Allie and run the setup.py script:

git clone --recurse-submodules -j8 git@github.com:jim-schwoebel/allie.git
cd allie
git checkout windows
python3 -m pip install --user virtualenv
python3 -m venv env
env\Scripts\activate
python3 setup.py

Note that there are some functions that are limited (e.g. featurization / modeling scripts) due to lack of Windows compatibility.

Linux

Here are the instructions for setting up Allie on Linux:

git clone --recurse-submodules -j8 git@github.com:jim-schwoebel/allie.git
cd allie
git checkout linux
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
python3 setup.py

Now you can run some unit tests:

cd tests
python3 test.py

folder structures

Here is a table that describes the folder structure for this repository. These descriptions could help guide how you can quickly get started with featurizing and modeling data samples.

folder name	description of folder
datasets	an elaborate list of open source datasets that can be used for curating, annotating, cleaning, and augmenting datasets.
features	a list of audio, text, image, video, and csv featurization scripts (defaults can be specified in the settings.json files).
load_dir	a directory where you can put in audio, text, image, video, or .CSV files and make model predictions from ./models directory.
models	for loading/storing machine learning models and making model predictions for files put in the load_dir.
production	a folder for outputting production-ready repositories via the YAML.py script.
tests	for running local tests and making sure everything works as expected.
train_dir	a directory where you can put in audio, text, image, video, or .CSV files in folders and train machine learning models from the model.py script in the ./training/ directory.
training	for training machine learning models via specified model training scripts.
visualize	for visualizing and selecting features as part of the model creation process.
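
As a rough illustration of how train_dir might be laid out, the sketch below assumes that each subfolder of train_dir holds the samples for one class. The helper `count_samples` and the folder and file names are hypothetical and are not part of Allie; the sketch only counts the files per class folder, which can be a useful sanity check before training.

```python
import os
import tempfile


def count_samples(train_dir):
    """Map each class folder in `train_dir` to the number of files it contains."""
    counts = {}
    for name in sorted(os.listdir(train_dir)):
        folder = os.path.join(train_dir, name)
        if os.path.isdir(folder):
            counts[name] = len([f for f in os.listdir(folder)
                                if os.path.isfile(os.path.join(folder, f))])
    return counts


# Build a throwaway train_dir with two hypothetical classes.
train_dir = tempfile.mkdtemp()
for class_name, filenames in {'males': ['a.wav', 'b.wav'], 'females': ['c.wav']}.items():
    os.makedirs(os.path.join(train_dir, class_name))
    for filename in filenames:
        open(os.path.join(train_dir, class_name, filename), 'w').close()

counts = count_samples(train_dir)
print(counts)
```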

standard feature array

After much trial and error, this standard feature array schema seemed the most appropriate for defining data samples (audio, text, image, video, or CSV samples):

def make_features(sampletype):

	# only add labels when we have actual labels.
	features={'audio':dict(),
		  'text': dict(),
		  'image':dict(),
		  'video':dict(),
		  'csv': dict()}

	transcripts={'audio': dict(),
		     'text': dict(),
		     'image': dict(),
		     'video': dict(),
		     'csv': dict()}

	models={'audio': dict(),
		 'text': dict(),
		 'image': dict(),
		 'video': dict(),
		 'csv': dict()}

	data={'sampletype': sampletype,
	      'transcripts': transcripts,
	      'features': features,
	      'models': models,
	      'labels': [],
	      'errors': []}

	return data

There are many advantages to this schema, including:

sampletype definition flexibility - flexible to 'audio' (.WAV / .MP3), 'text' (.TXT / .PPT / .DOCX), 'image' (.PNG / .JPG), 'video' (.MP4), and 'csv' (.CSV). This format can also adapt to new sample types in the future, which can in turn tie to new featurization scripts. Defining a sample type helps guide how data flows through model training and prediction scripts.
transcript definition flexibility - transcripts can be audio, text, image, video, and csv transcripts. The image and video transcripts use OCR to characterize text in the image, whereas audio transcripts are produced by traditional speech-to-text systems (e.g. PocketSphinx). You can also add multiple transcripts (e.g. Google and PocketSphinx) for the same sample type.
featurization flexibility - many types of features can be put into this array of the same data type. For example, an audio file can be featurized with 'standard_features' and 'praat_features' without really affecting anything. This eliminates the need to re-featurize and reduces time to sort through multiple types of featurizations during the data cleaning process.
label annotation flexibility - labels can take the form of ['classname_1', 'classname_2', ... 'classname_N'] for classification problems and [{classname_1: 'value_1'}, {classname_2: 'value_2'}, ... {classname_N: 'value_N'}] for regression problems, where values are between [0,1].
model predictions - one survey schema can be used for making model predictions and updating the schema with these predictions. Note that any model that is used for training can be used to make predictions in the load_dir.
visualization flexibility - can easily visualize features of any sample type through Allie's visualization script (e.g. tSNE plots, correlation matrices, and more).
error tracing - easily trace errors associated with featurization and/or modeling to review what is happening during a session.
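
As a rough illustration of the points above, the sketch below fills in the schema for one audio sample. It restates `make_features` in compact form so that it runs on its own. The feature values, the transcript text, the class names, and the inner 'features'/'labels' layout of each featurization entry are made up for illustration and are not taken from Allie's featurizers.

```python
import json


def make_features(sampletype):
    """Return an empty sample record following the standard feature array."""
    sampletypes = ['audio', 'text', 'image', 'video', 'csv']
    return {'sampletype': sampletype,
            'transcripts': {name: dict() for name in sampletypes},
            'features': {name: dict() for name in sampletypes},
            'models': {name: dict() for name in sampletypes},
            'labels': [],
            'errors': []}


data = make_features('audio')

# Two featurizations of the same audio file sit side by side (hypothetical values).
data['features']['audio']['standard_features'] = {'features': [0.12, 0.48, 0.33],
                                                  'labels': ['mfcc_1', 'mfcc_2', 'mfcc_3']}
data['features']['audio']['praat_features'] = {'features': [210.5],
                                               'labels': ['mean_pitch']}

# Several transcripts for one sample type, keyed by transcriber (hypothetical text).
data['transcripts']['audio']['pocketsphinx'] = 'hello world'
data['transcripts']['audio']['google'] = 'hello, world'

# Classification labels are plain class names...
data['labels'] = ['male', 'adult']
# ...whereas a regression problem would instead use {classname: value} entries.
regression_labels = [{'age': 0.35}]

# The whole record serializes to JSON, so it can be stored next to the sample.
serialized = json.dumps(data)
```

Because every featurization is stored under its own key, adding 'praat_features' later does not disturb the existing 'standard_features' entry.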

This schema is inspired by D3M-schema by the MIT media lab.

We are currently in the process of implementing this schema in the SurveyLex architecture.

easy data exports

Easily featurize and export data in .CSV format for porting data across ML platforms. This is useful for benchmarking and curating datasets that are repeatable.

Show example of this here
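
One way such an export could look is sketched below with Python's standard csv module. The function name `export_features`, the record layout (a 'features'/'labels' pair plus a 'class' name per sample), the 'class_' column name, and the output path are assumptions made for illustration; this is not Allie's actual export code.

```python
import csv
import os
import tempfile


def export_features(records, csv_path):
    """Write one row per sample: feature columns followed by a class label column.

    `records` is a list of dicts with hypothetical keys 'labels' (column names),
    'features' (numeric values), and 'class' (the sample's class name).
    """
    header = list(records[0]['labels']) + ['class_']
    with open(csv_path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for record in records:
            writer.writerow(list(record['features']) + [record['class']])
    return header


records = [
    {'labels': ['mfcc_1', 'mfcc_2'], 'features': [0.12, 0.48], 'class': 'male'},
    {'labels': ['mfcc_1', 'mfcc_2'], 'features': [0.31, 0.22], 'class': 'female'},
]
csv_path = os.path.join(tempfile.mkdtemp(), 'audio_features.csv')
header = export_features(records, csv_path)

# Read the file back to confirm that it round-trips.
with open(csv_path, newline='') as csv_file:
    rows = list(csv.reader(csv_file))
```

A flat file like this can be loaded by most ML platforms without any knowledge of the nested schema.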

settings

Settings can be modified in the settings.json file. If no settings.json file is identified, it will automatically be created with some default settings from the setup.py script.

Here are some settings that you can modify in this settings.json file and the various options for these settings:

setting	description	default setting	all options
default_audio_features	default set of audio features used for featurization (list).	["standard_features"]	["audioset_features", "audiotext_features", "librosa_features", "meta_features", "mixed_features", "opensmile_features", "praat_features", "prosody_features", "pspeech_features", "pyaudio_features", "pyaudiolex_features", "sa_features", "sox_features", "specimage_features", "specimage2_features", "spectrogram_features", "speechmetrics_features", "standard_features"]
default_text_features	default set of text features used for featurization (list).	["nltk_features"]	["bert_features", "fast_features", "glove_features", "grammar_features", "nltk_features", "spacy_features", "text_features", "w2v_features"]
default_image_features	default set of image features used for featurization (list).	["image_features"]	["image_features", "inception_features", "resnet_features", "squeezenet_features", "tesseract_features", "vgg16_features", "vgg19_features", "xception_features"]
default_video_features	default set of video features used for featurization (list).	["video_features"]	["video_features", "y8m_features"]
default_csv_features	default set of csv features used for featurization (list).	["csv_features"]	["csv_features"]
transcribe_audio	determines whether or not to transcribe an audio file via default_audio_transcriber (boolean).	True	True, False
default_audio_transcriber	the default audio transcriber if transcribe_audio == True (list).	['pocketsphinx']	['pocketsphinx', 'deepspeech_nodict', 'deepspeech_dict', 'google', 'wit', 'azure', 'bing', 'houndify', 'ibm']
transcribe_text	determines whether or not to transcribe a text file via default_text_transcriber (boolean).	True	True, False
default_text_transcriber	the default text transcriber if transcribe_text == True (list).	['raw text']	['raw text']
transcribe_image	determines whether or not to transcribe an image file via default_image_transcriber (boolean).	True	True, False
default_image_transcriber	the default image transcriber if transcribe_image == True (list).	['tesseract']	['tesseract']
transcribe_video	determines whether or not to transcribe a video file via default_video_transcriber (boolean).	True	True, False
default_video_transcriber	the default video transcriber if transcribe_video == True (list).	['tesseract_connected_over_frames']	['tesseract_connected_over_frames']
transcribe_csv	determines whether or not to transcribe a csv file via default_csv_transcriber (boolean).	True	True, False
default_csv_transcriber	the default csv transcriber if transcribe_csv == True (list).	['raw text']	['raw text']
default_training_script	the specified training script(s) to train machine learning models. Note that if you specify multiple training scripts here, they will be executed serially (list).	['tpot']	['alphapy', 'atm', 'autogbt', 'autokaggle', 'autokeras', 'auto-pytorch', 'btb', 'cvopt', 'devol', 'gama', 'hyperband', 'hypsklearn', 'hungabunga', 'imbalance-learn', 'keras', 'ludwig', 'mlblocks', 'neuraxle', 'safe', 'scsr', 'tpot']
clean_data	specifies whether or not you'd like to clean / pre-process data in folders before model training (boolean).	True	True, False
default_audio_cleaners	the specified cleaning scripts to employ when cleaning audio data	['remove_duplicates']	['remove_duplicates']
default_text_cleaners	the specified cleaning scripts to employ when cleaning text data	['remove_duplicates']	['remove_duplicates']
default_image_cleaners	the specified cleaning scripts to employ when cleaning image data	['remove_duplicates']	['remove_duplicates']
default_video_cleaners	the specified cleaning scripts to employ when cleaning video data	['remove_duplicates']	['remove_duplicates']
default_csv_cleaners	the specified cleaning scripts to employ when cleaning csv data	['remove_duplicates']	['remove_duplicates']
augment_data	specifies whether or not you'd like to augment data during training (boolean).	False	True, False
default_audio_augmenters	the specified augmentation scripts to employ when augmenting audio data	['normalize_volume', 'add_noise', 'time_stretch']	['normalize_volume', 'normalize_pitch', 'time_stretch', 'opus_enhance', 'trim_silence', 'remove_noise', 'add_noise']
default_text_augmenters	the specified augmentation scripts to employ when augmenting text data	[]	[]
default_image_augmenters	the specified augmentation scripts to employ when augmenting image data	[]	[]
default_video_augmenters	the specified augmentation scripts to employ when augmenting video data	[]	[]
default_csv_augmenters	the specified augmentation scripts to employ when augmenting csv data	[]	[]
reduce_dimensions	if True, reduce dimensions via the default_dimensionality_reducer (or set of dimensionality reducers)	False	True, False
default_dimensionality_reducer	the default dimensionality reducer or set of dimensionality reducers	["pca"]	["pca", "lda", "tsne", "plda","autoencoder"]
select_features	if True, select features via the default_feature_selector (or set of feature selectors)	False	True, False
default_feature_selector	the default feature selector or set of feature selectors	["lasso"]	["lasso", "rfe"]
scale_features	if True, scales features via the default_scaler (or set of scalers)	False	True, False
default_scaler	the default scaler (e.g. StandardScaler) to pre-process data	["standard_scaler"]	["binarizer", "one_hot_encoder", "normalize", "power_transformer", "poly", "quantile_transformer", "standard_scaler"]
create_YAML	specifies whether or not you'd like to output a production-ready repository for model deployment (boolean).	False	True, False
create_csv	if True creates .CSV files during model training and puts them in the ./data folder in the machine learning model directory; note if set to False this can speed up model training.	True	True, False
model_compress	if True compresses the model for production purposes to reduce memory consumption. Note this only can happen on Keras or scikit-learn / TPOT models for now (boolean).	False	True, False
default_outlier_detectors	the specified outlier detectors to employ when removing outliers from csv data	['isolationforest']	['isolationforest', 'zscore']
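
The fallback to default settings when no settings.json file exists can be sketched as follows. This is a minimal illustration and not the code in setup.py: `load_settings` and the `DEFAULT_SETTINGS` dictionary are hypothetical, and only a handful of the defaults from the table are included.

```python
import json
import os
import tempfile

# A small subset of the defaults listed in the settings table.
DEFAULT_SETTINGS = {
    'default_audio_features': ['standard_features'],
    'default_training_script': ['tpot'],
    'transcribe_audio': True,
    'clean_data': True,
    'augment_data': False,
    'create_csv': True,
}


def load_settings(path):
    """Load settings from `path`, creating the file with defaults if it is missing."""
    if not os.path.isfile(path):
        with open(path, 'w') as settings_file:
            json.dump(DEFAULT_SETTINGS, settings_file, indent=2)
    with open(path) as settings_file:
        settings = json.load(settings_file)
    # Fill in any keys that the file does not define.
    return {**DEFAULT_SETTINGS, **settings}


settings_path = os.path.join(tempfile.mkdtemp(), 'settings.json')
settings = load_settings(settings_path)  # the file is created on this first call

# Overwrite the file with a single edited setting, then reload.
with open(settings_path, 'w') as settings_file:
    json.dump({'default_training_script': ['keras', 'tpot']}, settings_file)
reloaded = load_settings(settings_path)
```

In this sketch, listing two training scripts mirrors the table's note that multiple scripts run one after another; keys missing from the file fall back to the defaults.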

License

This repository is licensed under a trade secret. Please do not share this code outside the core team.

Feedback

Any feedback on the book or this repository is greatly appreciated.

If you find something that is missing or doesn't work, please consider opening a GitHub issue.
If you'd like to be mentored by someone on our team, check out the Innovation Fellows Program.
If you want to talk to me directly, please send me an email at js@neurolex.co.

Additional resources

You may want to read through the wiki for additional documentation.
