Add Random Forest Step #46
Hi @Gitiauxx, Apologies, I'm just getting a chance to review this. Generally looks great; here are some initial comments:
I believe it is fixed.
Done.
Yes and no. I do not need _get_input_columns() anymore, because I created a utility that takes a patsy formula and generates the corresponding inputs for sklearn models (see from_patsy_to_array in utils.py). It also creates a list of variable names. On the related issue of generalizing away from statsmodels-based steps, see #43, which explains why we need a utility like convert_to_model to wrap the fit and predict methods of sklearn models so that they mimic what the statsmodels fit and predict methods do.
Absolutely. I believe I went over all of them. Let me know if the format works for you.
Fixed. Probably my mistake when I merged master with another branch.
Done.
Done.
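As a rough illustration of the formula-to-array idea mentioned in this reply, here is a toy sketch. It is not the from_patsy_to_array utility from the PR and it does not use patsy: the function name formula_to_arrays, the column names, and the restriction to formulas of the form 'y ~ a + b' are all assumptions made for the example.

```python
def formula_to_arrays(formula, rows):
    """Toy stand-in for a formula converter (hypothetical, not the PR's utility).

    Handles only formulas of the form 'y ~ a + b'. Returns the feature rows,
    the target values, and the list of feature names, which is the shape of
    input that fit(X, y) style estimators expect.
    """
    lhs, rhs = formula.split("~")
    target = lhs.strip()
    columns = [term.strip() for term in rhs.split("+")]
    X = [[row[c] for c in columns] for row in rows]
    y = [row[target] for row in rows]
    return X, y, columns


# Made-up rows purely for illustration.
rows = [
    {"rent": 1200, "sqft": 600, "beds": 1},
    {"rent": 2100, "sqft": 950, "beds": 2},
]
X, y, names = formula_to_arrays("rent ~ sqft + beds", rows)
```

A real converter built on patsy would also handle transformations, categorical terms, and the intercept, none of which this sketch attempts.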
Awesome, thank you for all the fixes! This looks great. I left some related comments in issues #43 and #50 too. The functionality of the templates looks good, and the cross-validation is going to be really helpful.
A couple more issues with the tests:
The cross-validation and gradient boosting tests use data files that aren't committed to the repo. (I've just been using fake data for tests so far, but real data seems great.) I forget, are the files small enough that we can just add them to the repo? Under 1 MB would be great. If not, would the tests still work with smaller excerpts of the data?
The file paths in those tests are also Windows-specific; you can use
And you can move the neural network tests to the branch with that model.
Once the tests run on my machine, let's merge this!
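On the Windows-specific file paths: one common way to keep test paths portable is to build them with os.path.join relative to the test module. The sketch below is only an illustration under assumptions; the helper name data_path, the "data" folder, and the file name rentals.csv are hypothetical and are not taken from this PR.

```python
import os


def data_path(filename, base_dir=None):
    """Build a path to a test data file that works on Windows, macOS and Linux.

    Assumes (hypothetically) that data files live in a "data" folder next to
    the test module. os.path.join inserts the right separator for the platform.
    """
    if base_dir is None:
        # Directory containing this test module.
        base_dir = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base_dir, "data", filename)


# Example with an explicit base directory (hypothetical names).
example = data_path("rentals.csv", base_dir="tests")
```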
Just took a look at the merge conflicts (easy to fix in the web interface).
Thanks, @smmaurer. I will go over this and develop a comprehensive battery of tests for this PR using pytest.
I have started running a battery of tests (see the most recent commits) for the code I wrote. So far they have passed on my machine (Windows). You may want to try them on a Mac.
@Gitiauxx Looking good, most of them are passing for me too. Cross validation is failing with
Looking at the yaml for Random Forest, I noticed a couple of things:
No problem. I believe I fixed everything. cross_validate was failing because I forgot to push my last edit of shared.py.
Thanks for creating this! Will there be a RF classifier on the way?
This PR adds a template for RandomForestStep.
The template automatically creates a random forest model using the sklearn library, with fit and run methods. The code can be found in regression.RandomForestStep.
For an example with real data, see machine_learning_steps_rental_prices
The model can be saved in the configs directory as a yaml file and a pickled file.
Usage
A random forest is created as follows:
m = RandomForestRegressionStep()
RandomForestRegressionStep takes the following inputs:
m.model_expression = '... ~ ...'
m.tables = '...'
m.name = '....'
The main difference from previous steps is that the model, once built, cannot be saved completely in a yaml file and has to be pickled.
To save a random forest model, use the register method from modelmanager.py:
modelmanager.register(m)
which adds a step to orca with the name
m.name
and pickles the model to a file saved in the configs directory.
To load a model that has been saved in the configs directory, use the initialize method from modelmanager.py:
modelmanager.initialize()
which unpickles the model and turns it back into a RandomForestStep object.
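The save and load round trip described above can be sketched with the standard library. This is a minimal illustration, not the modelmanager implementation: save_model, load_model, and the dictionary standing in for a fitted model are hypothetical, and the stdlib pickle module is used in place of dill, which exposes dump and load with the same call shape.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a model object to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Read a previously pickled model object back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for the state of a fitted step.
fitted_state = {"name": "rf-example", "n_estimators": 10, "importance": [0.7, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf-example.pkl")
    save_model(fitted_state, path)
    restored = load_model(path)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.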
Fitting and cross-validation
The random forest step includes some validation metrics:
m.importance
m.cross_validate_scorer(nsplits=k)
returns the mean of the errors across all folds.
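To make "the mean of the errors across all folds" concrete, here is a minimal k-fold sketch in plain Python. It is not the cross_validate_scorer implementation: the error metric (mean squared error), the contiguous unshuffled folds, and the MeanModel stand-in estimator are all assumptions chosen to keep the example self-contained.

```python
def k_fold_indices(n_rows, n_splits):
    """Split range(n_rows) into n_splits contiguous folds of near-equal size."""
    base, extra = divmod(n_rows, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class MeanModel:
    """Hypothetical stand-in estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def cross_validate_mse(make_model, X, y, n_splits):
    """Fit on k-1 folds, score on the held-out fold, and average the errors."""
    errors = []
    for test_idx in k_fold_indices(len(y), n_splits):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = make_model().fit([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        mse = sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx)) / len(test_idx)
        errors.append(mse)
    return sum(errors) / len(errors)


# Made-up data purely for illustration.
X = [[0], [1], [2], [3]]
y = [1.0, 1.0, 3.0, 3.0]
score = cross_validate_mse(MeanModel, X, y, n_splits=2)
```

Any object with fit(X, y) and predict(X) methods could be passed in place of MeanModel.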
Additional libraries
RandomForestStep is built on top of the sklearn machine learning library. Version 0.19.2 or more recent is required. In addition, to efficiently save and load complicated random forest models, modelmanager provides a pickle/unpickle utility based on dill (version 0.2.8.2 or more recent).