Survival Analysis for Churn Prediction #275
Conversation
Eng Council bug ID b/135918337
Reviewing as ML SME (still requires Python readability).
end_date: None if subscription has not yet ended
active:
"""
d1 = datetime.strptime('1/1/2018', '%m/%d/%Y')
I feel like these dates should be pulled into some sort of constants file or into parse_args. Someone applying this template to a new dataset would likely want to edit the dates without digging through the code. (A rough sketch of what I mean is below.)
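For illustration only, here is a minimal sketch of exposing the dates through parse_args; the flag names and the end-date default are hypothetical, not taken from the PR:

```python
# Hypothetical sketch: expose the fake-data date range as flags instead of
# hard-coding it inside the function. Flag names and defaults are illustrative.
import argparse
from datetime import datetime


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--fake-data-start-date',
        default='1/1/2018',
        help='Earliest subscription start date used when generating fake data (m/d/Y).')
    parser.add_argument(
        '--fake-data-end-date',
        default='1/1/2019',  # hypothetical default
        help='Latest subscription end date used when generating fake data (m/d/Y).')
    args, _ = parser.parse_known_args(argv)
    # Parse once here so downstream code works with datetime objects.
    args.fake_data_start_date = datetime.strptime(args.fake_data_start_date, '%m/%d/%Y')
    args.fake_data_end_date = datetime.strptime(args.fake_data_end_date, '%m/%d/%Y')
    return args
```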
These dates are just used to generate fake data, so anybody editing the code for their own purposes is unlikely to use this function at all.
Second pass over TensorFlow code. Consider using a linter. For Sublime, go/sublime recommends one, I think.
Contribution of each time interval to the loss:
ln(hazard) for each

Based off of: https://peerj.com/articles/6257.pdf
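For context, the docstring appears to describe the discrete-time survival likelihood from the cited paper. A minimal NumPy sketch of that idea, assuming per-interval event and survival indicator labels (which may not match the PR's exact tensor layout), could look like:

```python
import numpy as np


def survival_nll(hazards, event, survived, eps=1e-7):
    """Negative log-likelihood for a discrete-time survival model (sketch).

    Args:
        hazards: [batch, num_intervals] predicted hazard probabilities.
        event: [batch, num_intervals] 1 where churn occurred in that interval, else 0.
        survived: [batch, num_intervals] 1 where the subscription was observed to
            survive that interval, else 0 (censored intervals contribute nothing).

    Returns:
        Scalar mean negative log-likelihood.
    """
    hazards = np.clip(hazards, eps, 1.0 - eps)
    # Each interval contributes ln(hazard) if the event happened in it and
    # ln(1 - hazard) if it was survived, per the cited paper's formulation.
    log_lik = event * np.log(hazards) + survived * np.log(1.0 - hazards)
    return -np.mean(np.sum(log_lik, axis=1))
```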
Optional: may be worth adding an Args section, especially for num_intervals.
@beam.ptransform_fn
def shuffle(p):
@sabhyankar I'm seeing this pattern pop up in a lot of ML SCEs' Beam pipelines.
In order to get randomly ordered and randomly partitioned data, they are forcing a shuffle and having some kind of train-test-eval split function.
While PCollections don't guarantee ordering, they also don't randomize the data order unnecessarily.
If an input PCollection has some ordering to it (say by date) and you write it, you'll see ordered output [i.e. this gist demonstrating that while it forms random partitions, order is maintained when writing each of them].
Do you know any better way to randomize the order of the data w/o forcing an expensive shuffle between nodes w/ GroupByKey?
@klmilam don't consider this comment blocking. Sameer is OOO and I'm just curious if there's a better pattern here.
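For reference, the forced-shuffle pattern being discussed usually looks roughly like the sketch below: pair each element with a random key, GroupByKey to force a redistribution, then drop the keys. This is an illustration of the pattern, not the exact code in this PR:

```python
# Sketch of the forced-shuffle pattern: random key -> GroupByKey -> drop keys.
import random

import apache_beam as beam


@beam.ptransform_fn
def shuffle(pcoll):
    return (
        pcoll
        | 'AddRandomKey' >> beam.Map(lambda element: (random.random(), element))
        | 'GroupByRandomKey' >> beam.GroupByKey()
        | 'DropKeys' >> beam.FlatMap(lambda key_values: key_values[1]))
```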
# limitations under the License.
"""Start preprocessing job for Survival Analysis TFRecords."""

import logging
IMO this doesn't warrant its own file; move this into preprocess.py.
This is following the juliaset Beam example, which is generally used as a template for Python on Dataflow: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/juliaset_main.py
I'm fine with moving it into preprocess.py, just want to make sure it's following best practices. Should I still move it?
Touché, not sure why that example is like that.
You'll see other example pipelines don't do that, e.g. the canonical wordcount example: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py#L114
As far as best practice here, I think it's a bit ambiguous; I just think it's extra files for no real added value, which is why I said IMO.
I'll leave it up to you.
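For comparison, the wordcount-style layout keeps the entrypoint in the same module; a minimal sketch of that structure (names are illustrative, not the PR's actual code):

```python
# Sketch of the single-file layout: pipeline construction and the __main__
# entrypoint live in preprocess.py itself, rather than in a separate
# *_main.py module. Function body is a placeholder.
import logging


def run(argv=None):
    """Builds and runs the preprocessing pipeline."""
    # ... build the Beam pipeline here ...


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
```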
dataset_type, flags.output_dir, metadata)


def run():
This is confusing, as beam.Pipeline has a run method.
In fact, AFAICT this method doesn't seem to run the pipeline at all; it just constructs a pipeline object.
I believe you need to call p.run() (assuming p is an instance of beam.Pipeline).
I don't think p.run() is required if you're using it with "with beam.Pipeline(runner, options=options) as p:". I've at least never seen that in any code samples.
Would it be less confusing if I renamed this function? (Although using run() also follows the juliaset example.)
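For what it's worth, the context-manager form runs the pipeline and waits for completion when the `with` block exits, which is why no explicit p.run() is needed. A minimal sketch, with an illustrative transform rather than the PR's actual pipeline:

```python
# beam.Pipeline used as a context manager runs the pipeline and waits for it
# to finish on exit from the `with` block, so no explicit p.run() is needed.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_and_run_pipeline():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            | 'Create' >> beam.Create([1, 2, 3])
            | 'Double' >> beam.Map(lambda x: x * 2))
    # Reaching this point means the pipeline has run to completion.
```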
TensorFlow code LGTM, just needs to pass the readability review. Remember to rebase and make sure the top-level README is okay since the recommendation solution got merged in.
LGTM.
Great! @jaketf can you merge it?
* Add churn prediction code to PSO repo
* Small fixes to scripts so they can be run from main directory
* Slight formatting fixes
* Remove virtual environment directory
* Convert preprocessing code to be Python 3 compliant
* Add information about dataset to README.md
* Remove virtualenv directory
* Add cloudml-churn-prediction description to main README.md
* Update virtualenv info on README.md
* Fix typos
* Fix function names to align with style guide
* Fix parser syntax
* Switch to idiomatic dict comprehension
* Build Beam pipeline in helper function
* Add extra line after docstring
* Fix linter errors
* Fix style of get_query fn name
* Add unit testing
* Fix typo in TensorBoard instructions