
Survival Analysis for Churn Prediction #275

Merged
merged 20 commits, Jul 23, 2019

Conversation

klmilam
Contributor

@klmilam klmilam commented Jun 24, 2019

Pull Request Template

Please go through these steps before you submit a PR.

  1. Make sure that your PR is not a duplicate.

  2. Before submitting the PR, review your changes:

    • Adjust the code to the existing style.
    • Add a set of unit tests for the changes and ensure they all pass.
    • Add accompanying README.md file with instructions on usage. See awesome-readme for good examples of high-quality READMEs.
    • Add a link to your contribution in the top-level README (alpha-order).
    • Add Apache 2.0 license headers with an up-to-date copyright date attributed to Google LLC.
    • Remove unnecessary LICENSE files. There's no need to include an additional license since all repository submissions are covered by the top-level Apache 2.0 license.
    • For new tools/examples, file your project with the PSO Engineering Council.
  3. After these steps, you're ready to open a pull request.

    • Give a descriptive title to your PR.
    • Provide a description of your changes.
    • Include the Engineering Council bug ID with your pull request.
    • Put closes #XXXX in your comment to auto-close the issue that your PR fixes (if any).

PLEASE REMOVE THIS TEMPLATE BEFORE SUBMITTING

@googlebot added the cla: yes (All committers have signed a CLA) label Jun 24, 2019
@klmilam
Contributor Author

klmilam commented Jun 24, 2019

Eng Council bug ID b/135918337

Member

@TheMichaelHu left a comment

Reviewing as ML SME (still requires Python readability)

end_date: None if subscription has not yet ended
active:
"""
d1 = datetime.strptime('1/1/2018', '%m/%d/%Y')
Member

I feel like these dates should be pulled into some sort of constants file or into parse_args. Someone with a new dataset they would want to apply to this template would likely want to edit the dates without digging through the code.

Contributor Author

These dates are just used to generate fake data, so anybody adapting the code for their own purposes is unlikely to use this function at all.
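If the dates ever did need to be configurable, one option along the lines of the reviewer's suggestion would be to surface them as flags. This is a hypothetical sketch (the flag names, defaults, and `parse_args` signature are assumptions, not from the PR):

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Sketch: expose the fake-data date range as flags instead of
    hard-coding it inside the data-generation function."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--fake-data-start-date',
        default='1/1/2018',
        help='Start of the synthetic subscription window (m/d/Y).')
    parser.add_argument(
        '--fake-data-end-date',
        default='12/31/2018',
        help='End of the synthetic subscription window (m/d/Y).')
    args = parser.parse_args(argv)
    # Parse once here so downstream code only sees datetime objects.
    args.start = datetime.strptime(args.fake_data_start_date, '%m/%d/%Y')
    args.end = datetime.strptime(args.fake_data_end_date, '%m/%d/%Y')
    return args
```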

Member

@TheMichaelHu left a comment

Second pass over the TensorFlow code. Consider using a linter; for Sublime, go/sublime recommends one, I think.

Contribution of each time interval to the loss:
ln(hazard) for each

Based off of: https://peerj.com/articles/6257.pdf
Member

Optional: May be worth adding an args section, especially for num_intervals
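A sketch of what such an Args section could look like, in Google docstring style. The function name, signature, and wording here are hypothetical; only num_intervals and the loss description come from the PR:

```python
def survival_loss(labels, predictions, num_intervals):
    """Computes the contribution of each time interval to the loss.

    Based off of: https://peerj.com/articles/6257.pdf

    Args:
        labels: Per-example indicators of which intervals were survived
            and, if churn occurred, the interval of the event.
        predictions: Per-interval conditional hazard probabilities
            output by the model.
        num_intervals: Number of time intervals the observation window
            is divided into; determines the length of the label and
            prediction vectors.

    Returns:
        A scalar loss (negative log-likelihood summed over intervals).
    """
    raise NotImplementedError('Docstring sketch only.')
```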



@beam.ptransform_fn
def shuffle(p):

@sabhyankar I'm seeing this pattern pop up in a lot of ML SCE's Beam pipelines.
In order to get randomly ordered and randomly partitioned data, they force a shuffle and use some kind of train-test-eval split function.
While PCollections don't guarantee ordering, they also don't randomize the data order unnecessarily.
If an input PCollection has some ordering to it (say by date) and you write it, you'll see ordered output (i.e. this gist demonstrates that while it forms random partitions, order is maintained when writing each of them).
Do you know any better way to randomize the order of the data without forcing an expensive shuffle between nodes with GroupByKey?
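For reference, the pattern under discussion (in Beam: tag each element with a random key, GroupByKey, then drop the key — essentially what beam.Reshuffle does) can be sketched in plain Python. Note how each partition keeps the relative order of its elements, matching the gist's observation:

```python
import random


def random_key_shuffle(records, num_buckets=100, seed=None):
    """Plain-Python sketch of the Beam random-key shuffle pattern:
    tag each record with a random key, group by key, then flatten.
    Partitions come out in effectively random order, but within each
    partition the relative input order of elements is preserved."""
    rng = random.Random(seed)
    buckets = {}
    for record in records:
        # Random key assignment stands in for beam.Map + GroupByKey.
        buckets.setdefault(rng.randrange(num_buckets), []).append(record)
    shuffled = []
    for key in buckets:  # keys appear in the (random) order first drawn
        shuffled.extend(buckets[key])
    return shuffled
```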


@klmilam don't consider this comment blocking. sameer is OOO and I'm just curious if there's a better pattern here.

# limitations under the License.
"""Start preprocessing job for Survival Analysis TFRecords."""

import logging

IMO this doesn't warrant its own file; move this into preprocess.py.

Contributor Author

This is following the juliaset Beam example, which is generally used as a template for Python on Dataflow: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/juliaset_main.py

I'm fine with moving it into preprocess.py; I just want to make sure it's following best practices. Should I still move it?


Touché, not sure why that example is like that.
You'll see other example pipelines don't do that, e.g. the canonical wordcount example: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py#L114
As far as best practice goes, I think it's a bit ambiguous; I just think it's extra files for no real added value, which is why I said IMO.
I'll leave it up to you.
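The single-file layout being suggested (as in the wordcount example) would look roughly like this — a hypothetical sketch with the entry point living in preprocess.py itself rather than in a separate main module:

```python
"""Preprocessing job for Survival Analysis TFRecords (layout sketch)."""

import logging


def run(argv=None):
    """Builds and runs the Beam pipeline.

    In the wordcount-style layout this function and the entry point
    below live in the same module, so no separate *_main.py is needed.
    """
    # Pipeline construction would go here. Returning argv keeps this
    # sketch runnable without Apache Beam installed.
    return argv or []


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
```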

dataset_type, flags.output_dir, metadata)


def run():

This is confusing, as beam.Pipeline has a run method.
In fact, AFAICT this method doesn't seem to run the pipeline at all; it just constructs a pipeline object.
I believe you need to call p.run() (assuming p is an instance of beam.Pipeline).

Contributor Author

I don't think p.run() is required if you're using it with "with beam.Pipeline(runner, options=options) as p:". At least, I've never seen that in any code samples.

Would it be less confusing if I renamed this function? (although using run() also follows the Juliaset example)
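klmilam is right about the with-statement: beam.Pipeline's context manager runs the pipeline on exit, so no explicit p.run() is needed inside the with block. A minimal stand-in illustrating that behavior (not Beam's actual implementation — just a sketch of the documented context-manager semantics):

```python
class PipelineSketch:
    """Stand-in showing why `with beam.Pipeline(...) as p:` needs no
    explicit p.run(): the context manager's __exit__ runs the pipeline
    and waits for it to finish."""

    def __init__(self):
        self.ran = False

    def run(self):
        self.ran = True
        return self  # Beam returns a PipelineResult here

    def wait_until_finish(self):
        return 'DONE'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Only run if the with-block body raised no exception.
        if exc_type is None:
            self.run().wait_until_finish()


with PipelineSketch() as p:
    pass  # transforms would be applied to p here
```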

Member

@TheMichaelHu left a comment

TensorFlow code LGTM, just need to pass the readability review. Remember to rebase and make sure the top-level README is okay since the recommendation solution got merged in.

@jaketf left a comment

LGTM.

@klmilam
Contributor Author

klmilam commented Jul 23, 2019

Great! @jaketf can you merge it?

@jaketf jaketf merged commit 05908cd into GoogleCloudPlatform:master Jul 23, 2019
holtskinner pushed a commit to holtskinner/professional-services that referenced this pull request Jul 29, 2019
* Add churn prediction code to PSO repo

* Small fixes to scripts so they can be run from main directory

* Slight formatting fixes

* remove virtual environment directory

* Convert preprocessing code to be Python 3 compliant

* Add information about dataset to README.md

* Remove virtualenv directory

* Add cloudml-churn-prediction description to main README.md

* update virtualenv info on README.md

* Fix typos

* Fix function names to align with style guide

* Fix parser syntax

* Switch to idiomatic dict comprehension

* Build Beam pipeline in helper function

* Add extra line after docstring

* Fix linter errors.

* fix style of get_query fn name

* Add unittesting

* Fix typo in TensorBoard instructions