
Survival Analysis for Churn Prediction #275

Merged
merged 20 commits, Jul 23, 2019

Conversation

klmilam
Contributor

@klmilam klmilam commented Jun 24, 2019

Pull Request Template

Please go through these steps before you submit a PR.

  1. Make sure that your PR is not a duplicate.

  2. Before submitting the PR, review your changes:

    • Adjust the code to the existing style.
    • Add a set of unit tests for the changes and ensure they all pass.
    • Add accompanying README.md file with instructions on usage. See awesome-readme for good examples of high-quality READMEs.
    • Add a link to your contribution in the top-level README (alpha-order).
    • Add Apache 2.0 license headers with an up-to-date copyright date attributed to Google LLC.
    • Remove unnecessary LICENSE files. There's no need to include an additional license since all repository submissions are covered by the top-level Apache 2.0 license.
    • For new tools/examples, file your project with the PSO Engineering Council.
  3. After these steps, you're ready to open a pull request.

    • Give a descriptive title to your PR.
    • Provide a description of your changes.
    • Include the Engineering Council bug ID with your pull request.
    • Put closes #XXXX in your comment to auto-close the issue that your PR fixes (if any).

PLEASE REMOVE THIS TEMPLATE BEFORE SUBMITTING

@googlebot added the cla: yes (All committers have signed a CLA) label Jun 24, 2019
@klmilam
Contributor Author

klmilam commented Jun 24, 2019

Eng Council bug ID b/135918337

Member

@TheMichaelHu left a comment

Reviewing as ML SME (still requires Python readability)

end_date: None if subscription has not yet ended
active:
"""
d1 = datetime.strptime('1/1/2018', '%m/%d/%Y')
Member

I feel like these dates should be pulled into some sort of constants file or into parse_args. Someone with a new dataset they would want to apply to this template would likely want to edit the dates without digging through the code.

Contributor Author

These dates are just used to generate fake data, so anybody adapting the code for their own purposes is unlikely to use this function at all.
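If the dates ever did need to be configurable, one option along the lines of the reviewer's suggestion would be to surface them as flags. This is a hypothetical sketch (the flag names, defaults, and `parse_args` signature are assumptions, not from the PR):

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Sketch: expose the fake-data date range as flags instead of
    hard-coding it inside the data-generation function."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--fake-data-start-date',
        default='1/1/2018',
        help='Start of the synthetic subscription window (m/d/Y).')
    parser.add_argument(
        '--fake-data-end-date',
        default='12/31/2018',
        help='End of the synthetic subscription window (m/d/Y).')
    args = parser.parse_args(argv)
    # Parse once here so downstream code only sees datetime objects.
    args.start = datetime.strptime(args.fake_data_start_date, '%m/%d/%Y')
    args.end = datetime.strptime(args.fake_data_end_date, '%m/%d/%Y')
    return args
```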

Member

@TheMichaelHu left a comment

Second pass over the TensorFlow code. Consider using a linter; for Sublime, go/sublime recommends one, I think.

Contribution of each time interval to the loss:
ln(hazard) for each

Based off of: https://peerj.com/articles/6257.pdf
Member

Optional: May be worth adding an args section, especially for num_intervals
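A sketch of what such an Args section could look like, in Google docstring style. The function name, signature, and wording here are hypothetical; only num_intervals and the loss description come from the PR:

```python
def survival_loss(labels, predictions, num_intervals):
    """Computes the contribution of each time interval to the loss.

    Based off of: https://peerj.com/articles/6257.pdf

    Args:
        labels: Per-example indicators of which intervals were survived
            and, if churn occurred, the interval of the event.
        predictions: Per-interval conditional hazard probabilities
            output by the model.
        num_intervals: Number of time intervals the observation window
            is divided into; determines the length of the label and
            prediction vectors.

    Returns:
        A scalar loss (negative log-likelihood summed over intervals).
    """
    raise NotImplementedError('Docstring sketch only.')
```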



@beam.ptransform_fn
def shuffle(p):

@sabhyankar I'm seeing this pattern pop up in a lot of ML SCE's Beam pipelines.
In order to get randomly ordered and randomly partitioned data, they force a shuffle and use some kind of train-test-eval split function.
While PCollections don't guarantee ordering, they also don't randomize the data order unnecessarily.
If an input PCollection has some ordering to it (say by date) and you write it, you'll see ordered output (i.e. this gist demonstrates that while it forms random partitions, order is maintained when writing each of them).
Do you know any better way to randomize the order of the data without forcing an expensive shuffle between nodes with GroupByKey?
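For reference, the pattern under discussion (in Beam: tag each element with a random key, GroupByKey, then drop the key — essentially what beam.Reshuffle does) can be sketched in plain Python. Note how each partition keeps the relative order of its elements, matching the gist's observation:

```python
import random


def random_key_shuffle(records, num_buckets=100, seed=None):
    """Plain-Python sketch of the Beam random-key shuffle pattern:
    tag each record with a random key, group by key, then flatten.
    Partitions come out in effectively random order, but within each
    partition the relative input order of elements is preserved."""
    rng = random.Random(seed)
    buckets = {}
    for record in records:
        # Random key assignment stands in for beam.Map + GroupByKey.
        buckets.setdefault(rng.randrange(num_buckets), []).append(record)
    shuffled = []
    for key in buckets:  # keys appear in the (random) order first drawn
        shuffled.extend(buckets[key])
    return shuffled
```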


@klmilam don't consider this comment blocking. sameer is OOO and I'm just curious if there's a better pattern here.

# limitations under the License.
"""Start preprocessing job for Survival Analysis TFRecords."""

import logging

IMO this doesn't warrant its own file; move this into preprocess.py.

Contributor Author

This is following the juliaset Beam example, which is generally used as a template for Python on Dataflow: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/juliaset_main.py

I'm fine with moving it into preprocess.py; I just want to make sure it's following best practices. Should I still move it?


Touché, not sure why that example is like that.
You'll see other example pipelines don't do that, e.g. the canonical wordcount example: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py#L114
As far as best practice goes, I think it's a bit ambiguous; I just think it's extra files for no real added value, which is why I said IMO.
I'll leave it up to you.
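The single-file layout being suggested (as in the wordcount example) would look roughly like this — a hypothetical sketch with the entry point living in preprocess.py itself rather than in a separate main module:

```python
"""Preprocessing job for Survival Analysis TFRecords (layout sketch)."""

import logging


def run(argv=None):
    """Builds and runs the Beam pipeline.

    In the wordcount-style layout this function and the entry point
    below live in the same module, so no separate *_main.py is needed.
    """
    # Pipeline construction would go here. Returning argv keeps this
    # sketch runnable without Apache Beam installed.
    return argv or []


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
```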

dataset_type, flags.output_dir, metadata)


def run():

This is confusing, as beam.Pipeline has a run method.
In fact, AFAICT this method doesn't seem to run the pipeline at all; it just constructs a pipeline object.
I believe you need to call p.run() (assuming p is an instance of beam.Pipeline).

Contributor Author

I don't think p.run() is required if you're using it with "with beam.Pipeline(runner, options=options) as p:". At least, I've never seen that in any code samples.

Would it be less confusing if I renamed this function? (although using run() also follows the Juliaset example)
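klmilam is right about the with-statement: beam.Pipeline's context manager runs the pipeline on exit, so no explicit p.run() is needed inside the with block. A minimal stand-in illustrating that behavior (not Beam's actual implementation — just a sketch of the documented context-manager semantics):

```python
class PipelineSketch:
    """Stand-in showing why `with beam.Pipeline(...) as p:` needs no
    explicit p.run(): the context manager's __exit__ runs the pipeline
    and waits for it to finish."""

    def __init__(self):
        self.ran = False

    def run(self):
        self.ran = True
        return self  # Beam returns a PipelineResult here

    def wait_until_finish(self):
        return 'DONE'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Only run if the with-block body raised no exception.
        if exc_type is None:
            self.run().wait_until_finish()


with PipelineSketch() as p:
    pass  # transforms would be applied to p here
```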

Member

@TheMichaelHu left a comment

TensorFlow code LGTM, just need to pass the readability review. Remember to rebase and make sure the top-level README is okay since the recommendation solution got merged in.

@jaketf left a comment

LGTM.

@klmilam
Contributor Author

klmilam commented Jul 23, 2019

Great! @jaketf can you merge it?

@jaketf jaketf merged commit 05908cd into GoogleCloudPlatform:master Jul 23, 2019
holtskinner pushed a commit to holtskinner/professional-services that referenced this pull request Jul 29, 2019
* Add churn prediction code to PSO repo

* Small fixes to scripts so they can be run from main directory

* Slight formatting fixes

* remove virtual environment directory

* Convert preprocessing code to be Python 3 compliant

* Add information about dataset to README.md

* Remove virtualenv directory

* Add cloudml-churn-prediction description to main README.md

* update virtualenv info on README.md

* Fix typos

* Fix function names to align with style guide

* Fix parser syntax

* Switch to idiomatic dict comprehension

* Build Beam pipeline in helper function

* Add extra line after docstring

* Fix linter errors.

* fix style of get_query fn name

* Add unittesting

* Fix typo in TensorBoard instructions