Rework example and e2e test script #126

TimZaman · 2019-01-11T06:06:24Z

Submitted as WIP. Everything is working; I'm just wrapping up my tests for the MPI backend. Let me know if this is OK. There is quite a bit of redundancy, small bugs, and outdated or unmaintained code in the examples dir, so I decided to take a stab at cleaning it up! :-)

(Also, I don't think the MPI backend should be promoted; the kubeflow/MPI-operator is superior for this purpose, since it is tailored for MPI and topology awareness.)

- Unify many copies of mnist training scripts into a single one.
  - Backends: Gloo, NCCL, MPI.
  - Based on the official pytorch/examples/mnist example.
  - Uses TensorboardX for summary writing.
  - Uses a backend argument to select backend (if applicable).
- Update all examples and dockerfiles to PyTorch 1.0.
- Remove examples using deprecated TCP backend.
- Remove old v1alpha1 examples.
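
A backend argument of the kind listed above could look roughly like the sketch below. This is a minimal illustration, not the code from this PR: the function names `parse_args`, `should_distribute`, and `init_distributed`, and the `WORLD_SIZE` check, are assumptions. `torch` is imported lazily so that argument parsing works even where PyTorch is not installed.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command line; --backend selects the distributed backend."""
    parser = argparse.ArgumentParser(description="MNIST example (sketch)")
    parser.add_argument(
        "--backend",
        type=str,
        choices=["gloo", "nccl", "mpi"],
        default="gloo",
        help="Distributed backend to use",
    )
    return parser.parse_args(argv)


def should_distribute():
    """Assumption: more than one worker is signalled via WORLD_SIZE."""
    return int(os.environ.get("WORLD_SIZE", "1")) > 1


def init_distributed(backend):
    """Initialise torch.distributed only when running with several workers.

    Returns True if a process group was created, False otherwise.
    """
    if not should_distribute():
        return False
    import torch.distributed as dist  # imported lazily; requires PyTorch

    # With no init_method given, the default env:// initialisation reads
    # its rendezvous settings from environment variables.
    dist.init_process_group(backend=backend)
    return True
```

A script built this way would call `init_distributed(parse_args().backend)` once at startup and fall back to single-process training when it returns `False`.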


k8s-ci-robot · 2019-01-11T06:06:38Z

Hi @TimZaman. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

coveralls · 2019-01-11T06:26:46Z

Coverage remained the same at 73.269% when pulling 93e118f on TimZaman:tzaman/refactor-examples into 9261b60 on kubeflow:master.

johnugeorge · 2019-01-15T18:22:56Z

/ok-to-test

johnugeorge · 2019-01-15T18:42:48Z

@TimZaman Great work. Thanks for your contributions

A few comments:

- The tcp-mnist image is currently built and run during the CI process. Since the tcp example is removed, you need to point CI to the new example for it to pass.
  The image is built in
  https://github.com/kubeflow/pytorch-operator/blob/master/scripts/build.sh#L49
  The built image is used in the tests under
  https://github.com/kubeflow/pytorch-operator/tree/master/scripts/v1alpha2
  https://github.com/kubeflow/pytorch-operator/tree/master/scripts/v1beta1
- Can you add/merge details from the existing READMEs into the new one?

/cc @andreyvelich
/cc @Akado2009


TimZaman · 2019-01-15T19:54:33Z

Ok thanks guys, I'll take it that this PR is appreciated (it does update a ton of stuff in examples so wasn't sure). I'll move forward to address comments and remove the WIP state.


TimZaman · 2019-01-17T00:59:54Z

All comments addressed. As I'm unfamiliar with this project's CI process, I'll see how things go wrt the build.sh.

johnugeorge · 2019-01-17T08:31:19Z

@TimZaman CI tests fail because mnist tests have failed.

https://github.com/kubeflow/pytorch-operator/blob/master/scripts/v1alpha2/run-defaults.sh#L49
https://github.com/kubeflow/pytorch-operator/blob/master/scripts/v1alpha2/run-cleanpodpolicy-all.sh#L50

Each test has a timeout of 10 minutes. I think the tests fail because they don't complete within 10 minutes and time out. Since we are not concerned about the accuracy of the example, we can set the example's default number of epochs to 1 (https://github.com/kubeflow/pytorch-operator/pull/126/files#diff-f5284f418f68386cd925f0081ba5dcc9R83). WDYT?
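
The suggestion above amounts to changing one default in the example's argument parser. A minimal sketch, assuming the script uses `argparse` with an `--epochs` flag (the flag name is modelled on the upstream pytorch/examples MNIST script; the helper names `build_parser` and `epochs_to_run` are hypothetical):

```python
import argparse


def build_parser():
    """Build a parser whose default epoch count keeps CI runs short."""
    parser = argparse.ArgumentParser(description="MNIST example (sketch)")
    # A default of 1 favours a fast run; pass --epochs N for real training.
    parser.add_argument(
        "--epochs",
        type=int,
        default=1,
        metavar="N",
        help="number of epochs to train (default: 1)",
    )
    return parser


def epochs_to_run(args):
    """Return the 1-based epoch numbers the training loop would iterate over."""
    return list(range(1, args.epochs + 1))
```

Anyone who cares about accuracy can still override the default on the command line, so the shorter run only affects invocations that pass no `--epochs` value.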

TimZaman · 2019-01-17T08:44:29Z

SGTM. How do you know it's timeouts?


johnugeorge · 2019-01-17T09:07:15Z

Timeout is a parameter in the test.
https://github.com/kubeflow/pytorch-operator/blob/master/test/e2e/v1alpha2/defaults.go#L33
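
For readers unfamiliar with how such a test timeout behaves, the general pattern is a polling loop with a deadline. The sketch below is in Python for illustration only; it is not the Go code in defaults.go, and `wait_for`, `check`, and the interval value are hypothetical.

```python
import time


def wait_for(check, timeout_s, interval_s=1.0):
    """Poll check() until it returns True or timeout_s seconds have passed.

    Returns True if check() succeeded in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Under this pattern, a job that needs longer than the limit is reported as failed even if it would eventually succeed, which is consistent with the failure mode discussed in this thread.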

johnugeorge · 2019-01-17T09:08:52Z

/retest

johnugeorge · 2019-01-17T09:52:27Z

Tests passed. Thanks @TimZaman

/lgtm
/approve

k8s-ci-robot · 2019-01-17T09:52:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added do-not-merge/work-in-progress size/XXL labels Jan 11, 2019

k8s-ci-robot requested review from elsonrodriguez and johnugeorge January 11, 2019 06:06

k8s-ci-robot added the needs-ok-to-test label Jan 11, 2019

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jan 15, 2019

k8s-ci-robot requested review from Akado2009 and andreyvelich January 15, 2019 18:42

johnugeorge reviewed Jan 15, 2019

examples/mnist/mnist.py (outdated)

johnugeorge reviewed Jan 15, 2019

examples/mnist/mnist.py (outdated)

andreyvelich reviewed Jan 15, 2019

examples/mnist/mnist.py (outdated)

examples/mnist/mnist.py

TimZaman added 4 commits January 16, 2019 16:56

Update base image used in smoke-dist, remove unused v1alpha1

6b3d86a

Refactor the mnist example to work with Katib

5603d1c

Upgrade readme with latest examples and directories

1ac3901

TimZaman changed the title ~~[WIP]: Rework examples~~ Rework example and e2e test script Jan 17, 2019

k8s-ci-robot removed the do-not-merge/work-in-progress label Jan 17, 2019

Change default number of epochs to 1

93e118f

Should help reduce the chance of CI timeout

k8s-ci-robot assigned johnugeorge Jan 17, 2019

k8s-ci-robot added the lgtm label Jan 17, 2019

k8s-ci-robot added the approved label Jan 17, 2019

k8s-ci-robot merged commit bcb6a34 into kubeflow:master Jan 17, 2019
