-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revamp how Tekton pipelines to run notebooks work. #703
Conversation
Notebook tests should build a docker image to run the notebook in. * kubeflow#613 currently the way we run notebook tests is by firing off a K8s job on the KF cluster which runs the notebook. * The K8s job uses init containers to pull in source code and install dependencies like papermill. * This is a bit brittle. * To fix this we will instead use Tekton to build a docker image that takes the notebook image and then adds the notebook code to it. * Dockerfile.notebook_runner dockerfile to build the test image. The pipeline to run the notebook consists of two tasks 1. A Tekton Task to build a docker image to run the notebook in 1. A tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster. Here's a list of changes to make this work * tekton_client should provide methods to upload artifacts but not parse junits * Add a tekton_client method to construct the full image URL based on the digest returned from kaniko * Copy over the code for running the notebook tests from kubeflow/examples and start modifying it. * Create a simple CLI to wait for nomos to sync resources to the cluster * This is used in some syntactic sugar make rules to aid the dev-test loop The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF deployments don't have proper GSA's to write to GCS. Related to: kubeflow#613
…nning under python2.
/test all |
/lgtm |
/test all |
/lgtm |
@@ -48,7 +48,7 @@ spec: | |||
value: /workspace/kubeconfig | |||
- name: PYTHONPATH | |||
value: /workspace/$(inputs.resources.testing-repo.name)/py | |||
image: gcr.io/kubeflow-ci/test-worker-py3:6f0d932-dirty@sha256:06ebe5412d638e3e51bdd792aecbafdc4ee1e7146ff367a7be346cd726738cbb | |||
image: gcr.io/kubeflow-ci/test-worker-py3:3780b5d-dirty@sha256:4a766d6f5cc6cbcb00dbc96205f7a5b2816bc5f2b6d516fd67124d4a3e6508ea |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it work with building these images? Are the git sha hard coded or are these auto updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The images are currently built using skaffold. We use a CLI option with skaffold to emit the URL of the image to a json file. We then use kpt to change the images to point at the new image. There's a make rule to provide syntactic sugar to string these commands together.
Ideally we would automate this so that on postsubmit new images would be automatically built and a PR opened to update all the images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for the explanation!
@@ -0,0 +1,62 @@ | |||
// nomos-wait is a simple tool to wait until nomos has been sync'd to the current commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is noms and what is it used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nomos is Google's GitOps tooling
https://cloud.google.com/anthos-config-management/docs
We need to wait for the actual K8s resources to be sync'd from the Git repo to the cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
expected := "79629ca7" | ||
if commit != expected { | ||
t.Errorf("Got commit: %v; want %v", commit, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit/question: could EqualErrorf
be used instead?
@@ -5,17 +5,10 @@ that we use to run a bunch of our test and release scripts. | |||
|
|||
## To update the test worker images used in the Tekton tasks | |||
|
|||
1. Build a new image. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this done now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow; its done using the make command
listed here which performs the steps I mentioned above and which were listed here in the README.
# The YAML is modified by nb_test_util.py to generate a Job specific | ||
# to a notebook. | ||
# | ||
# TODO(jlewi): We should switch to using Tekton |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand this PR set is up to use Tekton so this should be fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite. This is about how we run the notebook on the actual KF cluster. We are currently using a K8s job; we could potentially use a Tekton task if we install Tekton on KF clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had misunderstood it, thanks for the clarification!
"""Get a stack driver link for the job with the specified name.""" | ||
logs_filter = f"""resource.type="k8s_container" | ||
labels."k8s-pod/job-name" = "{job_name}" | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to do that. """ is a literal string. So if we indent line 25 we will end up including some extra whitespace in the value which we don't necessarily want.
@@ -15,34 +15,44 @@ spec: | |||
- name: test-target-name | |||
value: manual-testinig | |||
- name: artifacts-gcs | |||
value: gs://kubeflow-ci-deployment/gabrielwen-testing-2 | |||
value: gs://kubeflow-ci_temp/jlewi_mnist_testing/2020-0619 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be used as default or should we change, looks like something used during development.
tekton/runs/nb-test-run.yaml
Outdated
- name: testing-repo | ||
resourceSpec: | ||
type: git | ||
params: | ||
- name: url | ||
value: https://github.com/kubeflow/testing.git | ||
value: https://github.com/jlewi/testing.git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this point to Kubeflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing-repo shouldn't be a resource anymore.
/test all It was the py unittests that failed. |
Tests passed on retry; looks like some flake. I couldn't find any pod logs for the failed run.
So not sure what happened. |
New changes are detected. LGTM label has been removed. |
@NikeNano I addressed all your comments; PTAL. |
from being copied and thus the results from showing up in testgrid see kubeflow#703
LGTM |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@NikeNano you need to use the chatbot command "/lgtm" |
Notebook tests should build a docker image to run the notebook in.
Tekton tasks and pipelines to run notebooks and generate reports #613 currently the way we run notebook tests is
by firing off a K8s job on the KF cluster which runs the notebook.
The K8s job uses init containers to pull in source code and install
dependencies like papermill.
This is a bit brittle.
To fix this we will instead use Tekton to build a docker image that
takes the notebook image and then adds the notebook code to it.
The pipeline to run the notebook consists of two tasks
A Tekton Task to build a docker image to run the notebook in
A tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster.
Here's a list of changes to make this work
tekton_client should provide methods to upload artifacts but not parse
junits
Add a tekton_client method to construct the full image URL based on
the digest returned from kaniko
Copy over the code for running the notebook tests from kubeflow/examples
and start modifying it.
Create a simple CLI to wait for nomos to sync resources to the cluster
The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF
deployments don't have proper GSA's to write to GCS.
Related to: #613