
Apps CD scripts need to be updated to work with kustomize refactor #631

Closed
jlewi opened this issue Mar 24, 2020 · 31 comments

@jlewi
Contributor

jlewi commented Mar 24, 2020

Design Doc: http://bit.ly/kf_kustomize_v3

We will likely refactor the kustomize packages to create a more kustomize-native experience.

To do that we will need to update our apps CD scripts to specify the locations of the new images.
https://github.com/kubeflow/testing/blob/master/apps-cd/applications.yaml

We should likely keep the locations of the new and old packages so they stay in sync until we get rid of the old packages.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

kind/feature (probability: 0.88)


@jlewi
Contributor Author

jlewi commented Jun 2, 2020

@Jeffwan @terrytangyuan FYI; another engprod issue that we will want to fix in order to automatically update the manifests with newly built images.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

area/engprod (probability: 0.64)


@pingsutw
Member

pingsutw commented Jun 2, 2020

Hi @jlewi, I'm Kevin Su.
I would like to try fixing this one!

@pingsutw
Member

pingsutw commented Jun 2, 2020

/assign

@jlewi
Contributor Author

jlewi commented Jun 3, 2020

Thanks @pingsutw this would be a huge help!

In case it's helpful: here's an example of an auto-generated PR to update the manifest for the Jupyter web app:
kubeflow/manifests#1221

The file that is getting updated is
https://github.com/kubeflow/manifests/blob/master/jupyter/jupyter-web-app/base/kustomization.yaml

We also need to update this file:
https://github.com/kubeflow/manifests/blob/master/jupyter/jupyter-web-app/base_v3/kustomization.yaml
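
For illustration, the change the CD script needs to make in both files is essentially bumping the tag in the kustomize images: block. Below is a minimal sketch of that update (not the actual create_manifests_pr.py code), assuming a standard kustomize images entry with name/newTag; the image name and tag are placeholders:

```python
import yaml  # PyYAML


def bump_image_tag(kustomization_path, image_name, new_tag):
    """Set newTag for image_name in a kustomization.yaml images: block."""
    with open(kustomization_path) as f:
        doc = yaml.safe_load(f)
    for image in doc.get("images", []):
        if image.get("name") == image_name:
            image["newTag"] = new_tag
    with open(kustomization_path, "w") as f:
        yaml.safe_dump(doc, f, default_flow_style=False)


# Keep the old and new package locations in sync until the old ones are removed.
for path in ("jupyter/jupyter-web-app/base/kustomization.yaml",
             "jupyter/jupyter-web-app/base_v3/kustomization.yaml"):
    bump_image_tag(path, "gcr.io/kubeflow-images-public/jupyter-web-app",
                   "vmaster-g1234567")  # placeholder tag for illustration
```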

@jlewi
Contributor Author

jlewi commented Jun 8, 2020

@pingsutw How's this going?

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

area/jupyter (probability: 0.50)


@pingsutw
Member

pingsutw commented Jun 8, 2020

@jlewi Sorry for the late reply.
Thanks for the help.
I just sent a PR; could you help me check whether it's correct?

@jlewi
Contributor Author

jlewi commented Jun 9, 2020

@pingsutw You can check
https://github.com/kubeflow/manifests/pulls/kubeflow-bot
and see if updated PRs are being created.

@jlewi
Contributor Author

jlewi commented Jun 15, 2020

@pingsutw How's this going?

@pingsutw
Member

pingsutw commented Jun 16, 2020

Sorry @jlewi, I thought this was already done in #673, so I have been working on other Kubeflow issues recently.

If I understand correctly, https://github.com/kubeflow/testing/blob/master/apps-cd/applications.yaml is used to update the Docker tags in https://github.com/kubeflow/manifests.

So I'm not sure which manifests also need to be updated and added to applications.yaml.
Could you give me some advice? I'm glad to work on this PR.

@jlewi
Contributor Author

jlewi commented Jun 16, 2020

Thanks @pingsutw; we will need to create updates for most of the applications. You can get the list here:
kubeflow/kubeflow#5022

I went through the file and compared it to the manifests; it looks like we just need a couple more to be updated.

Also, it looks like the unit tests are failing for the updated PRs, e.g.:
kubeflow/manifests#1253

Could you try cloning one of the auto-update PRs locally and running the unit tests to see why they are failing?

@pingsutw
Member

@jlewi gotcha, it's very clear to me.
Let's do it.

@andreyvelich
Member

Hi @pingsutw!
I was testing this branch: https://github.com/kubeflow-bot/manifests/tree/update_pytorch-operator_vmaster-gd596e904.
We need to run make generate to update the test files.
After that, I was able to run make test, so the unit tests should be successful.

@jlewi Do you know how we can trigger kubeflow-bot to run make generate on the appropriate branch?

@pingsutw
Member

Hi, I just tested the branch from kubeflow/manifests#1253 as well.
Commands to reproduce:

git clone https://github.com/kubeflow/manifests.git
cd manifests
git fetch origin pull/1253/head:test
git checkout test

cd tests
make generate-changed-only

I found that make generate-changed-only will modify the Docker tag back to the original version:

$ git diff

diff --git a/tests/go.mod b/tests/go.mod
index 899b522..7155915 100644
--- a/tests/go.mod
+++ b/tests/go.mod
@@ -3,7 +3,6 @@ module github.com/kubeflow/manifests/tests
 go 1.13
 
 require (
-       github.com/PuerkitoBio/purell v1.1.1 // indirect
        github.com/ghodss/yaml v1.0.1-0.20190212211648-25d852aebe32
        github.com/gogo/protobuf v1.2.2-0.20190730201129-28a6bbf47e48 // indirect
        github.com/golang/protobuf v1.3.2 // indirect
diff --git a/tests/stacks/examples/alice/test_data/expected/apps_v1_deployment_pytorch-operator.yaml b/tests/stacks/examples/alice/test_data/expected/apps_v1_deployment_pytorch-operator.yaml
index 8897df4..a8ad07d 100644
--- a/tests/stacks/examples/alice/test_data/expected/apps_v1_deployment_pytorch-operator.yaml
+++ b/tests/stacks/examples/alice/test_data/expected/apps_v1_deployment_pytorch-operator.yaml
@@ -40,6 +40,6 @@ spec:
           valueFrom:
             fieldRef:
               fieldPath: metadata.name
-        image: gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g047cf0f
+        image: gcr.io/kubeflow-images-public/pytorch-operator:vmaster-gd596e904
         name: pytorch-operator
       serviceAccountName: pytorch-operator
diff --git a/tests/stacks/generic/test_data/expected/apps_v1_deployment_pytorch-operator.yaml b/tests/stacks/generic/test_data/expected/apps_v1_deployment_pytorch-operator.yaml

@andreyvelich Did you have the same problem?

@andreyvelich
Member

@pingsutw I think you just need to run make generate using this branch: https://github.com/kubeflow-bot/manifests/tree/update_pytorch-operator_vmaster-gd596e904.

You can run make generate-changed-only if you modify any of the manifests files, but in the update_pytorch-operator_vmaster-gd596e904 branch all the files are already changed.
You just need to update the test files for them.

After updating the test files in the kubeflow-bot branch, you should be able to run make test.

@andreyvelich
Member

@pingsutw Do you know how we can manually update branch: https://github.com/kubeflow-bot/manifests/tree/update_tf_operator_vmaster-ga2ae7bff ?

@pingsutw
Member

@pingsutw I think you just need to run make generate using this branch: https://github.com/kubeflow-bot/manifests/tree/update_pytorch-operator_vmaster-gd596e904.

I did it as well.

If I use make generate, it also modifies the Docker tag back to the original version.
It shouldn't change any local files.

@pingsutw Do you know how we can manually update branch: https://github.com/kubeflow-bot/manifests/tree/update_tf_operator_vmaster-ga2ae7bff ?

Sorry, I don't know how to update that branch

@jlewi
Contributor Author

jlewi commented Jun 16, 2020

@pingsutw @andreyvelich we don't want to manually update the branch. I think we need to fix the update code to run the proper command.

util.run(["make", "generate-changed-only"], tests_dir)

So you could change that to make generate, but there might be something else going on, because make generate-changed-only should be doing the same thing as make generate:
https://github.com/kubeflow/manifests/blob/f68a7fbb2917d450b685da800d3d4612d14536d2/tests/Makefile#L43

The all parameter isn't being used anymore:
https://github.com/kubeflow/manifests/blob/f68a7fbb2917d450b685da800d3d4612d14536d2/hack/generate_tests.py#L113
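
For reference, the one-line change being suggested would look roughly like this; it is just the call quoted above with the make target swapped, and everything else in create_manifests_pr.py stays as is:

```python
# Current behavior: only regenerates expected test data for files git reports as changed.
util.run(["make", "generate-changed-only"], tests_dir)

# Suggested experiment: regenerate all expected test data.
util.run(["make", "generate"], tests_dir)
```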

@jlewi
Contributor Author

jlewi commented Jun 16, 2020

It looks to me like the PR
https://github.com/kubeflow/manifests/pull/1253/files
is trying to change the image to vmaster-gd596e904.

So we need to update the expected value for the test, which is in this file:
https://github.com/kubeflow/manifests/blob/f68a7fbb2917d450b685da800d3d4612d14536d2/tests/stacks/examples/alice/test_data/expected/apps_v1_deployment_pytorch-operator.yaml#L43

make generate should take care of that, but it's not.

@pingsutw I'm a little confused by your diff, because it looks like make generate-changed-only is modifying the test data and setting the image to vmaster-gd596e904:

-        image: gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g047cf0f
+        image: gcr.io/kubeflow-images-public/pytorch-operator:vmaster-gd596e904

So it looks like make generate is doing the correct thing but the PR doesn't include this change.

@jlewi
Contributor Author

jlewi commented Jun 16, 2020

@pingsutw I wonder if for some reason make generate is modifying the file but the file isn't getting included in the commit (e.g. ignored by .gitignore)?

util.run(["git", "commit", "-a", "-F", message_file], cwd=manifests_repo)

You might want to look into getting viewer access to the stackdriver logs so you can see what happened.
https://github.com/kubeflow/testing/tree/master/release-infra
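
One plausible cause, if the regenerated expected test_data file is brand new rather than modified: git commit -a only stages files git already tracks, so newly generated files would be silently left out of the auto-update commit. A hypothetical fix in the update script (a sketch, not the actual code) would be to stage everything explicitly before committing:

```python
# Hypothetical sketch: stage all changes, including untracked files, so newly
# generated expected test_data files end up in the auto-update commit.
util.run(["git", "add", "-A"], cwd=manifests_repo)
util.run(["git", "commit", "-F", message_file], cwd=manifests_repo)
```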

@pingsutw
Member

Sorry, I was wrong. I didn't notice that the path is tests/stacks/examples/alice/test_data/....

@pingsutw I wonder if for some reason make generate is modifying the file but the file isn't getting included in the commit (e.g. ignored by .gitignore)?

Yep, there is a file that isn't getting included in the commit.
Thanks @jlewi, I will try to address this issue.

@andreyvelich
Member

andreyvelich commented Jun 16, 2020

@pingsutw @andreyvelich we don't want to manually update the branch. I think we need to fix the update code to run the proper command.

util.run(["make", "generate-changed-only"], tests_dir)

So you could change that to make generate, but there might be something else going on, because make generate-changed-only should be doing the same thing as make generate:
https://github.com/kubeflow/manifests/blob/f68a7fbb2917d450b685da800d3d4612d14536d2/tests/Makefile#L43

The all parameter isn't being used anymore:
https://github.com/kubeflow/manifests/blob/f68a7fbb2917d450b685da800d3d4612d14536d2/hack/generate_tests.py#L113

@jlewi Yes, you are right.
I also tried to run make generate-changed-only on this branch and all the required tests were generated.

How can we check that this script (https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/cd/create_manifests_pr.py) was run properly?

@jlewi
Contributor Author

jlewi commented Jun 17, 2020

@andreyvelich @pingsutw

How can we check that this script (https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/cd/create_manifests_pr.py) was run properly?

You would need to get access to the GCP release infrastructure. The GCP project it is running in is here:
https://github.com/kubeflow/testing/tree/master/apps-cd#run-a-pipeline

I think automation of access control to the release infra is a bit wanting.
We currently use this group to manage the release team
https://github.com/kubeflow/internal-acls/blob/master/release-team.members.txt

So you could add yourself there and then ping me to sync the groups after the PR is merged.

@andreyvelich
Member

@andreyvelich @pingsutw

How can we check that this script (https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/cd/create_manifests_pr.py) was run properly?

You would need to get access to the GCP release infrastructure. The GCP project it is running in is here:
https://github.com/kubeflow/testing/tree/master/apps-cd#run-a-pipeline

I think automation of access control to the release infra is a bit wanting.
We currently use this group to manage the release team
https://github.com/kubeflow/internal-acls/blob/master/release-team.members.txt

So you could add yourself there and then ping me to sync the groups after the PR is merged.

Sure, thanks. I created the PR: kubeflow/internal-acls#266

@jlewi
Contributor Author

jlewi commented Jun 17, 2020

I tried to dig in a little bit. I'm having a bit of trouble locating the runs.

The Tekton dashboard is crashing:
https://kf-releasing-0-6-2.endpoints.kubeflow-releasing.cloud.goog/tekton/#/pipelineruns

kubectl --context=kf-releasing -n kf-releasing get taskruns --sort-by=".metadata.creationTimestamp"

The youngest pipeline run is 141 days old. I'm going to try GC'ing the old runs to see if that helps.

@jlewi
Contributor Author

jlewi commented Jun 17, 2020

kubectl --context=kf-releasing -n kf-releasing delete pipelineruns --all=true

Looks like they are currently running in the kf-releasing-dev namespace.

kf-releasing-dev   cd-tf-operator-a2ae7bffp672s           True        Succeeded                         2d18h       2d18h

@jlewi
Contributor Author

jlewi commented Jun 17, 2020

Clean up kf-releasing-dev:

kubectl --context=kf-releasing -n kf-releasing-dev delete pipelineruns --all=true
kubectl --context=kf-releasing -n kf-releasing-dev delete taskruns --all=true

@jlewi
Contributor Author

jlewi commented Jun 17, 2020

Here are the logs for one of the pods creating the PR
cd-tf-operator-a2ae7bffp672s-update-manifests-vsxm5-pod-hqssr.log.txt

It looks like make generate-changed-only is failing because the jinja2 library is missing from the container.

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 17, 2020
* The scripts to generate the tests now depend on the jinja2 library but
  it's not in the container.

* Add some docs for debugging.

* Related to kubeflow#631
@jlewi
Contributor Author

jlewi commented Jun 17, 2020

@pingsutw We will also want to create a v1.1.0 config

- name: v1-0

This should point at the v1.1 branches of the applications and the v1.1 branch of the manifests.
The tag specified in the config is used as a prefix of the built image; that's how we affix a "v1.1.0" tag to the images.

/cc @andreyvelich
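
To illustrate the tagging (the exact scheme here is an assumption inferred from tags seen in this thread, e.g. vmaster-gd596e904, not a confirmed spec): the configured name becomes the version prefix and the short git SHA is appended.

```python
def image_tag(version_prefix: str, git_sha: str) -> str:
    """Build a tag like 'vmaster-gd596e904': '<prefix>-g<short sha>' (assumed scheme)."""
    return f"{version_prefix}-g{git_sha[:8]}"


print(image_tag("vmaster", "d596e904deadbeef"))  # vmaster-gd596e904
print(image_tag("v1.1.0", "a2ae7bffcafef00d"))   # v1.1.0-ga2ae7bff
```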

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 17, 2020
* The scripts to generate the tests now depend on the jinja2 library but
  it's not in the container.

* Add some docs for debugging.

* Related to kubeflow#631

* Catch FileNotFoundErrors
  * The problem is that on master the location of some of the kustomize
    manifests changes (e.g. v3 versions) but for the v1.0 branches
    these paths won't exist. So we should just catch these errors
    and continue.

* Update the docker image used by the Tekton pipeline because we need jinja2.
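
The "Catch FileNotFoundErrors" bullet above could translate to something like the following (a sketch only, not the actual diff): paths that only exist on master are skipped when processing older branches. update_kustomization and the example paths are hypothetical names used for illustration.

```python
import logging


def update_kustomization(path):
    """Hypothetical stand-in for the real update logic; opening the file is
    where FileNotFoundError would be raised for paths missing on a branch."""
    with open(path):
        pass  # the real code would rewrite the image tag here


kustomization_paths = [
    "jupyter/jupyter-web-app/base/kustomization.yaml",
    "jupyter/jupyter-web-app/base_v3/kustomization.yaml",  # may not exist on v1.0 branches
]
for kustomization_path in kustomization_paths:
    try:
        update_kustomization(kustomization_path)
    except FileNotFoundError:
        logging.warning("Skipping %s; it does not exist on this branch.",
                        kustomization_path)
```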
k8s-ci-robot pushed a commit that referenced this issue Jun 18, 2020
* The scripts to generate the tests now depend on the jinja2 library but
  it's not in the container.

* Add some docs for debugging.

* Related to #631

* Catch FileNotFoundErrors
  * The problem is that on master the location of some of the kustomize
    manifests changes (e.g. v3 versions) but for the v1.0 branches
    these paths won't exist. So we should just catch these errors
    and continue.

* Update the docker image used by the Tekton pipeline because we need jinja2.
@jlewi jlewi closed this as completed Jul 5, 2020