eks: Implement deploy script and guide for EKS #78

Merged
merged 128 commits into master on Nov 28, 2022

Conversation

@dektech
Contributor

dektech commented Sep 23, 2022

Following the discussion on #1389, opening a PR for better visibility and easier commenting/suggestions.

This branch contains a script that creates an EKS cluster with all the necessary additions, including two nodegroups: infra and plan.

It also includes a detailed, step-by-step guide.

Please feel free to go through the materials and test the script, and let us know if anything needs to be added or amended.

Review

This feature will take some time to review and test; here is our current status:

@galargh

  • Testing
    • Cluster setup went through
    • Cluster seems to be working (ran ping and storm tests)
  • Reviewing

@laurentsenta

  • Testing
    • Cluster setup went through
    • Cluster seems to be working (ran ping and storm tests)
    • Confirmed the bandwidth and latency changes were correctly applied (double-check the ping and storm test logs)
  • Reviewing

@galargh
Contributor

galargh commented Oct 20, 2022

Looks like my tasks from http://a496563c54d424aee8c6ab8e02ce2b6c-869758685.eu-west-3.elb.amazonaws.com/tasks are gone. There might be some storage misconfiguration.

@galargh
Contributor

galargh commented Oct 20, 2022

As for IPv6, let's not include it in this PR.

@galargh
Contributor

galargh commented Oct 20, 2022

@laurentsenta is going to go through all the open comments, gather the ones that have to be handled as part of this PR, and discard the rest. He will post a summary as a comment.

@laurentsenta
Contributor

@dektech @brdji thanks for the work on these PRs. Tracking all these discussions in each PR is getting tricky; here is the summary as a list of tasks: testground/testground#1499

@AbominableSnowman730
Contributor

I tried running storm with 400 instances on the default settings. After 300 pods were scheduled, I started seeing:

Oct 11 10:14:54.499138  WARN  testplan received event  {"runner": "cluster:k8s", "run_id": "cd2k0qp26un3g1004810", "event": "obj<tg-benchmarks-cd2k0qp26un3g1004810-single-303> type<MODIFIED> reason<FailedScheduling> message<0/4 nodes are available: 2 Insufficient cpu, 2 node(s) didn't match Pod's node affinity/selector.> type<Warning> count<7> lastTimestamp<2022-10-11 10:14:53 +0000 UTC>"}

It doesn't seem to back off on its own in a situation like that. Instead it keeps trying to schedule pods. Is that expected? Is there a timeout or a maximum number of retries set?

I killed the run. Then I saw a bunch of things like this in the logs:

Oct 11 10:19:08.485745  DEBUG  deleting pod  {"runner": "cluster:k8s", "run_id": "cd2k0qp26un3g1004810", "pod": "tg-benchmarks-cd2k0qp26un3g1004810-single-399"}
Oct 11 10:19:08.485800  ERROR  couldn't remove pod  {"runner": "cluster:k8s", "run_id": "cd2k0qp26un3g1004810", "pod": "tg-benchmarks-cd2k0qp26un3g1004810-single-399", "err": "context canceled"}

Now I scheduled a run with 200 instances (which worked previously) but I'm still seeing scheduling errors in the logs:

Oct 11 10:20:48.127055  WARN  testplan received event  {"runner": "cluster:k8s", "run_id": "cd2k6u126un3g100481g", "event": "obj<tg-benchmarks-cd2k6u126un3g100481g-single-194> type<ADDED> reason<FailedScheduling> message<0/4 nodes are available: 2 Insufficient cpu, 2 node(s) didn't match Pod's node affinity/selector.> type<Warning> count<1> lastTimestamp<2022-10-11 10:20:48 +0000 UTC>"}

Seems to me some cleanup step is not doing its job. I'll leave it as is for now and report back if anything else happens.

Here's a quick recap of what's happening:

  • This is a Kubernetes error (not related to Testground)
  • Our own resources check here determines that we have enough resources to run the desired number of pods
  • However, for some reason (we have not been able to discover the cause or reproduce it), Kubernetes returns the error above, and the error will persist until all pods have been handled by the Kubernetes scheduler

Should this error occur too frequently, I would suggest opening an investigation and adding a custom cleanup step to the watch func, something like:

if event.Reason == "FailedScheduling" {
    // abort the run
    // instruct k8s to terminate all run pods and stop scheduling them
}
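
For illustration, here is a minimal client-go sketch of that idea. It is hypothetical, not the actual Testground implementation: the package name, the abortRun callback, and the wiring to the run are placeholders.

package runner

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchForFailedScheduling watches the namespace's events and aborts the run
// on the first FailedScheduling event instead of letting the scheduler retry.
func watchForFailedScheduling(ctx context.Context, client kubernetes.Interface, ns string, abortRun func()) error {
	w, err := client.CoreV1().Events(ns).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		e, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		if e.Reason == "FailedScheduling" {
			// Abort the run; its normal cleanup can then delete the pods
			// and stop any further scheduling attempts.
			abortRun()
			return nil
		}
	}
	return nil
}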

@galargh
Contributor

galargh commented Nov 2, 2022

(quoting the testing report and the recap above)

I don't quite get this explanation. I understand why we can't schedule new things when all of the above happens, but why doesn't our cleanup work?

Specifically this:

Oct 11 10:19:08.485745  DEBUG  deleting pod  {"runner": "cluster:k8s", "run_id": "cd2k0qp26un3g1004810", "pod": "tg-benchmarks-cd2k0qp26un3g1004810-single-399"}
Oct 11 10:19:08.485800  ERROR  couldn't remove pod  {"runner": "cluster:k8s", "run_id": "cd2k0qp26un3g1004810", "pod": "tg-benchmarks-cd2k0qp26un3g1004810-single-399", "err": "context canceled"}

Are we doing that in the wrong context? Is it too late? Can we do something about it? I'd say that when you stop a plan, the expectation is for all the pods to be removed immediately.

@brdji
Contributor

brdji commented Nov 8, 2022

(quoting the question above)

The problem seems to be caused by the pod scheduling request sent to Kubernetes:

  • Testground requests a number of pods to be launched
  • The Kubernetes API server accepts the request and sends the pods to the scheduler to be launched one by one
  • The error causes the pods to be stopped and deleted (again, one by one)
  • Stopping the plan (e.g. using the terminate command) will only stop pods that have already been started, while the scheduling request itself remains in the k8s server/scheduler.

The couldn't remove pod error is caused by our cleanup func: it attempts to delete the pod, but the pod has already been deleted by the k8s cluster.

To summarize: while I believe we can do something about it (i.e. cancel the scheduling request once the error occurs), getting to the bottom of this issue is not easy and would require a more in-depth investigation, which is out of scope for this task.
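
For illustration, a minimal sketch of a more tolerant cleanup. This is not the actual Testground cleanup func; the package name, function name, and run-ID label key are assumptions made for the example. The idea is to run the deletes on a fresh context rather than the already-canceled run context, and to treat a pod that is already gone as success rather than an error.

package runner

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupRunPods deletes all pods belonging to a run, tolerating pods that
// the cluster has already removed.
func cleanupRunPods(client kubernetes.Interface, ns, runID string) error {
	// Use a fresh context so cancelling the run does not cancel its own
	// cleanup (the "context canceled" error seen in the logs).
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// NOTE: the label key below is an assumption for this sketch.
	opts := metav1.ListOptions{LabelSelector: "testground.run_id=" + runID}
	pods, err := client.CoreV1().Pods(ns).List(ctx, opts)
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		err := client.CoreV1().Pods(ns).Delete(ctx, p.Name, metav1.DeleteOptions{})
		if err != nil && !apierrors.IsNotFound(err) {
			// A pod already deleted by the cluster is not a cleanup failure.
			return err
		}
	}
	return nil
}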

@galargh
Contributor

galargh commented Nov 11, 2022

Thanks for the updates, feels much better now :) Here's my feedback from the most recent testing session: testground/testground#1518

Contributor

@laurentsenta left a comment

Thanks for sharing and updating the PR.
We can merge as is and close testground/testground#1499. Congrats @dektech and team!

The follow-up task list is in testground/testground#1500.

@laurentsenta changed the title from "EKS script and guide" to "eks: Implement deploy script and guide for EKS" on Nov 28, 2022
@laurentsenta merged commit d17c372 into master on Nov 28, 2022
laurentsenta added a commit to testground/testground that referenced this pull request Nov 28, 2022
This review contains all the required changes for the new EKS cluster. Most of the changes relate to network annotations, IP ranges, configuration options, etc.

Closes #1499
A related change in the infra: testground/infra#78

Co-authored-by: AbominableSnowman730 <abominablesnowman730@gmail.com>
Co-authored-by: LudiSistemas <portalscg@gmail.com>
Co-authored-by: Laurent Senta <laurent@singulargarden.com>