[CI] Add AWS EC2 dynamic runner support #6471

Merged
merged 52 commits into from
Aug 11, 2022

Conversation

apstasen
Contributor

@apstasen apstasen commented Jul 23, 2022

This adds infrastructure to spawn AWS EC2 runners dynamically for LLVM Test Suite (lts) testing. It only takes effect once "aws-type" and the related keys are added to the devops/test_configs.json configuration file, like this:

    {
      "config": "hip_amdgpu",
      "name": "HIP AMDGPU LLVM Test Suite",
      "runs-on": "aws-amdgpu_${{ inputs.uniq }}",
      "aws-ami": "ami-0ccda708841dde988",
      "aws-type": [ "g4ad.2xlarge", "g4ad.4xlarge" ],
      "aws-spot": false,
      "aws-disk": "/dev/xvda:64",
      "image": "${{ inputs.amdgpu_image }}",
      "container_options": "--device=/dev/dri --device=/dev/kfd",
      "check_sycl_all": "hip:gpu,host",
      "cmake_args": "-DHIP_PLATFORM=\"AMD\" -DAMD_ARCH=\"gfx1031\""
    },
    {
      "config": "cuda",
      "name": "CUDA LLVM Test Suite",
      "runs-on": "aws-cuda_${{ inputs.uniq }}",
      "aws-ami": "ami-02ec0f344128253f9",
      "aws-type": [ "g4dn.2xlarge", "g4dn.4xlarge" ],
      "aws-disk": "/dev/xvda:64",
      "image": "${{ inputs.cuda_image }}",
      "container_options": "--gpus all",
      "check_sycl_all": "cuda:gpu,host",
      "cmake_args": ""
    }
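
As a rough illustration of how these keys might be consumed, the sketch below separates config entries that request an AWS runner (they carry "aws-type") from those that do not, and parses the "aws-disk" value. The helper names `needsAwsRunner` and `parseAwsDisk` are hypothetical and are not taken from this PR, and reading "/dev/xvda:64" as a device name plus a size in GiB is an assumption.

```javascript
// Hypothetical helpers for illustration; not the code added by this PR.

// An entry asks for a dynamic AWS runner if it carries a non-empty "aws-type".
function needsAwsRunner(entry) {
  return Array.isArray(entry["aws-type"]) && entry["aws-type"].length > 0;
}

// Assumption: "aws-disk" has the form "<device>:<size>", with size in GiB.
function parseAwsDisk(value) {
  const idx = value.lastIndexOf(":");
  if (idx < 0) throw new Error(`unexpected aws-disk value: ${value}`);
  const sizeGiB = Number(value.slice(idx + 1));
  if (!Number.isInteger(sizeGiB) || sizeGiB <= 0) {
    throw new Error(`unexpected aws-disk size: ${value}`);
  }
  return { device: value.slice(0, idx), sizeGiB };
}

const entries = [
  { config: "hip_amdgpu", "aws-type": ["g4ad.2xlarge", "g4ad.4xlarge"], "aws-disk": "/dev/xvda:64" },
  { config: "cuda", "aws-type": ["g4dn.2xlarge", "g4dn.4xlarge"], "aws-disk": "/dev/xvda:64" },
  { config: "some_other_job" }, // hypothetical entry without any aws-* keys
];

const awsEntries = entries.filter(needsAwsRunner);
console.log(awsEntries.map((e) => e.config));
console.log(parseAwsDisk("/dev/xvda:64"));
```

Entries without "aws-type" are simply left alone here, which matches the description above that the feature is inactive until those keys are present.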

Also, please make sure that other jobs, which do not need AMD or NVIDIA GPUs, do not use overly generic self-hosted runner labels such as "Linux" or "x64". Otherwise those jobs can be scheduled on these AWS hosts, and we do not want to use them for generic workloads.
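
The concern can be illustrated with a simplified model of label routing: a self-hosted runner is eligible for a job when it carries every label the job lists in "runs-on", and self-hosted runners receive OS and architecture labels by default. The function name `runnerCanTake`, the example label "aws-cuda_1234" (standing in for an expanded `${{ inputs.uniq }}`), and the label "gen9" are hypothetical; the case-insensitive comparison is an assumption of this sketch.

```javascript
// Simplified model: a runner is eligible when it has every label the job
// lists in "runs-on". Assumption: labels are compared case-insensitively.
function runnerCanTake(runnerLabels, jobLabels) {
  const have = new Set(runnerLabels.map((l) => l.toLowerCase()));
  return jobLabels.every((l) => have.has(l.toLowerCase()));
}

// Hypothetical label set of a dynamically spawned AWS runner.
const awsRunner = ["self-hosted", "Linux", "X64", "aws-cuda_1234"];

// A job with only generic labels could land on the AWS runner.
console.log(runnerCanTake(awsRunner, ["self-hosted", "Linux", "x64"])); // true
// The job that is meant for this runner matches by its unique label.
console.log(runnerCanTake(awsRunner, ["aws-cuda_1234"])); // true
// A job that also requires a label the AWS runner lacks does not match.
console.log(runnerCanTake(awsRunner, ["self-hosted", "Linux", "gen9"])); // false
```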

An Intel-provided AWS account is supposed to be used. To configure it for this repo, please do the following (I will keep this BKM schematic to avoid disclosing any sensitive info):

  1. Log in to the Intel AWS account as admin
  2. Go to IAM users (https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/users)
  3. Click "Add users"
  4. Select "Access key - Programmatic access"
  5. Copy permissions from the existing user (sycl-ci)
  6. Get the new user's AWS key and secret key strings (keep them private until step 11).
  7. Delete the original user sycl-ci (so that I can no longer use this AWS account for the apstasen/llvm repo for test purposes)
  8. Go to https://github.com/intel/llvm/settings/secrets/actions
  9. Create an "aws" environment and make sure you select required reviewers for extra security (they need to pay special attention that PRs do not expose secrets through changes to workflow .yml and devops .js files)
  10. Create AWS_ACCESS_KEY and AWS_SECRET_KEY secrets using the new AWS IAM user key strings obtained in step 6.
  11. Destroy all copies of the AWS key and secret key strings (except the ones stored as GitHub "aws" environment secrets)
  12. Create a GH_PERSONAL_ACCESS_TOKEN secret containing a GitHub API key with "repo" permissions, either as a repository secret or, for better security, in the "aws" environment too
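
Because the new jobs depend on the secrets created in the steps above, a small guard that reports which ones are missing can make a misconfiguration easier to spot. This is only a sketch: the function name `missingSecrets` is hypothetical, and exposing the secrets to a job as environment variables with the same names is an assumption.

```javascript
// Secret names from the steps above; exposing them to the job as environment
// variables of the same name is an assumption of this sketch.
const REQUIRED_SECRETS = ["AWS_ACCESS_KEY", "AWS_SECRET_KEY", "GH_PERSONAL_ACCESS_TOKEN"];

// Returns the names of required secrets that are unset or empty.
// Only names are returned, so secret values are never printed.
function missingSecrets(env) {
  return REQUIRED_SECRETS.filter((name) => !env[name]);
}

// Placeholder environment for demonstration; no real credentials.
const sampleEnv = { AWS_ACCESS_KEY: "placeholder", AWS_SECRET_KEY: "" };
console.log(missingSecrets(sampleEnv).join(", ")); // AWS_SECRET_KEY, GH_PERSONAL_ACCESS_TOKEN
```

In a workflow step, the same function could be called with `process.env` to fail early with a readable message.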

@apstasen apstasen requested a review from a team as a code owner July 23, 2022 22:57
@bader
Contributor

bader commented Jul 24, 2022

@apstasen, do you know if it's possible to get remote access to the machines from AWS EC2 for debugging failures?

@bader bader changed the title Added AWS EC2 dynamic runner support [CI] Add AWS EC2 dynamic runner support Jul 24, 2022
@apstasen
Contributor Author

Yes, it is possible. Even with non-admin access to this Intel-provided AWS account, you can create your own SSH key pair in AWS and launch an instance from my pre-created AWS AMI (or a generic one) with that key pair and the SSH port open. After that you can access the host with a regular SSH client (you need to be outside the Intel network or use the Intel SOCKS5 proxy). I will not put specific details about this proxy here.

The AWS instances created dynamically by this PR use the "default" security group, which blocks all incoming connections, so you will not be able to access these instances over SSH. An admin can of course open the SSH port in this default security group, but that is not recommended (and not convenient, since these instances are normally short-lived).
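
For the manual debugging route described above, the launch request might be assembled roughly as follows. This sketch only builds a parameter object using field names from the EC2 RunInstances API and sends nothing to AWS. The function name `buildDebugInstanceParams`, the key pair name, and the security group ID are hypothetical placeholders, and treating the "aws-disk" number as a volume size in GiB is an assumption.

```javascript
// Builds a parameter object shaped like an EC2 RunInstances request for a
// one-off debug instance. Nothing is sent to AWS here.
function buildDebugInstanceParams(entry, keyName, securityGroupId) {
  // Assumption: "aws-disk" is "<device>:<size in GiB>".
  const [device, size] = entry["aws-disk"].split(":");
  return {
    ImageId: entry["aws-ami"],
    InstanceType: entry["aws-type"][0], // first listed type; the choice is arbitrary
    KeyName: keyName, // your own SSH key pair created in AWS
    SecurityGroupIds: [securityGroupId], // a group with the SSH port open, not "default"
    MinCount: 1,
    MaxCount: 1,
    BlockDeviceMappings: [{ DeviceName: device, Ebs: { VolumeSize: Number(size) } }],
  };
}

const cudaEntry = {
  "aws-ami": "ami-02ec0f344128253f9",
  "aws-type": ["g4dn.2xlarge", "g4dn.4xlarge"],
  "aws-disk": "/dev/xvda:64",
};

// "my-debug-key" and the security group ID are placeholders.
const debugParams = buildDebugInstanceParams(cudaEntry, "my-debug-key", "sg-0123456789abcdef0");
console.log(JSON.stringify(debugParams, null, 2));
```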

@bader
Contributor

bader commented Jul 24, 2022

Will the logs be publicly available? We have non-Intel developers who ideally should be able to debug pre-commit issues and having access to logs is highly desirable (access to HW would be ideal).

@apstasen
Contributor Author

Logs from these runners will be visible as usual in the GitHub Actions interface, so if developers can see logs from our persistent runners, they can see these logs too.

@bader bader added the disable-lint Skip linter check step and proceed with build jobs label Jul 24, 2022
@bader
Contributor

bader commented Jul 24, 2022

As I understand it, the CI linter is not supposed to be applied to JavaScript, so I suggest reverting 565732b to return to the more readable version.
I've added the "disable-lint" label, which should help to unblock pre-commit CI jobs.

@apstasen
Contributor Author

@bader OK. Restored the original formatting. Also, this PR cannot be merged until the "aws" secret environment is created (otherwise the newly added aws-start-matrix and aws-stop-matrix jobs will fail).

@bader bader requested a review from pvchupin July 26, 2022 10:31
@pvchupin pvchupin closed this Aug 4, 2022
@apstasen apstasen requested a review from pvchupin August 6, 2022 02:24
Contributor

@pvchupin pvchupin left a comment

LGTM

@pvchupin pvchupin merged commit ee781e4 into intel:sycl Aug 11, 2022
Labels
disable-lint Skip linter check step and proceed with build jobs

3 participants