Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable parallel flag in bats #1128

Merged
merged 9 commits into from
Jul 23, 2021
Merged

Enable parallel flag in bats #1128

merged 9 commits into from
Jul 23, 2021

Conversation

timja
Copy link
Member

@timja timja commented Jun 25, 2021

Co-authored-by: Damien Duportal damien.duportal@gmail.com

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@timja
Copy link
Member Author

timja commented Jun 25, 2021

Error: Cannot execute "10" jobs without GNU parallel

@dduportal
Copy link
Contributor

@timja aaaah sorry I opened #1130. Can you merge the part of my work that would look interesting and we stay here?

@dduportal dduportal mentioned this pull request Jun 26, 2021
6 tasks
@timja
Copy link
Member Author

timja commented Jun 26, 2021

@dduportal I've updated the AMI's on ci.jenkins.io, deleted all the idle agents.

but there's plenty in use, so will probably try this later on

@dduportal
Copy link
Contributor

@timja as you might have seen, some changes had to be done to the tests. These changes might need to be extended though (I might have missed some):

  • In the same idea as INFRA-3016 Pre-create volume for permissions propogation update #1125, all the docker run setting a custom user and/or using a volume where adapted to use an anonymous data volume, to not depend on the user/uid on the host
    • Anonymous to avoid having to define a name (due to the risks of collision while UUID are... unique )
    • The goal of the volumes are always to check the plugins within. The function related to unziping a plugin using a 2nd docker run step, the fix was easy
  • There are a some flags changed from short form to long form (improve readability and decrease the knowledge bar for other contributors). It's not exhaustive (I tried to stay on the instructions I changed)
  • shellcheck and code simplification (wrapping in bash -c is only needed when there is a piped instruction to be run on the subshell, otherwise wrapping in a bash -c after the run instruction only harms readability and makes sub instructions harder to check for shellcheck

@dduportal
Copy link
Contributor

Performances informations (while we are working on making this pass on the CI + check performances in CI):

Benchmarked on my MacBookPro, i7 2,6 Ghz 6c/12t (e.g. 12 parallel jobs), 16 Go, local NVME, with Docker Desktop in a 2 vCPUs / 4Gb virtual machine

  • Only running the make test-alpine suite at first, to evaluate if there is an impact
  • A first build is done to cache the initial image, as the focus is on the test step. It also serves as validating that the whole test suite is passing before the benchmark
  • Ran 3 times the test harness with Bats v1.3.0 from the head of chore: Bump Bats to 1.3.0 #1127, with a docker system prune --volumes --force between each try: got 6:04, 6:24 and 6:28 (:)
  • Ran 3 times the test harness from commit c55193e (head of current PR when writing this comment): got 2:04, 1:57 and 1:59

=> The impact is 3x faster when only running the Alpine test suite.

It means that we can gain something, but we have to evaluate now:

  • What happen when running ALL test suites on a single machine (as the pipeline defines): what is the threshold before hitting resource contention that would deserve the benefit of parallelization
  • What happen specifically on ci.jenkins.io environment

@dduportal
Copy link
Contributor

dduportal commented Jun 26, 2021

Ok, todo list:

  • Fix the parallel build "from scratch" (it was invisible on our dev machines because we already ran the test suite before):
    • All the tests have a dependency on the first test of each suite, which builds the image. The make build is not setting a name. Seems like it can be simplified in the makefile as it is a requirement.
  • If we validate that it works:
    • We should provide a kill switch for parallelization: if parallel is not installed, or if the executor defines the variable NO_PARALLELIZATION (e.g. NO_PARALLELIZATION=true make test)
    • Doc update
    • Parallelize builds in the makefile as it's also quite slow
    • State what we do for Windows (separate issue/PR)? As bash, make, docker an bats can run in Windows, maybe there something to avoid maintaining 2 different test suites

@dduportal
Copy link
Contributor

dduportal commented Jun 26, 2021

@timja I don't have the rights to push to your branch (I assume because I'm not maintainer of the repo here), so I pushed a commit to test here: dduportal@04a3ac1 nevermind, you gave me access, commit pushed here

@dduportal dduportal force-pushed the parallel-bats branch 4 times, most recently from 53fd06a to 0e372d2 Compare June 28, 2021 11:45
@dduportal dduportal force-pushed the parallel-bats branch 6 times, most recently from f08e8c4 to f867ff5 Compare July 12, 2021 19:02
@dduportal
Copy link
Contributor

Current status:

  • For each "Linux branch" of the pipeline, there is a different agent which checkouts, build and test the linux distribution.

  • Parallelization works for Linux: all the linux platforms are tested in less than 3 min (from 0:25min to 2:40min) instead or around ~ 6 min. it's a gain of twice without any further optimization (1 test per core available to Docker) => it's worht the effort

    • No time penalty for the build: current master branch builds all linux in ~1m30s, while here we build between 0:25min up to 1:56min)
    • The only downside is the agent allocation time that could slow down the overall process
  • All tests are independant from each other: some tests where not making sense or had duplicated assertions with others tests so it has been removed

  • Windows is the slowest part now and is constraining the whole process: both windows builds and windows tests are NOT parellelized. I propose to work on a subsequent PR for this: it means that the current PR won't show immediatly its effects BUT it will be used to ensure there is no regression

timja and others added 2 commits July 13, 2021 11:26
- Split Pipeline parallel branches per linux platform (1 agent per platform for build + test)
- Makefile support: dependencies, full usage of docker-bake for image names and bats support for `--jobs`

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
- Disable test parallel execution if `docker` of GNU `parallel` are absent
- Allow developers to disable parallel execution by setting the variable `DISABLE_PARALLEL_TESTS` to `true`
- Update documentation on `HACKING.adoc`

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Copy link
Contributor

@timja @MarkEWaite @slide @saper @olivergondza I don't have the maintainers rights on this repo so I cannot put it as "ready for review", neither edit the message so I'm mentionning you here.

TL;DR;

  • This PR is ready for review ✅ 👀
  • The performance increase is around 2x faster for running all the tests on a given linux platform (compared to master branch)
  • There is a change introduced in the pipeline to spawn an agent per linux platform. Each agent hosts the "build" AND the "test" phases for a given platform.
    • It ensures faster feedback (test benefits from the image already cached) without putting too much pressure on machines (JNLP processes tend to hang when CPUs is used a lot)
  • It's restricted to Linux only: the outcomes in term of "pipeline speed" are only appreciable on a given branch, while Windows is still slowing down the whole process. Expect another PR for Windows only.
  • Parallel execution of tests:
    • can be disabled by an environement variable as an escape hatch (to support wider audience of contributors)
    • is disabled if the prerequisites are not met (to support wider audience of contributors)
    • is executed in batch of 2*(amount of vCPUs available for the Docker Engine) (there is a rough ~15% gain from 1x to 2x because tests are mainly network bound, but going to 3x only gain 2-3s so better leveraging the pressure on CPU cors)
  • Documentation is updated for this change and the escape hatches 🥳

(ref. #1120 (comment))

@MarkEWaite MarkEWaite marked this pull request as ready for review July 13, 2021 13:34
@MarkEWaite MarkEWaite requested a review from a team as a code owner July 13, 2021 13:34
@dduportal dduportal mentioned this pull request Jul 13, 2021
6 tasks
@saper
Copy link
Collaborator

saper commented Jul 14, 2021

New attempt to run make build on Debian 10 VM with docker 20 gives me error: failed to solve: rpc error: code = Canceled desc = grpc: the client connection is closing again.

@saper
Copy link
Collaborator

saper commented Jul 15, 2021

I have completed make build with 5f03037 on Fedora 33, but make test fails with for all images:

make[1]: Entering directory '/home/saper/space/docker'
git submodule update --init --recursive
mkdir -p target
parse error: Invalid numeric literal at line 1, column 5
make[3]: *** [Makefile:44: build] Error 255
make[2]: *** [Makefile:47: list] Error 4
parse error: Invalid numeric literal at line 1, column 5
make[3]: *** [Makefile:44: build] Error 255
make[2]: *** [Makefile:47: list] Error 4
Error: the image 'alpine_jdk8' does not exist in manifest. Please check the output of make[2]: Entering directory '/home/saper/space/docker'
make[2]: Leaving directory '/home/saper/space/docker'. Exiting.

I noticed also that make clean does not clean everything, even if I remove docker images things get re-generated quickly from some build results.

@saper
Copy link
Collaborator

saper commented Jul 15, 2021

Changing the list target to:

list: check-reqs
        @make --silent -C $(CURDIR) build BAKE_ADDITIONAL_FLAGS=--print | tee /tmp/list1 | jq -r '.target | keys[]' | tee /tmp/list2

shows that "make list" runs fine, but in "make test" one gets

make[3]: Entering directory '/home/saper/space/docker'

in /tmp/list1 file.

Personally I don't like using $(call) and $(shell) in the Makefiles. I think the @for loop could be replaced for example with a while shell loop:

         @make --silent list | while read image do make "test-$${image}"; done

(that makes inserting "tee" for debugging easy), and then @make --silent list | tee /tmp/list3 | ... confuses me totally provides normal list of targets despite /tmp/list1 and /tmp/list2 being wrong!

Changing another line to

      check_image = make --silent list | tee /tmp/list4 | grep -q -w '$(1)' || { echo "Error: the image '$(1)' does not exist in manifest. Please check the output of `make list`. Exiting." ; exit 1 ; }

(I think we can add -q and get rid of grep redirection) produces

make[2]: Entering directory '/home/saper/space/docker'
make[2]: Leaving directory '/home/saper/space/docker'

in /tmp/list4

This is crazy, we should get rid of all this pseudo-make-shell programming...

@dduportal
Copy link
Contributor

Hello @saper, thanks for your feedback.

  • Have you tried with the latest commit (e.g. fb7bb3c)? If no can you try, because the error message parse error: Invalid numeric literal at line 1, column 5 should have been fixed (it was an error from me)?
  • Could you explain why you are adding tee commands? I'm not sure to understand what is the problem, and how does it helps?

@dduportal
Copy link
Contributor

@saper thanks for the reminder for using while instead of for in the loop. I reused for since it was already there (one change at a time), but your proposal with while should be more solid and less prone to special character/space issues.

Let's tackle one problem after the other.

@saper
Copy link
Collaborator

saper commented Jul 15, 2021

  • Have you tried with the latest commit (e.g. fb7bb3c)? If no can you try, because the error message parse error: Invalid numeric literal at line 1, column 5 should have been fixed (it was an error from me)?

Yes, I am on fb7bb3c trying to make test on a previously successful make build.

  • Could you explain why you are adding tee commands? I'm not sure to understand what is the problem, and how does it helps?

Most probably the parse error: Invalid numeric literal at line 1, column 5 comes from jq. so I use tee to see what input did jq receive and what came after it. This is just interim logging to troubleshoot those issue. What wonders me why make list comes clean but invoking it via recursive make does not.

I am using GNU make 4.3

@dduportal
Copy link
Contributor

Good, thanks for confirming. I had this issue before, when I forgot the flag --silent when calling make recursively. It sounds like it can be different, due to something else. Can you try (I'll do the same) without the -C $(CURDIR) which I added only by reflex but that could be avoided somehow?

I'm currently using make 3.81 on macOS: let me try to run the same in Debian with the same make 4.3 as yours.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Copy link
Contributor

@saper thanks for the explanation. Never thourgh to use tee to debug a pipeline like this, great discovery, thanks for sharing 👍

I was able to reproduce the issue on a local Debian 10 with Make 4.2.1.

  • There were 2 backticks characters leftovers in a user message, which was triggering unexpected sub shells.
  • Each sub call to make list are now using the flag --silent to avoid stdout pollution
  • Your suggestion about the shell loop (for -> while had been applied) so it should be less fragile
  • Removed the -C $(CURDIR) flags that are unused and having different behaviors between make 3.x and make 4.x

Can you retry with 3e9ef8b please?

build:
docker buildx bake -f docker-bake.hcl --set '*.platform=linux/amd64' --load linux

build-arm64:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can I build arm images now using make? I'm unable to run the default make build because alpine doesn't have an arm image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmh good point. We might be able to retrieve the amount of CPUs from the output of docker info instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed timja#1

@timja
Copy link
Member Author

timja commented Jul 18, 2021

I've been running the tests locally and wasn't able to make them pass,

I eventually tracked it down to:
https://docs.docker.com/docker-for-mac/release-notes/#known-issues

curl is broken on debian:buster and some other platforms on Mac M1 with docker

@dduportal
Copy link
Contributor

@timja thanks for testing! Good catch with curl. How would you want to go forward from here? Expliciting that we do not support ARM for testing platform temporarly, or find another solution such as getting a containerized curl or something?

@timja
Copy link
Member Author

timja commented Jul 18, 2021

@timja thanks for testing! Good catch with curl. How would you want to go forward from here? Expliciting that we do not support ARM for testing platform temporarly, or find another solution such as getting a containerized curl or something?

I think it's just an issue with docker on mac m1, so for now I would document it and ignore, which I've done in timja#1

Co-authored-by: Damien Duportal <damien.duportal@gmail.com>
@timja timja requested a review from dduportal July 20, 2021 15:31
@timja
Copy link
Member Author

timja commented Jul 20, 2021

Any thing left on this @dduportal?

@jenkinsci/team-docker-packaging please review

@dduportal
Copy link
Contributor

@timja weird: the build fails for all images. Let's check.

@timja
Copy link
Member Author

timja commented Jul 20, 2021

Make version issue? It passes for me on my mac

@dduportal
Copy link
Contributor

dduportal commented Jul 20, 2021

yep, that is what I thought. Trying reproduction with:

$ docker run --privileged --detach -v $(pwd):$(pwd) -w $(pwd) --name dind docker:20.10-dind
$ docker exec -ti -u root dind sh
/ # apk add --update bash curl make parallel git jq
/ # mkdir -p ~/.docker/cli-plugins/
/ # curl -sSL -o ~/.docker/cli-plugins/docker-buildx https://github.com/docker/buildx/releases/download/v0.6.0/buildx-v0.6.0.linux-amd64
/ # chmod a+x ~/.docker/cli-plugins/docker-buildx
/ # make build-alpine_jdk8

But it works 🤔

@dduportal
Copy link
Contributor

Found the issue: it's the jq version (@timja and I, and Alpine 3.13/3.14 have jq v1.6 while our Ubuntu 18.04 agents have jq 1.5). Work in progress

@dduportal
Copy link
Contributor

  • Ubuntu 18.04 only provides jq 1.5. We need Ubuntu 20.04 to have jq 1.6
  • There is standalone binary distribution of jq 1.6, but it's Intel only (32 or 64). So no ARM support from bianry (only from packages)
    => if there is someone good at jq , maybe the expression we use on the list: target could be made compliant with 1.5?

@timja
Copy link
Member Author

timja commented Jul 20, 2021

Upgrading to 20.04 shouldn't be hard should it?

@dduportal
Copy link
Contributor

@timja yep, that should work nicely. Let's do it in the upcoming days.

@timja
Copy link
Member Author

timja commented Jul 20, 2021

@timja
Copy link
Member Author

timja commented Jul 21, 2021

Build is green now after updating the agents to 20.04

Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the HACKING doc to require jq 1.6 rather than 1.5, since 1.6 works for me on Ubuntu 18 amd64 while jq 1.5 does not work for me on Ubuntu 18.

I haven't been able to test on Amazon Linux with arm64 because jq 1.6 is not available there and the components to build jq 1.6 don't appear to be available either (needs a newer version of automake than is included).

HACKING.adoc Outdated Show resolved Hide resolved
@timja timja merged commit 3bebda7 into jenkinsci:master Jul 23, 2021
@timja timja deleted the parallel-bats branch July 23, 2021 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants