Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use solver result caching in CI #1908

Merged
merged 34 commits into from
Sep 14, 2023
Merged

Use solver result caching in CI #1908

merged 34 commits into from
Sep 14, 2023

Conversation

m-yac
Copy link
Contributor

@m-yac m-yac commented Aug 14, 2023

This PR makes three changes:

  1. Enables solver result caching on the CI for the integration tests and all the s2n tests
  2. Adds a --clean-solver-cache command-line option to saw which provides a direct way to remove cache entries with out-of-date solver versions (this is run every time a cache is loaded on CI)
  3. Uses the commit hash of the last commit to modify the rme directory as the version information for rme calls, instead of the commit hash of the entire repo (this ensures cached rme calls are actually used when CI is run on a later commit which does not modify the rme directory)

The result is an average of a 1.57x speedup of the relevant parts of the CI. Because all the tests run in parallel, the only one which affects the overall runtime of the CI is the awslc test. Including solver caching consistently saves 35-40 minutes on this test. For two particular runs, the full breakdown is below – with the most dramatic speedups in bold.

intTests hmac drbg sike bike tls hmac-f awslc blst
Current 14m 1s 10m 13s 3m 12s 4m 26s 9m 14s 8m 21s 5m 11s 77m 20s 17m 21s
w/ Caching 12m 14s 6m 58s 2m 44s 4m 14s 7m 53s 2m 55s 4m 15s 41m 13s 12m 55s
Speedup 1.15x 1.47x 1.17x 1.05x 1.17x 2.86x 1.22x 1.88x 1.34x
Cache Hits 546 1518 18 913 3893 28 80 991 1117

Note that there appears to be no correlation between the number of cache hits and the speedup, likely due to the fact that a removing a few very "difficult" solver queries results in much more speedup than removing lots of "easy" solver queries.

@m-yac m-yac added the tooling: CI Issues involving CI/CD scripts or processes label Aug 14, 2023
@m-yac
Copy link
Contributor Author

m-yac commented Aug 18, 2023

I am on vacation for the rest of the month, so unfortunately I won't be able to work on this again until September. Here's a summary of the status of the changes on this PR.

Non-solver cache related

  • In an attempt to figure out why non-GHC 9.9.4 builds still take 30+ minutes even when there are no haskell code changes, I added ${{ github.ref }} to the keys for the cabal cache. Unfortunately that didn't fix it, but it seems reasonable to add: without it github could choose to restore the cabal cache from master, or from somebody else's branch, rather than from your branch.
  • I fixed a typo in the cabal cache save step where most of the key name was accidentally duplicated. For example, before the fix keys would be saved as "1-cabal-macos-12-9.4.4-02b...601-e06...60a 1-cabal-macos-12-9.4.4-02b...601- " but with the fix it would just be "1-cabal-macos-12-9.4.4-02b...601-e06...60a ".
    - key: ${{ env.CACHE_VERSION }}-cabal-${{ matrix.os }}-${{ matrix.ghc }}-${{ hashFiles(format('cabal.GHC-{0}.config', matrix.ghc)) }}-${{ github.sha }}
    -   ${{ env.CACHE_VERSION }}-cabal-${{ matrix.os }}-${{ matrix.ghc }}-${{ hashFiles(format('cabal.GHC-{0}.config', matrix.ghc)) }}-
    + key: ${{ env.CACHE_VERSION }}-cabal-${{ matrix.os }}-${{ matrix.ghc }}-${{ hashFiles(format('cabal.GHC-{0}.config', matrix.ghc)) }}-${{ github.ref }}-${{ github.sha }}

Solver caching

  • Solver caching for the integration_tests is working! It required setting SAW_SOLVER_CACHE_PATH in $GITHUB_ENV in ci.yml, and it required modifying IntegrationTest.hs to capture the value of SAW_SOLVER_CACHE_PATH when running the tests.
  • Solver caching for the s2n tests is not working yet. The complexity of this piece is that the actual tests are run on docker images. So far, I've needed to set SAW_SOLVER_CACHE_PATH in the -entrypoint.sh scripts, add the cache directory as a volume in s2nTests/docker-compose.yml, and give the cache directory write permissions in ci.yml. Despite all this seemingly working correctly based on my debugging, the cache/save action was not finding the cache directory. However, I just noticed that the step which actually runs the s2n tests happens in a different working-directory. In my latest commit I tried to fix this - so maybe the cache file will actually be saved now! 🤞

@RyanGlScott
Copy link
Contributor

figure out why non-GHC 9.9.4 builds still take 30+ minutes even when there are no haskell code changes

I don't think this is the fault of this PR. GHC 9.4+ uses a different approach to recompilation checking (see this blog post) that allows most of the local Haskell dependencies to be reused directly from the cache rather than rebuilding. Indeed, if you compare this GHC 9.4 CI job versus this GHC 9.2 CI job, the main thing that accounts for the 9.2 job taking so much longer is the Run .github/ci.sh build step, which spends 20+ minutes rebuilding all of the local dependencies.

@kquick
Copy link
Member

kquick commented Aug 18, 2023

  • I added ${{ github.ref }} to the keys for the cabal cache. Unfortunately that didn't fix it, but it seems reasonable to add: without it github could choose to restore the cabal cache from master, or from somebody else's branch, rather than from your branch.

This shouldn't be necessary, as caching is already limited to the current branch: https://github.com/actions/cache#cache-scopes

@kquick
Copy link
Member

kquick commented Aug 18, 2023

  • Solver caching for the s2n tests is not working yet

@m-yac: what are the indicators that we would be looking for to tell if this is working?

@m-yac
Copy link
Contributor Author

m-yac commented Sep 5, 2023

  • Solver caching for the s2n tests is not working yet

@m-yac: what are the indicators that we would be looking for to tell if this is working?

Oh, I just meant that the "Save SMT solver result cache" step of the s2n tests is failing. On the most recent run it failed with:

/usr/bin/tar: s2nTests/cache/lock.mdb: Cannot open: Permission denied
/usr/bin/tar: s2nTests/cache/data.mdb: Cannot open: Permission denied
/usr/bin/tar: Exiting with failure status due to previous errors
Warning: Failed to save: "/usr/bin/tar" failed with error: The process '/usr/bin/tar' failed with exit code 2
Warning: Cache save failed.

which looks like an easy-to-fix permissions issue. For some reason this doesn't give a red X, unlike previous times this step failed, which is maybe the source of the confusion.

@m-yac m-yac marked this pull request as ready for review September 11, 2023 17:16
@m-yac m-yac requested review from kquick and RyanGlScott September 11, 2023 17:16
Copy link
Contributor

@RyanGlScott RyanGlScott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the table in the PR description is to be believed, then this work will end up saving a ton of time on CI rebuilds in the future. Great work, @m-yac!

(Out of curiosity, how did you obtain the figures in that table? Perhaps this info was dumped somewhere in the CI logs, but I couldn't easily find it.)

.github/workflows/ci.yml Outdated Show resolved Hide resolved
s2nTests/scripts/awslc-entrypoint.sh Outdated Show resolved Hide resolved
@m-yac
Copy link
Contributor Author

m-yac commented Sep 12, 2023

Thanks for the review, @RyanGlScott! With the exception of the number of cache hits, which was from debugging information which is now turned off, I got the numbers in the table directly from Github's web interface. Specifically, from the right hand side when a job is selected. The specific CI runs I used are linked in the first column. For example, after going to this CI run from master, this where I got the upper-left-most time:

Screen Shot 2023-09-12 at 11 31 55

@m-yac
Copy link
Contributor Author

m-yac commented Sep 12, 2023

FYI: Looks like Github evicted the previously built solver caches due to age, thus the first attempt at running the CI on the latest commit latest had to rebuild all the caches from scratch (and so took longer than expected). I'm re-running the CI now that there are some caches to use just to make sure everything still works.

In this future, when this PR is merged, this hopefully won't be a problem because we should always have fresh caches from master to build off of.

@m-yac
Copy link
Contributor Author

m-yac commented Sep 12, 2023

Also note that I forgot all the tests run in parallel when I made my original analysis of the speedup. Taking into account that the awslc test is always the last to finish, the correct figure is that solver caching saves 35-40 minutes overall. Between the two most recent runs mentioned above, solver caching reduced the runtime of that test from 1h 18m 45s to 38m 16s (40m saved).

@m-yac m-yac merged commit 5b6db1f into master Sep 14, 2023
@mergify mergify bot deleted the solver-caching-on-ci branch September 14, 2023 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tooling: CI Issues involving CI/CD scripts or processes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants