
Reduce CI Workload by Removing Some Spark Variants and Using Callable Workflows for GitHub Actions #5153

Closed
kbendick opened this issue Jun 28, 2022 · 6 comments

@kbendick
Contributor

kbendick commented Jun 28, 2022

We now have 30 GitHub Actions that run as part of the CI test suite, and they are starting to have a noticeable impact on CI runners.

We test Spark with a large number of combinations of Java versions and Scala versions.

We previously only tested the "latest" Spark version (i.e. Spark 3.2) with Scala 2.13.

We are now testing:

  • Spark 2 with Java 8 (1 workflow)
  • Spark 3.0, 3.1, 3.2, 3.3 with Java 8 and Scala 2.12 (4 workflows)
  • Spark 3.0, 3.1, 3.2, 3.3 with Java 11 and Scala 2.12 (4 workflows)
  • Spark 3.2, 3.3 with Java 8 and Scala 2.13 (2 workflows)
  • Spark 3.2, 3.3 with Java 11 and Scala 2.13 (2 workflows)

That brings the total to 13 Spark-specific CI variants that run on every PR that touches core or Spark.
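As a quick check on that total, the Spark matrix listed above can be enumerated in a few lines of Python. This is only a sketch: the tuple layout and the `SPARK_VARIANTS` name are illustrative assumptions, not taken from Iceberg's actual workflow files, and the Scala version for Spark 2 is left as `None` because the list does not state it.

```python
from itertools import product

# Hypothetical encoding of the Spark CI variants listed above:
# (spark_version, java_version, scala_version).
SPARK_VARIANTS = (
    # Spark 2 with Java 8 (Scala version not stated, so None)
    [("2", 8, None)]
    # Spark 3.0-3.3 with Java 8 and Java 11, Scala 2.12
    + [(spark, java, "2.12") for spark, java in product(["3.0", "3.1", "3.2", "3.3"], [8, 11])]
    # Spark 3.2-3.3 with Java 8 and Java 11, Scala 2.13
    + [(spark, java, "2.13") for spark, java in product(["3.2", "3.3"], [8, 11])]
)

if __name__ == "__main__":
    # 1 + 8 + 4 variants
    print(len(SPARK_VARIANTS))
```

Written this way, it is easy to see that each extra Java or Scala axis multiplies the number of runners rather than adding to it.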

We should consider reducing the number of JRE/Scala version combinations that are run for the various Spark versions, as CI now takes noticeably longer.

We should also look (again) into refactoring our CI test suites to use callable workflows, so that all tests stem from one root test (very much like an Airflow DAG) and, if any one test fails, they all stop. At present we get this for free for any set of CI suites generated from one matrix (such as Java 11 and Java 8 with Scala 2.12).

This would reduce the number of CI slots spent running tests that will have to be rerun anyway because something else failed.

We can also run the faster tests first, to make sure they pass, before calling out to the more expensive tests (such as Spark and Flink).
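The "one root test, stop everything downstream on failure" idea can be modelled as a small dependency graph. The Python sketch below is not GitHub Actions configuration (in a workflow, job ordering is expressed with `needs:`); the job names and pass/fail outcomes are hypothetical, and it uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical CI graph: each job maps to the set of jobs it depends on,
# with cheap jobs at the root and expensive suites (Spark, Flink) downstream.
JOBS = {
    "build": set(),
    "core-tests": {"build"},
    "spark-tests": {"core-tests"},
    "flink-tests": {"core-tests"},
}


def run_pipeline(jobs, outcomes):
    """Run jobs in dependency order, skipping any job whose upstream did not pass.

    `outcomes` maps job name -> True (pass) / False (fail); it stands in for
    actually executing the job.
    """
    results = {}
    for job in TopologicalSorter(jobs).static_order():
        if all(results[dep] == "passed" for dep in jobs[job]):
            results[job] = "passed" if outcomes[job] else "failed"
        else:
            # An upstream job failed or was skipped, so no runner is spent here.
            results[job] = "skipped"
    return results


if __name__ == "__main__":
    print(run_pipeline(JOBS, {"build": True, "core-tests": False,
                              "spark-tests": True, "flink-tests": True}))
```

With `core-tests` failing, both `spark-tests` and `flink-tests` come back as `skipped`, which is the saving described above. The sketch only models skipping downstream jobs; cancelling siblings that are already running (what `fail-fast` does within a single matrix) is not shown.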

I tried before with the callable workflow, but at the time it wasn't worth the effort. I think now it probably is.

@kbendick
Contributor Author

There are a few small changes I can make so that more Spark workflows terminate early if one fails, but by no means will that fully resolve the problem.

@singhpk234
Contributor

singhpk234 commented Jul 9, 2022

+1 on the above. Along with this, we can also leverage the GitHub Actions resources of the forked repositories instead of using the resources of the ASF organisation on GitHub.

This is what Apache Spark does at present:

  • We create a PR, and our (forked) repository triggers the workflow, so the PR uses the resources allocated to us for testing.
  • The Apache Spark repository finds our workflow run and links to it in a comment on our PR.

Relevant PR in Spark:

Would love to know your thoughts on the same :)

@HyukjinKwon
Member

I happened to read the related links. Thanks @singhpk234 for explaining Spark's CI. To be clearer, apache/spark#32092 implemented the logic you explained. After that, I also implemented the logic to leverage GitHub check status (apache/spark#32193).

See one example of test results:
[Screenshots of example test results, 2022-12-14]

@HyukjinKwon
Member

HyukjinKwon commented Dec 13, 2022

In this way, we can remove all the overhead in the current repo, and leverage the resources from the forked repositories.
Spark was one of the ASF projects that used GitHub resources the most, and after this change it is now one of the lowest :-).

I am willing to help and review if someone wants to port these changes to Iceberg :-).


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Aug 16, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned Sep 11, 2024
Development

No branches or pull requests

3 participants