Exploring a baseline Action build #48421


Closed
bhack wants to merge 5 commits from the docker_devel_action branch

Conversation

bhack (Contributor) commented on Apr 8, 2021

With this PR I want to explore a new testing baseline with GitHub Actions and our official CPU tensorflow/tensorflow:devel image.

The idea is to test in CI the journey of a (more or less) episodic contributor contributing code to TensorFlow, at least on CPU.

This is the proposed list of steps (a rough workflow sketch follows the list):

  • tensorflow/tensorflow:devel image rebuild (or Docker Hub pull?)
  • Code checkout
  • ci_sanity.sh selected steps (--pylint, -- see Supersed pylint_allowlist #48294)
  • TF bazel ./configure
  • bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
  • bazel test //tensorflow/
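
As a rough illustration only, a minimal GitHub Actions workflow covering these steps could look like the sketch below; the job and step names, the container choice (pull vs. rebuild), the ./configure defaults, and the bazel test pattern are placeholders rather than the actual workflow proposed in this PR.

```yaml
# Hypothetical sketch of the proposed baseline Action; names, paths and targets are placeholders.
name: cpu-contributor-baseline

on: [pull_request]

jobs:
  devel-cpu-build:
    runs-on: ubuntu-latest
    container: tensorflow/tensorflow:devel      # Docker Hub pull; a rebuild step could replace this
    steps:
      - uses: actions/checkout@v2               # code checkout
      - name: Selected sanity checks
        run: tensorflow/tools/ci_build/ci_sanity.sh --pylint
      - name: Configure
        run: yes "" | ./configure               # accept default answers for a CPU-only build
      - name: Build the pip package target
        run: bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
      - name: Test
        run: bazel test //tensorflow/...        # test pattern is a placeholder
```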

As the average user is already experiencing, this will probably require a Bazel cache (on GCS, like for TF/IO?) to achieve reasonable compilation times.

I think that the reproducibility and timing of these build steps will let us monitor the experience of an episodic TensorFlow contribution.

/cc @angerson @mihaimaruseac @theadactyl @joanafilipa

google-ml-butler bot added the size:M (CL Change Size: Medium) label on Apr 8, 2021
google-cla bot added the cla: yes label on Apr 8, 2021
gbaned self-assigned this on Apr 9, 2021
bhack (Contributor, Author) commented on Apr 9, 2021

As expected, we had a GitHub Actions timeout on the TensorFlow build step after 5h 56m 42s, with only 11,286 targets compiled out of an estimated 33,563 targets configured.

GitHub Actions currently run on a Standard_DS2_v2 machine.

As we already know, this is a real bottleneck for the average external TF contributor (episodic or not), as we ask them to reproduce these steps on their own local machine just to prepare an occasional code PR.

I think it is important to continuously monitor this Action over time, so that we can keep its execution within a time that seems reasonable to us for an episodic/average TF contributor.

Some proposed solutions to enable this action, in order of preference (a cache-consumption sketch follows the list):

  • Use a Google-hosted Action runner and produce a GCS cache that could be reused by the action itself and (read-only) by every contributor on their own local machine when working with the official tensorflow/tensorflow:devel image. A sort of improvement over Enable read-only bazel cache io#1294.
  • Still use a GitHub-hosted Action runner, consuming a usable GCS cache produced elsewhere (where?).
  • Use Python-only Bazel build and test commands in this Action, taking all the C/C++ components from a system pip install tf-nightly. This is risky because, as we build the nightly only once a day, we could have a misalignment against the current master C/C++ features.
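
For the first two options, the sketch below shows how a pre-populated, read-only GCS-backed Bazel cache could be consumed from the workflow; the bucket name is a placeholder and the flags are generic Bazel remote-cache options, not an agreed-upon setup with the infra team.

```yaml
# Hypothetical workflow step consuming a read-only GCS cache; the bucket name is a placeholder.
- name: Build with a read-only remote cache
  run: |
    bazel build --config=opt \
      --remote_cache=https://storage.googleapis.com/tf-devel-bazel-cache \
      --remote_upload_local_results=false \
      //tensorflow/tools/pip_package:build_pip_package
```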

bhack (Contributor, Author) commented on Apr 13, 2021

Just in case we want to explore the first option with a self-hosted GitHub Actions runner on GKE (a minimal runner sketch follows the links):
https://github.com/summerwind/actions-runner-controller
https://github.com/evryfs/github-actions-runner-operator/
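
With the first controller, a self-hosted runner pool is declared through a RunnerDeployment custom resource; a minimal sketch follows, with the replica count and runner label as placeholders (the exact fields should be checked against the controller's documentation).

```yaml
# Hypothetical RunnerDeployment for summerwind/actions-runner-controller; values are placeholders.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: tensorflow-ci-runners
spec:
  replicas: 2
  template:
    spec:
      repository: tensorflow/tensorflow    # repository the runners register against
      labels:
        - gke-cpu-builder                  # custom label to target these runners from a workflow
```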

bhack (Contributor, Author) commented on Apr 13, 2021

There is also a Terraform Github Self Hosted Runners on GKE repo maintained by Google Cloud members (/cc @bharathkkb) at https://github.com/terraform-google-modules/terraform-google-github-actions-runners

bhack (Contributor, Author) commented on Apr 14, 2021

/cc @perfinion in case we can do some steps on this together.

bhack (Contributor, Author) commented on Apr 15, 2021

Update: We discussed a pilot plan with @perfinion yesterday on SIG-Build Gitter.

vnghia (Contributor) commented on Apr 15, 2021

I would add one more difficulty: even with a local cache, it seems to be invalidated each time I pull commits from upstream. (I think LLVM-related commits like 17e6dc2 are the culprits.)

bhack (Contributor, Author) commented on Apr 15, 2021

> I would add one more difficulty: even with a local cache, it seems to be invalidated each time I pull commits from upstream. (I think LLVM-related commits like 17e6dc2 are the culprits.)

What cache command are you using?

vnghia (Contributor) commented on Apr 15, 2021

I am using --disk_cache. I notice that I have a much longer build (around 8–10 hours) every time there is an LLVM-related commit (which is pretty much daily, but I don't pull upstream that often).
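
As a side note for the CI case, if the Bazel disk cache were persisted with GitHub's cache action, one hedged way to cope with the LLVM bumps would be to key the cache entry on the file that pins the LLVM commit, so a bump starts a fresh entry while unrelated pulls keep hitting the previous one; the paths and the pin file below are assumptions, not a tested setup.

```yaml
# Hypothetical caching step; the disk-cache path and the LLVM pin file are assumptions.
- name: Cache the bazel disk cache
  uses: actions/cache@v2
  with:
    path: ~/.cache/bazel-disk              # directory passed to --disk_cache
    key: bazel-${{ runner.os }}-${{ hashFiles('third_party/llvm/workspace.bzl') }}
    restore-keys: |
      bazel-${{ runner.os }}-
```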

vnghia (Contributor) commented on Apr 15, 2021

I found that in #40505 (comment) @mihaimaruseac said the same thing. Do you have any problem regarding this issue, @bhack?

bhack (Contributor, Author) commented on Apr 15, 2021

> I found that in #40505 (comment) @mihaimaruseac said the same thing. Do you have any problem regarding this issue, @bhack?

We are waiting for a bootstrapped GCS cache for this action, produced with a fresh master build in tensorflow/tensorflow:devel.

bhack (Contributor, Author) commented on Apr 15, 2021

If the LLVM sync totally invalidates the remote Bazel cache, we cannot use GitHub-hosted Actions runners but need to use self-hosted GitHub Actions runners, as suggested in #48421 (comment).

gbaned (Contributor) commented on Jun 25, 2021

@bhack This PR is in draft; any update on this, please? Thanks!

bhack (Contributor, Author) commented on Jun 25, 2021

@gbaned It is a draft because, as you can see, the introduced action times out on GitHub.
I am waiting on an agreement with the infra/build team on how I could use a read-only GCS cache. /cc @mihaimaruseac @angerson

bhack (Contributor, Author) commented on Oct 14, 2021

Just for reference, it is timing out on this kind of hardware resources:

https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

bhack force-pushed the docker_devel_action branch from 1b12917 to 3175de1 on October 25, 2021 at 23:26
gbaned requested a review from mihaimaruseac on December 28, 2021 at 16:42
google-ml-butler bot added the awaiting review (Pull request awaiting review) label on Dec 28, 2021
gbaned removed the awaiting review (Pull request awaiting review) label on Feb 9, 2022
bhack (Contributor, Author) commented on Sep 27, 2022

Closing this for #57630

bhack closed this on Sep 28, 2022