Add nightly benchmarks for remote caching & execution #1864
Conversation
This is pretty good, except that the commit hashes we want to trigger a build on should be the Nativelink commit hashes. It's Nativelink's performance that we are evaluating, against a single LLVM checkpoint that stays unchanged. |
@MarcusSorealheis I've updated the benchmarks as requested in commit 1b411ad5791947. |
You should be showing all the commits in the Nativelink project. Basically, build the fixed version of LLVM, and then each data point on the graph should be one Nativelink commit. There should be hundreds of data points on the graph, just like the Lucene graph. |
The Nativelink project has more than 1,000 commits right now, so it would take a while to show all of them on the chart. Besides, any attempt to show hundreds of data points will quickly use up GitHub Actions runner minutes.
The Lucene project accumulated all those data points by running nightly for years, so I assume the same would apply here, i.e. letting the benchmarks run for a while, right? Or do you want this sped up by running it in a loop? If you mean the latter, I think it is important for us to zoom out for a bit. Let's say I modify the benchmark scripts to enumerate the last 100 commits to the Nativelink repo and then run the benchmarks against each of those commits in a loop, i.e. run and record the results for commit 100, commit 99, commit 98 ... up to the current commit. That wouldn't really tell us anything, since my benchmark scripts do not control the (fixed) RBE environment at https://app.nativelink.com/ that they use for remote caching and execution.
In other words, the only way to get a meaningful signal from repeated runs of the benchmarks is if you provide an RBE environment in the cloud that I can control (using Terraform) from the benchmarks. In that case, at the start of the benchmarks, it would use Terraform to deploy the Nativelink code at commit 100 into a spot instance on AWS, run the benchmarks against it, record the results, and then repeat for commit 99, commit 98, and so on until we get to the current commit. Without access to cloud infra from the benchmarks, I don't really see any way around this. Thoughts? |
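The per-commit loop described above could be sketched roughly as follows. This is a minimal illustration, not code from the benchmarks repo: `last_commits`, `benchmark_commits`, and the `deploy`, `run_benchmark`, and `teardown` callables are hypothetical names, and the Terraform deployment is left to the caller because the thread does not show how it would be wired up.

```python
import subprocess
from typing import Callable, Dict, List


def last_commits(repo_dir: str, count: int) -> List[str]:
    """Return the last `count` commit hashes reachable from HEAD, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", f"--max-count={count}", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # rev-list prints newest first; reverse so the chart reads left to right.
    return list(reversed(out.split()))


def benchmark_commits(
    commits: List[str],
    deploy: Callable[[str], None],           # hypothetical: e.g. wraps a Terraform apply
    run_benchmark: Callable[[str], float],   # hypothetical: returns build time in seconds
    teardown: Callable[[str], None],         # hypothetical: e.g. wraps a Terraform destroy
) -> List[Dict[str, object]]:
    """Deploy each commit, benchmark it, and always tear the environment down."""
    results: List[Dict[str, object]] = []
    for commit in commits:
        deploy(commit)
        try:
            seconds = run_benchmark(commit)
            results.append({"commit": commit, "seconds": seconds})
        finally:
            # Tear down even if the benchmark fails, so instances do not keep billing.
            teardown(commit)
    return results
```

Passing the deployment steps in as callables keeps the loop itself independent of the cloud provider, so the same sketch would apply whether the environment lives on AWS or GCP.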
I can provide a cloud environment. |
@ayewo please create a Google Cloud project and only go back 15 commits. It should not come close to exceeding the free tier if you just set it up and run a few builds. Once I have a better idea, I can share a card with you for this cloud account directly. It's unlikely to exceed 400 per month. Please invite me to the GCP project. Message me directly on Slack and I can provide the card info. |
@MarcusSorealheis Perhaps another approach might be to create a new dev or staging account in GCP (or use your existing one), apply budgetary limits, then create an IAM user for me that will allow me to create/delete spot instances? That way you have complete control over any costs that might be incurred. |
@ayewo I am sharing a credit with you to your email so that you can use it for this test. |
After exploring self-hosting Nativelink on my personal AWS account, here are the preliminary base costs reported by infracost breakdown --path .
INFO Autodetected 1 Terraform project across 1 root module
INFO Found Terraform project main at directory .
WARN 1 aws_sns_topic price missing across 1 resource
Project: main
Name Monthly Qty Unit Monthly Cost
aws_instance.build_nativelink_instance["x86"]
├─ Instance usage (Linux/UNIX, on-demand, c6id.2xlarge) 730 hours $294.34
└─ root_block_device
└─ Storage (general purpose SSD, gp3) 8 GB $0.64
aws_instance.build_nativelink_instance["arm"]
├─ Instance usage (Linux/UNIX, on-demand, c6gd.2xlarge) 730 hours $224.26
└─ root_block_device
└─ Storage (general purpose SSD, gp3) 8 GB $0.64
aws_autoscaling_group.scheduler_autoscaling_group
└─ aws_launch_template.scheduler_launch_template
└─ Instance usage (Linux/UNIX, on-demand, c6g.xlarge) 730 hours $99.28
aws_autoscaling_group.cas_autoscaling_group
└─ aws_launch_template.cas_launch_template
└─ Instance usage (Linux/UNIX, spot, r6g.xlarge) 730 hours $39.20
aws_autoscaling_group.worker_autoscaling_group_x86_2cpu
└─ aws_launch_template.worker_launch_template["x86"]
└─ Instance usage (Linux/UNIX, spot, m6id.large) 730 hours $23.51
aws_lb.cas_load_balancer
├─ Application load balancer 730 hours $16.43
└─ Load balancer capacity units Monthly cost depends on usage: $5.84 per LCU
aws_lb.scheduler_load_balancer
├─ Application load balancer 730 hours $16.43
└─ Load balancer capacity units Monthly cost depends on usage: $5.84 per LCU
aws_vpc_endpoint.aws_api_ec2_endpoint
├─ Data processed (first 1PB) Monthly cost depends on usage: $0.01 per GB
└─ Endpoint (Interface) 730 hours $7.30
aws_autoscaling_group.worker_autoscaling_group_arm_1cpu
└─ aws_launch_template.worker_launch_template["arm"]
└─ Instance usage (Linux/UNIX, spot, c6gd.medium) 730 hours $3.21
aws_lambda_function.update_scheduler_ips_lambda
├─ Requests Monthly cost depends on usage: $0.20 per 1M requests
├─ Ephemeral storage Monthly cost depends on usage: $0.0000000309 per GB-seconds
└─ Duration (first 6B) Monthly cost depends on usage: $0.0000166667 per GB-seconds
aws_route53_record.cas_lb_domain_record_cert_verify
├─ Standard queries (first 1B) Monthly cost depends on usage: $0.40 per 1M queries
├─ Latency based routing queries (first 1B) Monthly cost depends on usage: $0.60 per 1M queries
└─ Geo DNS queries (first 1B) Monthly cost depends on usage: $0.70 per 1M queries
aws_route53_record.scheduler_lb_domain_record_cert_verify
├─ Standard queries (first 1B) Monthly cost depends on usage: $0.40 per 1M queries
├─ Latency based routing queries (first 1B) Monthly cost depends on usage: $0.60 per 1M queries
└─ Geo DNS queries (first 1B) Monthly cost depends on usage: $0.70 per 1M queries
aws_s3_bucket.access_logs
└─ Standard
├─ Storage Monthly cost depends on usage: $0.023 per GB
├─ PUT, COPY, POST, LIST requests Monthly cost depends on usage: $0.005 per 1k requests
├─ GET, SELECT, and all other requests Monthly cost depends on usage: $0.0004 per 1k requests
├─ Select data scanned Monthly cost depends on usage: $0.002 per GB
└─ Select data returned Monthly cost depends on usage: $0.0007 per GB
aws_s3_bucket.cas_bucket
└─ Standard
├─ Storage Monthly cost depends on usage: $0.023 per GB
├─ PUT, COPY, POST, LIST requests Monthly cost depends on usage: $0.005 per 1k requests
├─ GET, SELECT, and all other requests Monthly cost depends on usage: $0.0004 per 1k requests
├─ Select data scanned Monthly cost depends on usage: $0.002 per GB
└─ Select data returned Monthly cost depends on usage: $0.0007 per GB
aws_sns_topic.update_scheduler_ips_sns_topic
├─ API requests (over 1M) Monthly cost depends on usage: $0.50 per 1M requests
├─ HTTP/HTTPS notifications (over 100k) Monthly cost depends on usage: $0.06 per 100k notifications
├─ Email/Email-JSON notifications (over 1k) Monthly cost depends on usage: $2.00 per 100k notifications
├─ Kinesis Firehose notifications Monthly cost depends on usage: $0.19 per 1M notifications
├─ Mobile Push notifications Monthly cost depends on usage: $0.50 per 1M notifications
├─ MacOS notifications Monthly cost depends on usage: $0.50 per 1M notifications
└─ SMS notifications (over 100) not found
OVERALL TOTAL $725.22
*Usage costs can be estimated by updating Infracost Cloud settings, see docs for other options.
──────────────────────────────────
84 cloud resources were detected:
∙ 15 were estimated
∙ 66 were free
∙ 3 are not supported yet, rerun with --show-skipped to see details
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Project ┃ Baseline cost ┃ Usage cost* ┃ Total cost ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━┫
┃ main ┃ $725 ┃ - ┃ $725 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━┛
After running infracost breakdown --path plan.json
INFO Autodetected 1 Terraform plan JSON file project across 1 root module
INFO Found Terraform plan JSON file project plan.json at directory plan.json
WARN 1 aws_sns_topic price missing across 1 resource
Project: TraceMachina/graveyard/deployment-examples/nativelink-terraform-aws/plan.json
Name Monthly Qty Unit Monthly Cost
aws_instance.build_nativelink_instance["x86"]
├─ Instance usage (Linux/UNIX, on-demand, c6id.2xlarge) 730 hours $294.34
└─ root_block_device
└─ Storage (general purpose SSD, gp3) 8 GB $0.64
aws_instance.build_nativelink_instance["arm"]
├─ Instance usage (Linux/UNIX, on-demand, c6gd.2xlarge) 730 hours $224.26
└─ root_block_device
└─ Storage (general purpose SSD, gp3) 8 GB $0.64
aws_autoscaling_group.scheduler_autoscaling_group
└─ aws_launch_template.scheduler_launch_template
└─ Instance usage (Linux/UNIX, on-demand, c6g.xlarge) 730 hours $99.28
aws_autoscaling_group.cas_autoscaling_group
└─ aws_launch_template.cas_launch_template
└─ Instance usage (Linux/UNIX, spot, r6g.xlarge) 730 hours $39.20
aws_vpc_endpoint.aws_api_ec2_endpoint
├─ Data processed (first 1PB) Monthly cost depends on usage: $0.01 per GB
└─ Endpoint (Interface) 2,920 hours $29.20
aws_autoscaling_group.worker_autoscaling_group_x86_2cpu
└─ aws_launch_template.worker_launch_template["x86"]
└─ Instance usage (Linux/UNIX, spot, m6id.large) 730 hours $23.51
aws_lb.cas_load_balancer
├─ Application load balancer 730 hours $16.43
└─ Load balancer capacity units Monthly cost depends on usage: $5.84 per LCU
aws_lb.scheduler_load_balancer
├─ Application load balancer 730 hours $16.43
└─ Load balancer capacity units Monthly cost depends on usage: $5.84 per LCU
aws_autoscaling_group.worker_autoscaling_group_arm_1cpu
└─ aws_launch_template.worker_launch_template["arm"]
└─ Instance usage (Linux/UNIX, spot, c6gd.medium) 730 hours $3.21
aws_lambda_function.update_scheduler_ips_lambda
├─ Requests Monthly cost depends on usage: $0.20 per 1M requests
├─ Ephemeral storage Monthly cost depends on usage: $0.0000000309 per GB-seconds
└─ Duration (first 6B) Monthly cost depends on usage: $0.0000166667 per GB-seconds
aws_route53_record.cas_lb_domain_record_cert_verify["cas.nativelink.ayewo.com"]
├─ Standard queries (first 1B) Monthly cost depends on usage: $0.40 per 1M queries
├─ Latency based routing queries (first 1B) Monthly cost depends on usage: $0.60 per 1M queries
└─ Geo DNS queries (first 1B) Monthly cost depends on usage: $0.70 per 1M queries
aws_route53_record.scheduler_lb_domain_record_cert_verify["scheduler.nativelink.ayewo.com"]
├─ Standard queries (first 1B) Monthly cost depends on usage: $0.40 per 1M queries
├─ Latency based routing queries (first 1B) Monthly cost depends on usage: $0.60 per 1M queries
└─ Geo DNS queries (first 1B) Monthly cost depends on usage: $0.70 per 1M queries
aws_s3_bucket.access_logs
└─ Standard
├─ Storage Monthly cost depends on usage: $0.023 per GB
├─ PUT, COPY, POST, LIST requests Monthly cost depends on usage: $0.005 per 1k requests
├─ GET, SELECT, and all other requests Monthly cost depends on usage: $0.0004 per 1k requests
├─ Select data scanned Monthly cost depends on usage: $0.002 per GB
└─ Select data returned Monthly cost depends on usage: $0.0007 per GB
aws_s3_bucket.cas_bucket
└─ Standard
├─ Storage Monthly cost depends on usage: $0.023 per GB
├─ PUT, COPY, POST, LIST requests Monthly cost depends on usage: $0.005 per 1k requests
├─ GET, SELECT, and all other requests Monthly cost depends on usage: $0.0004 per 1k requests
├─ Select data scanned Monthly cost depends on usage: $0.002 per GB
└─ Select data returned Monthly cost depends on usage: $0.0007 per GB
aws_sns_topic.update_scheduler_ips_sns_topic
├─ API requests (over 1M) Monthly cost depends on usage: $0.50 per 1M requests
├─ HTTP/HTTPS notifications (over 100k) Monthly cost depends on usage: $0.06 per 100k notifications
├─ Email/Email-JSON notifications (over 1k) Monthly cost depends on usage: $2.00 per 100k notifications
├─ Kinesis Firehose notifications Monthly cost depends on usage: $0.19 per 1M notifications
├─ Mobile Push notifications Monthly cost depends on usage: $0.50 per 1M notifications
├─ MacOS notifications Monthly cost depends on usage: $0.50 per 1M notifications
└─ SMS notifications (over 100) not found
OVERALL TOTAL $747.12
*Usage costs can be estimated by updating Infracost Cloud settings, see docs for other options.
──────────────────────────────────
84 cloud resources were detected:
∙ 15 were estimated
∙ 66 were free
∙ 3 are not supported yet, rerun with --show-skipped to see details
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Project ┃ Baseline cost ┃ Usage cost* ┃ Total cost ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━┫
┃ TraceMachina/graveyard/deployme...velink-terraform-aws/plan.json ┃ $747 ┃ - ┃ $747 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━┛
Essentially, even having this infra running for just 24hrs alone will set me back by $25!! |
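The arithmetic behind the daily figure, and behind the gap between the two Infracost totals, can be checked with a short sketch. It assumes the 730-hour month that Infracost uses in the quantities above; `daily_cost` is a name made up for this illustration.

```python
from decimal import Decimal

# Infracost prices a month as 730 hours (see the "Monthly Qty" column above).
HOURS_PER_MONTH = Decimal(730)


def daily_cost(monthly_total: Decimal) -> Decimal:
    """Convert a monthly baseline cost into the cost of 24 hours of uptime."""
    hourly = monthly_total / HOURS_PER_MONTH
    return (hourly * 24).quantize(Decimal("0.01"))


dir_total = Decimal("725.22")    # total from `--path .`
plan_total = Decimal("747.12")   # total from `--path plan.json`

print("24h on plan.json estimate:", daily_cost(plan_total))
print("difference between totals:", plan_total - dir_total)
# The VPC endpoint line is the only one whose cost differs between the two runs
# (730 hours at $7.30 versus 2,920 hours at $29.20).
print("VPC endpoint difference:  ", Decimal("29.20") - Decimal("7.30"))
```

Under these assumptions the 24-hour cost comes out at $24.56, which matches the roughly $25 quoted above, and the $21.90 gap between the two totals is fully accounted for by the VPC endpoint line.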
@MarcusSorealheis the issue this PR is attached to (#1700) was closed yesterday, so I'm guessing there is no bounty since this is no longer a priority? If so, should this PR be closed as well? EDIT: Output of
|
Description
Overview
The linked benchmarks repo uses the LLVM project as a test case:
An example of the results can be viewed at https://ayewo.github.io/nativelink-rbe-benchmarks/
LLVM Project
I selected the LLVM monorepo as the benchmark target because it supports two build systems (CMake and experimental Bazel).
GitHub Actions workflows
- 01-docker-rbe-worker.yml: Builds and publishes a custom Docker image to ghcr.io containing all compilation dependencies needed by the LLVM project. This Docker image is used by Nativelink's RBE worker.
- 02-bazel-baseline.yml: Can be used to reset all performance measurements.
- 03-bazel-benchmarks.yml: Runs the actual benchmarks and writes the results for remote caching and execution to CSV files.
- 04-apache-otava.yml: Uses Apache Otava for Change Point Detection (CPD) to automatically catch performance regressions. Apache Otava supports pushing Slack notifications for regression alerts, if a SLACK_BOT_TOKEN is present.
- 04-ghpages-astro.yml: Generates a static website using the Astro framework and deploys it to GitHub Pages. It plots 2 line charts using the results from the CSV files.
Documentation and Deployment
Please see the Deployment section of the README.md for step-by-step instructions on how to deploy the code in your GitHub organization.
This PR is intended to close: #1700
Type of change
Please delete options that aren't relevant.
How Has This Been Tested?
Via GitHub Actions. The build target is the llvm target from the LLVM project. Building the whole llvm-project can take several hours, which is why only the llvm target is used, and even that can take close to 4 hours for a cold build (i.e. without remote caching).
So please keep in mind that testing using your organization's repo might consume a lot of GitHub Actions runner minutes.
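Recording a timed build (cold or cached) as a CSV row, in the spirit of what 03-bazel-benchmarks.yml is described as doing, might look roughly like this. It is a sketch only: the column names in `FIELDS` and the helper names `time_command` and `append_result` are assumptions made for illustration, and the actual workflow may record different fields.

```python
import csv
import subprocess
import time
from pathlib import Path
from typing import Sequence

# Hypothetical column names; the real CSV layout is not shown in this PR.
FIELDS = ["commit", "mode", "seconds"]


def time_command(cmd: Sequence[str]) -> float:
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def append_result(csv_path: str, commit: str, mode: str, seconds: float) -> None:
    """Append one measurement, writing the header only when the file is new or empty."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(
            {"commit": commit, "mode": mode, "seconds": f"{seconds:.3f}"}
        )
```

Appending one row per run keeps the file easy to plot as a line chart and easy to feed into a change point detection step, since each row is one observation in time order.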
Checklist
- bazel test //... passes locally
- git amend (see some docs)