Move runtime rolling builds to run twice a day on main #59884
Conversation
Tagging subscribers to this area: @Anipik, @safern, @ViktorHofer

Issue Details

In an effort to reduce the usage of build/test resources, we are going to experiment with running rolling builds just twice a day instead of the 6-8 runs a day we were getting with the per-commit batch trigger. We chose 00:00 and 12:00 PST to fall in the middle of the workday load for each timezone.

I left release branches on the per-commit batch trigger, since commits there are not that frequent, and when doing servicing we do want the extra protection. If we want to change the release branches as well, I can do that too 😄

cc: @jkotas @stephentoub @danmoseley
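For illustration, the yml change roughly amounts to adding a schedule like the following (a sketch only, not the exact diff; the display name and branch filter are placeholders, and since Azure Pipelines cron schedules are specified in UTC, 00:00/12:00 PST show up as 08:00/20:00 UTC):

schedules:
- cron: "0 8,20 * * *"      # 00:00 and 12:00 PST, expressed in UTC
  displayName: Twice-daily rolling build
  branches:
    include:
    - main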
Thanks
Looks good, thank you!
I like the time schedule choices. They make it easier to respond quickly to failures.
Should the stress pipelines and such also follow suit?
Should the stress pipelines and such also follow suit?
Yes. We should spread the schedules out to even out the machine load.
Kusto query against https://1es.kusto.windows.net/GitHub:
PullRequestTimeline
| where OrganizationLogin == "dotnet" and RepositoryName == "runtime"
| extend MergedAt = extract('"merged_at":"([\\w\\-:]+)"', 1, Data)
| where MergedAt != ""
| extend MergedAtDate = todatetime(MergedAt) - timespan(7h) // convert to PDT
| project-keep MergedAtDate
| distinct MergedAtDate
| summarize count() by hourofday(MergedAtDate)
| render columnchart
This assumes it's PDT all year round, so on average the actual times are about 30 minutes earlier than shown. Eyeballing it, 0:00 and 12:00 PDT is not a bad choice.
I'll put up a PR for that one as well. Maybe we can start it an hour or two earlier so they don't compete for resources as much.
Out of curiosity (I'm sure there are better ways to do this), I dumped all the pushes rather than the merges. I guess this is a leading indicator of PR validation machine load?
Kusto query against https://1es.kusto.windows.net/GitHub:
PullRequestTimeline
| where OrganizationLogin == "dotnet" and RepositoryName == "runtime"
| extend Commit = extract('"head"[^}]*"sha":"(\\w+)', 1, Data)
| where Commit != ""
| extend CommitAt = extract('"updated_at":"([\\w\\-:]+)"', 1, Data)
| extend CommitAtDate = todatetime(CommitAt)
| project-keep CommitAtDate
| distinct CommitAtDate
| summarize count() by hourofday(CommitAtDate)
| render columnchart
If the stress runs happened at 13:00 UTC and 22:00 UTC they would avoid the busiest times, still split PRs fairly evenly, and avoid the rolling-build times. OK, I'm thinking way too much about this..!
You are not. These are great insights.
Thanks for the insights, @danmoseley. I'll set the stress pipelines to that schedule then.
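Roughly something like this in the stress pipelines' yml (a sketch only; the display name and branch filter are placeholders):

schedules:
- cron: "0 13,22 * * *"     # 13:00 and 22:00 UTC, avoiding the busiest hours and the rolling-build slots
  displayName: Twice-daily stress run
  branches:
    include:
    - main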
I assume we want to do the same for the rolling builds of runtime-staging?
I'm fine with that. /cc @lewing @SamMonoRT
@safern would you be the right person to change runtime-staging as well?
@ViktorHofer yes, I'm taking care of those as well. I was wondering: @jkotas @BruceForstall @hoyosjs, should we also move the coreclr outerloop rolling build to be scheduled rather than triggered?
Yes, I think so. Thank you!
IMO, no. I think the coreclr outerloop job, as it currently runs in batch trigger mode, is very useful as is for relatively quickly identifying when a problem was introduced. It's also quite useful for validating the infrastructure state of the system. The JIT team also frequently triggers outerloop runs on PRs, and having a baseline outerloop job to compare against is very useful.
We could always use a model with 3 runs a day; that way there won't be many changes between each run, keeping it simple to identify what broke.
It looks like we had 36 runs of coreclr outerloop last week, about 5 per day. Does restricting it to 3 per day really matter?
We are hitting our Azure quotas. Reducing 5 runs to 3 runs per day is a 40% saving.
IMO, those 2 runs consume a lot of build and test resources; cutting them definitely gives PRs more capacity and reduces Helix and build wait times.
What if we split by architecture? E.g., do a rolling, batched, win-x64 runtime-coreclr outerloop job, as today, but only run the non-win-x64 platforms twice a day? This way we get "minimal" pri-1 coverage regularly, and "full" platform coverage less frequently.
If the resources come out to be about the same, that seems reasonable to me.
So run the less stable platforms, which benefit more from testing, less often?
I believe the yml change to do that would be more complicated, as we would need to split into multi-job declarations in the yml and condition each one on the build reason (see the sketch below). I could help with that, but is it really worth it? I believe losing 2 runs a day is not losing that much coverage; if we were going to run only once a day I would agree, but I think we would still be covered if we did something like 11am, 6pm and 2am. Also, every run takes from 2:45 to 3:40 hours, so we would only leave commits uncovered for around 2 hours in each interval.
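For reference, each platform job would end up needing something along these lines (a sketch only; the job names and steps are placeholders, not the real template layout):

jobs:
- job: outerloop_win_x64
  # hypothetical job: keeps running on the current batched rolling trigger
  steps:
  - script: echo "run win-x64 Pri-1 outerloop"       # placeholder for the real build/test templates
- job: outerloop_remaining_platforms
  # hypothetical job: runs only when the build was queued by the schedule
  condition: eq(variables['Build.Reason'], 'Schedule')
  steps:
  - script: echo "run the remaining platforms"       # placeholder for the real build/test templates

Multiplied across the existing platform matrix and the shared templates, that is where the extra complexity comes from.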
All of these options are compromises. My idea was to at least quickly detect all-platform issues. As I wrote before, I personally prefer not to reduce the testing at all.
Yes. Maybe cloning the
Anyway, I just wanted to throw out another idea to try to achieve better coverage with the same proposed level of cost. It's easy to quantify the cost savings of reducing testing (and maybe the increased machine costs of increased testing). However, I don't know how to evaluate the productivity cost of reduced coverage (e.g., when a regression occurs, there are more commits to bisect or examine to find it). I also don't know the cost constraints we are operating under, or how to evaluate them against other costs. So I will leave the decisions here to those who have insight into those aspects.
What about moving it to scheduled with the times I proposed above, and then evaluating whether that works? If we see that it is indeed harder to find an offending commit, or that it's hurting productivity, we can then move some legs back to rolling.
If we are required to reduce CI usage then yes, we can do that.