
Move runtime rolling builds to run twice a day on main #59884


Merged: 2 commits into dotnet:main on Oct 2, 2021

Conversation

@safern (Member) commented Oct 1, 2021

In an effort to reduce build/test resource usage, we are going to experiment with running rolling builds just twice a day, versus the 6-8 runs per day we were getting with the per-commit batch trigger.

Currently we chose 00:00 and 12:00 PST, so that the runs fall in the middle of the workday load for each timezone.

I left release branches on the per-commit batch trigger; commits there are not that frequent, and when doing servicing we do want the extra protection. If we want to change release branches as well, I can do that too 😄

cc: @jkotas @stephentoub @danmoseley
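
For illustration only (this is not the literal diff in the PR), the change amounts to replacing the batched CI trigger on main with an Azure Pipelines cron schedule. Cron times are UTC, so 08:00 and 20:00 UTC correspond to 00:00 and 12:00 PST; the branch filters below are assumptions:

schedules:
- cron: "0 8,20 * * *"        # 00:00 and 12:00 PST, expressed in UTC
  displayName: Twice-daily rolling build
  branches:
    include:
    - main
  always: false               # skip a scheduled run if there are no new changes

# release branches stay on the per-commit batch trigger
trigger:
  batch: true
  branches:
    include:
    - release/*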

@safern requested a review from a team · October 1, 2021 22:59
@ghost commented Oct 1, 2021

Tagging subscribers to this area: @Anipik, @safern, @ViktorHofer
See info in area-owners.md if you want to be subscribed.


Author: safern · Assignees: - · Labels: area-Infrastructure-libraries · Milestone: -

@jkotas (Member) left a comment

Thanks

@trylek (Member) left a comment

Looks good, thank you!

@jkoritzinsky (Member) left a comment

I like the schedule choices. They make it easier to respond quickly to failures.

@hoyosjs (Member) commented Oct 1, 2021

Should stress pipeline and such also follow suit?

@jkotas (Member) commented Oct 1, 2021

Should stress pipeline and such also follow suit?

Yes. We should spread the schedule to even out the machine load.

@danmoseley (Member) commented:

https://1es.kusto.windows.net/GitHub

PullRequestTimeline
| where OrganizationLogin == "dotnet" and RepositoryName == "runtime"
| extend MergedAt = extract('"merged_at":"([\\w\\-:]+)"', 1, Data)
| where MergedAt != ""
| extend MergedAtDate = todatetime(MergedAt) - timespan(7h) // convert to PDT
| project-keep MergedAtDate
| distinct MergedAtDate
| summarize count() by hourofday(MergedAtDate)
| render columnchart

[column chart: merged dotnet/runtime PRs by hour of day, PDT]

This assumes it's PDT all year round, so on average the hours shown should really be about 30 minutes earlier.

Eyeballing it, 00:00 and 12:00 PDT are not a bad choice.

@safern (Member, Author) commented Oct 1, 2021

Yes. We should spread the schedule to even out the machine load.

I'll put up a PR for that one as well. Maybe we can start it an hour or two earlier so that the two don't overlap on resources as much.

@danmoseley (Member) commented:

Ah, I just noticed you use UTC all year round. So here's the UTC graph:
[column chart: merged dotnet/runtime PRs by hour of day, UTC]

And between 08:00 and 20:00 UTC there will be 53% of PRs. Seems pretty good indeed.

@danmoseley (Member) commented Oct 2, 2021

Out of curiosity (I'm sure there are better ways to do this) I dumped all the pushes rather than the merges. I guess this is a leading indicator of PR validation machine load?


https://1es.kusto.windows.net/GitHub

PullRequestTimeline
| where OrganizationLogin == "dotnet" and RepositoryName == "runtime"
| extend Commit = extract('"head"[^}]*"sha":"(\\w+)', 1, Data)
| where Commit != ""
| extend CommitAt = extract('"updated_at":"([\\w\\-:]+)"', 1, Data)
| extend CommitAtDate = todatetime(CommitAt) 
| project-keep CommitAtDate
| distinct CommitAtDate
| summarize count() by hourofday(CommitAtDate)
| render columnchart

[column chart: dotnet/runtime PR pushes by hour of day, UTC]

If the stress pipeline ran at 13:00 UTC and 22:00 UTC it would avoid the busiest times, still split PRs evenly, and avoid the rolling-build times.

OK, I'm thinking way too much about this!

@jkotas (Member) commented Oct 2, 2021

OK, I'm thinking way too much about this!

You are not. These are great insights.

@safern (Member, Author) commented Oct 2, 2021

Thanks for the insights, @danmoseley. I'll set the stress pipeline at that schedule then.
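
A sketch of the follow-up safern describes (an assumption about a separate change, not part of this PR): the stress pipeline's schedule could use the 13:00/22:00 UTC slots suggested above, offset from the 08:00/20:00 UTC rolling builds:

schedules:
- cron: "0 13,22 * * *"       # offset from the rolling builds at 08:00/20:00 UTC
  displayName: Twice-daily stress runs
  branches:
    include:
    - main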

@danmoseley danmoseley merged commit fd5a65c into dotnet:main Oct 2, 2021
@ViktorHofer (Member) commented:

I assume we want to do the same for the rolling builds of runtime-staging?

@steveisok (Member) commented:

I assume we want to do the same for the rolling builds of runtime-staging?

I'm fine with that.

/cc @lewing @SamMonoRT

@safern safern deleted the runtimeBuildTwiceADay branch October 4, 2021 17:27
@ViktorHofer (Member) commented:

@safern would you be the right person to change runtime-staging as well?

@safern (Member, Author) commented Oct 4, 2021

@ViktorHofer... yes I'm taking care of those as well. I was wondering, @jkotas @BruceForstall @hoyosjs should we also move the coreclr outerloop rolling build to be scheduled rather than triggered?

@jkotas (Member) commented Oct 4, 2021

should we also move the coreclr outerloop rolling build to be scheduled rather than triggered?

Yes, I think so. Thank you!

@BruceForstall (Contributor) commented:

should we also move the coreclr outerloop rolling build to be scheduled rather than triggered?

IMO, no. I think the coreclr outerloop job, as it currently runs in batch trigger mode, is very useful as-is for relatively quickly identifying when a problem was introduced. It's also quite useful for validating the infrastructure state of the system. The JIT team also frequently triggers outerloop runs on PRs, and having the baseline outerloop job to compare against is very useful.

@safern (Member, Author) commented Oct 4, 2021

We could always use a model with 3 runs a day; that way there won't be many changes between each job, keeping it straightforward to identify what broke.

@BruceForstall (Contributor) commented:

It looks like we had 36 runs of coreclr outerloop last week; about 5 per day. Does restricting to 3 per day really matter?

@jkotas (Member) commented Oct 5, 2021

We are hitting our Azure quotas. Reducing 5 runs to 3 runs per day is a 40% saving.

@safern (Member, Author) commented Oct 5, 2021

Does restricting to 3 per day really matter?

IMO, those 2 runs consume a lot of build and test resources; cutting them definitely gives PRs more resources and lowers Helix and build wait times.

@BruceForstall (Contributor) commented:

What if we split by architecture? E.g., do a rolling, batched, win-x64 runtime-coreclr outerloop job, as today, but only do the non-win-x64 platforms twice a day? This way we get "minimal" pri-1 coverage regularly, and "full" platform coverage less frequently.

@danmoseley (Member) commented:

If the resources come to about the same, seems reasonable to me.

@lewing (Member) commented Oct 5, 2021

So run the less stable platforms, which benefit more from testing, less often?

@safern (Member, Author) commented Oct 6, 2021

I believe that the yml change to do that would be more complicated, as we would need to split it into multiple job declarations in the yml and condition each one on the build reason. I could help with that, but is it really worth it? Losing 2 runs a day doesn't lose that much coverage; if we were only running once a day I would agree, but I think we would still be covered with something like 11am, 6pm and 2am. Also, every run takes from 2:45hrs to 3:40hrs, so we would only be leaving commits uncovered for around 2 hours in each interval.
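
For illustration, a rough sketch of the kind of build-reason split being described (job names and steps are hypothetical; the Build.Reason values Schedule and BatchedCI are the standard Azure Pipelines ones):

jobs:
- job: outerloop_win_x64
  # runs on every rolling build, batched CI or scheduled, keeping the minimal pri-1 coverage
  steps:
  - script: echo "build and run win-x64 outerloop tests"
- job: outerloop_other_platforms
  # runs only when the build was started by the cron schedule
  condition: eq(variables['Build.Reason'], 'Schedule')
  steps:
  - script: echo "build and run outerloop tests for the remaining platforms"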

@BruceForstall (Contributor) commented:

So run the less stable platforms, which benefit more from testing, less often?

All of these options are compromises. My idea was to at least quickly detect all-platform issues. As I wrote before, I personally prefer not to reduce the testing at all.

I believe that the yml change to do that would be more complicated

Yes. Maybe cloning the runtime-coreclr outerloop pipeline entirely and creating a runtime-coreclr outerloop-rolling would be the easiest way, even if it duplicates a bunch of YML.

Anyway, I just wanted to throw out another idea to try to achieve better coverage levels with the same proposed level of cost.

It's easy to quantify the cost savings of reducing testing (and maybe the increased machine costs of increased testing). However, I don't know how to evaluate the productivity cost of reduced coverage (e.g., when a regression occurs, there are more commits to bisect or examine to find it). I also don't know the cost constraints we are operating under, or how to evaluate that against other costs. So I will leave the decisions here to those who have insight into those aspects.

@safern (Member, Author) commented Oct 6, 2021

However, I don't know how to evaluate the productivity cost of reduced coverage (e.g., when a regression occurs, there are more commits to bisect or examine to find it). I also don't know the cost constraints we are operating under, or how to evaluate that against other costs. So I will leave the decisions here to those who have insight into those aspects.

What if we move it to scheduled with the times I proposed above and evaluate whether that works? If we find that it is indeed harder to track down an offending commit, or that it's hurting productivity, we can then move some legs back to rolling.

@BruceForstall (Contributor) commented:

What if we move it to scheduled with the times I proposed above and evaluate whether that works? If we find that it is indeed harder to track down an offending commit, or that it's hurting productivity, we can then move some legs back to rolling.

If we are required to reduce CI usage then yes, we can do that.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 6, 2021