1.10 Issue Burndown #86
All SIG-Azure issues have been curated.

Thank you so much! This takes the organizational cake.
As of today, we have 21 issues open against the 1.10 milestone, which is the same as yesterday. However, several of those issues have moved from yellow to green status because of PRs being approved/fixed, so we can expect a drop in the number of issues over the weekend. On the down side, several issues are of special concern, as they represent severe problems which may throw off the release schedule. These are all in the Red section and detailed there. Tracking spreadsheet is here and is up to date as of this afternoon.

Red: Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression. Depending on how fixing it goes, we may have to punt on it for 1.10.0 and wait for a fix in 1.10.1.
These two are really the same issue, and show what may be a very large performance regression in 1.10 even without a Stale Reads fix.

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.

Green: Issues with an approved PR which are just waiting for labels, release notes, or automation.

Kick: Issues just waiting for the grace period to elapse before being kicked out of 1.10.

Special Issues: Primarily tracking issues.

Kick: Issues which are waiting for automation to kick them out of the milestone.
As of around 10am PST today, we have 25 issues open against the 1.10 milestone, which is an increase of 4 from Friday. Most of the new issues are actually breakouts of a larger test fail issue, in order to have one issue per SIG for resolution (see test fails below). Tracking spreadsheet is here.

Red: Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. Test flakes are fixed, so hopefully we can get confirmation (or not).

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.

New Test Fails: We've added a bunch of test fail tracking over the weekend. These represent related test fails across several test suites, with individual assignees. All are considered Yellow right now, as they either have fixes in progress or are too new to be considered stuck. Most of these are being tracked from issue #60003.

Green: Issues with an approved PR which are just waiting for labels, release notes, or automation.

Special Issues: Primarily tracking issues.

Kick: Issues just waiting for the grace period to elapse before being kicked out of 1.10.
As of around 11am PST today, we have 26 issues open against the 1.10 milestone, which is an increase of 1 from yesterday. Most issues are various test failures, as non-test-fail issues are getting resolved. Tracking spreadsheet is here.

Red: Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. The SIG thinks we have a major regression here, even though test flakes are making it hard to verify.

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.

Test Fails In Progress: The vast majority of issues open now are test fails. All of the below are in progress in some way, but don't yet have a clear resolution. Many of these are the usual upgrade test failures. In at least one case, we need to unfreeze the 1.9 tree to fix the test. It is not clear at this point whether we have a general upgrade issue the way we did in 1.9.

Green: Issues with an approved PR which are just waiting for labels, release notes, or automation.

Special Issues: Primarily tracking issues.
As of around 4pm PST today, we have 22 issues open against the 1.10 milestone, which is a decrease of 4 from yesterday. Most issues are various test failures, as non-test-fail issues are getting resolved. Tracking spreadsheet is here. Critical concerns right now are:

Red: Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. The SIG thinks we have a major regression here, even though test flakes are making it hard to verify.
These two failing tests are receiving zero attention from their respective SIGs, 3 days after notice. The SIGs have been bothered on Slack.

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.

Test Fails In Progress: The vast majority of issues open now are test fails. All of the below are in progress in some way, but don't yet have a clear resolution. Many of these are the usual upgrade test failures.
These two test fails seem to require modifying tests/code for 1.9 in order to fix. We need a hotfix to allow the owners to do this, and then we need a better procedure for handling upgrade tests in the future so that it doesn't lead to needing to patch tests on an older, frozen version.
This issue is unconfirmed and not yet assigned to 1.10, but could be causing some of the test failures above:

Green: Issues with an approved PR which are just waiting for labels, release notes, or automation.

Special Issues: Primarily tracking issues.
As of around 4pm PST today, we have 18 issues open against the 1.10 milestone, which is a decrease of 4 from Wednesday, and a great trajectory to be on. Most issues are various test failures, as non-test-fail issues are getting resolved, and all test fails are now getting attention. Tracking spreadsheet is here. Critical concerns right now are:

Red: Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. The SIG thinks we have a major regression here, even though test flakes are making it hard to verify.
The Stale Reads issue is still a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major scalability regression. However, it looks highly unlikely to be resolved in the next 2 weeks, so we are recommending taking it out of the 1.10 milestone.

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.

Test Fails In Progress: The vast majority of issues open now are test fails. All of the below are in progress in some way, but don't yet have a clear resolution. Many of these are the usual upgrade test failures.
These four test fails seem to require modifying tests/code for 1.9 in order to fix. We need a hotfix to allow the owners to do this, and then we need a better procedure for handling upgrade tests in the future so that it doesn't lead to needing to patch tests on an older, frozen version. Either that, or we have to decide to ignore the upgrade tests for 1.10.

Green: Issues with an approved PR which are just waiting for labels, release notes, or automation.

Special Issues: Primarily tracking issues.
As of around 1pm PDT today, we have 15 accepted issues open against the 1.10 milestone, plus four possibles, which is a decrease of 3 from Friday. Further, four issues are likely to close in the next 2-5 hours as fixed tests pass. However, we have a couple of major blocker issues which could delay the release; see Red below. Tracking spreadsheet is here.

Red: The big potential release-derailer is the set of major performance regressions, possibly due to unidentified performance changes in etcd. At this point, it is unclear if the etcd issues account for all of the problems, or if changing the etcd version/settings will fix them.
This is an apparently unrelated increase in memory used by the API server. @shyamjvs has been hard at work bisecting for this, and may have found a culprit (a rough sketch of that kind of bisect workflow appears at the end of this update).
IMHO, these two performance regressions are significant enough to warrant a release delay.
We also have two test fails which have been receiving no attention. While neither looks that bad, I'm flagging them because we don't actually know what's causing them:
The Stale Reads issue has been removed from 1.10.

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.

Test Fails In Progress: All of the below are in progress in some way, but don't yet have a clear resolution.

Possibles: These three issues may be 1.10 issues; they were recently reported and seem related to other issues with 1.10. However, none of them have been examined by the SIGs yet. None look like release-blockers.

Green: Issues with an approved PR which are just waiting for labels, release notes, or automation. A bunch of these are tests which are being fixed now that code was merged into 1.9; we're just waiting on them to pass.

Special Issues: Primarily tracking issues.
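For readers who haven't driven a regression hunt like this before, here is a minimal, hypothetical sketch of how a memory regression can be chased with `git bisect run`. The build step, the `measure_apiserver_mem.sh` script, and the 300 MiB threshold are all placeholders for illustration; they are not the actual scalability tooling being used for this bug.

```python
#!/usr/bin/env python3
# Hypothetical `git bisect run` predicate for hunting an API-server memory
# regression. Exit 0 marks the commit "good", exit 1 marks it "bad", and
# exit 125 tells bisect to skip commits that cannot be built or measured.
# The build and measurement commands below are placeholders, not real tooling.
import subprocess
import sys

THRESHOLD_MIB = 300.0  # made-up acceptance threshold for peak apiserver memory


def run(cmd):
    """Run a shell command and return the completed process."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)


build = run("make quick-release")  # placeholder build step
if build.returncode != 0:
    sys.exit(125)  # commit doesn't build: tell bisect to skip it

measure = run("./hack/measure_apiserver_mem.sh")  # placeholder: prints peak MiB
if measure.returncode != 0:
    sys.exit(125)  # measurement itself failed: skip rather than mislabel

peak_mib = float(measure.stdout.strip())
print(f"peak apiserver memory: {peak_mib:.1f} MiB")
sys.exit(0 if peak_mib < THRESHOLD_MIB else 1)
```

With a predicate like this saved as `check_mem.py`, the search itself is just `git bisect start <bad-commit> <good-commit>` followed by `git bisect run ./check_mem.py`; the real work, of course, is making the measurement stable enough that flakes don't mislabel commits.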
The burndown report as of 10am PDT March 13 has been DELETED because issues with Google Sheets made it inaccurate. A new burndown will follow shortly.
Corrected March 13th burndown report: status as of noon PDT, March 13th. We still have 19 issues open. Some of these don't show up in GitHub searches because they are not labelled correctly, which is being resolved (a small sketch of querying the milestone directly, bypassing labels, is at the end of this update). We are waiting for a bunch of changes to go into 1.9 to fix the downgrade tests, which should clear up some of the pending issues list. This does point to needing a change in how we do downgrade tests in the future. There are several major regressions without resolution right now, sufficient to delay ending code freeze, and possibly the final release, per the burndown meeting this AM.

Red: Issues which are blockers, or whose status is unknown and look serious, without a good PR.
Performance issues in analysis/progress:
Failing tests with currently unknown causes:
Regression in progress, but fix untested. The issue was accidentally dropped by the SIG and just picked up again:

Yellow: Issues which are blockers, with a good PR. Also undecided issues, with or without PRs, which look like they won't be considered 1.10 bugs.

Test Fails in Progress

1.9 downgrade test issues: These issues are waiting for code on 1.9 in order to fix the downgrade tests.

Green: Non-blocker issues.

Tracking Issues
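As an aside on the labelling problem above: milestone membership can be listed straight from the GitHub API, so mislabelled issues still show up. This is a rough, hypothetical sketch, not the release team's actual tooling; the milestone number is a placeholder you would look up first from the repository's milestones endpoint.

```python
#!/usr/bin/env python3
# Rough sketch: list every open issue in the kubernetes/kubernetes v1.10
# milestone via the GitHub API, so issues with missing or wrong labels are
# still visible. MILESTONE is a placeholder; look up the real number with
# GET /repos/kubernetes/kubernetes/milestones.
import requests

MILESTONE = 123  # placeholder: the v1.10 milestone number
URL = "https://api.github.com/repos/kubernetes/kubernetes/issues"
params = {"milestone": MILESTONE, "state": "open", "per_page": 100}

for item in requests.get(URL, params=params).json():
    if "pull_request" in item:
        continue  # the issues endpoint also returns PRs; skip them
    labels = ", ".join(l["name"] for l in item["labels"]) or "(no labels)"
    print(f"#{item['number']}: {item['title']} [{labels}]")
```

This only grabs the first page of results; for a large milestone you would follow the pagination links as well.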
/cc
Status as of noon PDT, March 14th. We have three issues which were re-opened yesterday, because they were closed in advance of verifying the tests. Several of the upgrade/downgrade tests have been fixed, but we are waiting on all tests to pass before we actually clear them, since several test fails have been closed and reopened. A big thanks to @liggitt for pursuing those. Overall status is "Crimson". We have multiple unclosed issues, any of which are sufficient to block release, and two of which (performance/scalability and ) have no specific timeline for resolution. We also have one trailing feature of unknown status. Further delaying the release seems more likely than not.

Red: Issues which are blockers, or whose status is unknown and look serious, without a good PR.
Performance issues in analysis/progress. Some of the performance and scalability issues have been resolved, and others have been broken out into more specific issues. There are some issues (see Green below) which are not expected to be resolved for 1.10, but are regarded as non-blockers. Many thanks to @shyamjvs for diving into these regressions!
Failing tests with currently unknown causes:
Regression in progress, but fix not passing tests:
Orphaned feature, awaiting response from SIG:

Yellow: Issues which are blockers, with a good PR. Also undecided issues, with or without PRs, which look like they won't be considered 1.10 bugs.

Test Fails in Progress: These are currently all upgrade tests.

Green: Non-blocker issues (expected to remain broken for 1.10, need to add release note):
Flaky timeouts while waiting for RC pods to be running in density test
Resolved, pending having all tests passing:
Apiserver CPU/Mem usage bumped to 1.5-2x in big clusters
Resolved, waiting for automation:
zsh completion throws error v1.10.0-beta.2

Tracking Issues
As of around noon PDT today, we have 12 accepted issues open against the 1.10 milestone, which is a decrease of 3 from yesterday. Further, four issues are likely to close in the next 2-5 hours as fixed tests pass. However, we have a couple of major blocker issues which could delay the release; see Red below. Tracking spreadsheet is here.

Red: The big potential release-derailer is the two major performance regressions:
These two performance regressions are considered significant enough to warrant a release delay. Work on them, including git bisect and scalability testing, has been ongoing. This is slow due to the relatively small number of folks who understand kube scalability and the scalability tests. Are you a performance geek who wants to get involved with Kubernetes? We could use you.
This test fail may be related to the fluentd performance issues, but the root cause is unknown:
Pod deletion has a problem with race conditions; work is in progress, but the initial patch attempt needs work:

Yellow: Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
The problem with being unable to delete PVCs on downgrade is in progress, in the form of manual downgrade docs and a patch for the tests, with an actual fix due in 1.9.5:
This GCE deprecated flag issue has a PR in progress and near approval. It is currently breaking a lot of unrelated tests:
The rest of the Daemonset Scheduling work looks likely to be postponed until 1.11, but it's unclear at this point what would be required to back out committed work:

Green: Issues that are non-blockers or expected regressions and are expected to remain issues after the 1.10.0 release:
Issues with an approved PR which are just waiting for labels, release notes, or automation:
Test fails which have been fixed, but we're waiting for a couple of days of green before we stop watching them:

Special Issues: Primarily tracking issues.
As of around noon PDT today, we have 8 accepted issues open against the 1.10 milestone, which is a decrease of 4 from yesterday. At this point, we have three outstanding areas of work, which relate to multiple issues: the performance regressions, PVC protection downgrade, and Daemonset scheduling. Everything else known is resolved. Tracking spreadsheet is here.

Red: Performance regressions are in progress, but still not completely nailed down. Bisect has revealed a candidate issue which is possibly due to an already-reverted PR, and as such the release team wants to get an RC built so that they can really test the tweaks already made:
SIG-Storage is working to fix the failing test for downgrade of protected PVCs. @liggitt is working on a test patch to implement the manual instructions so that we can complete the downgrade tests. Risk: some of the other downgrade tests may start failing now that they can finish running.

Yellow: The Daemonset scheduling feature is cleared to go into 1.10. All code updates have been merged, although one refactoring PR is deferred to 1.11. The remaining open PR is docs, plus release notes are needed. Risk: we may break new tests with this morning's merge.
We also have one more miscellaneous test fail in progress:

Special Issues: Primarily tracking issues.
Status as of noon PDT, March 19th. Overall status is "saffron" (yellow with some orange). While the majority of bugs are either closed or have short-timeline plans for closure, we still have outstanding performance issue(s) whose cause is unknown.

Red: Issues which are blockers, or whose status is unknown and look serious, without a good PR.
Performance issues in analysis/progress. With the fluentd patches, the performance issues have been addressed within acceptable tolerances (there is increased resource usage in this version of Kubernetes, period). Except this one, whose cause is still unknown:

Yellow: Issues which are blockers, with a good PR. Also undecided issues, with or without PRs, which look like they won't be considered 1.10 bugs.

Features
PVC protection: This is being dealt with as a documentation bug with a documented workaround. There is a patch in progress against 1.9 that will make PVC downgrade work, to come out with 1.9.6. In the meantime, users who have a lot of PVCs should be encouraged to wait to upgrade until 1.9.6 is out.

Bugs

Test Fails in Progress
Both of these test fails are related to Daemonset scheduling, and should turn green soon now that the PRs are merged. Hoping!

Green: Non-blocker issues (expected to remain broken for 1.10, generally need to add release note):
Flaky timeouts while waiting for RC pods to be running in density test

Tracking Issues
Status as of 11am PDT, March 20th. Overall status is "tangerine" (trending red). While the majority of bugs are closed, we still have outstanding performance issue(s) whose cause is unknown and which may delay the release. We also have several other unrelated issues which need fixing. Tracking sheet is here, as always.

Red: Release blockers without a resolution timeline of less than 24 hours.
Performance issues in analysis/progress. There are two, which may be related, causing unacceptable performance on GCE. The cause of these may be in some way related to fluentd, but that doesn't make them a non-blocker:
GKE tests are no longer running due to some issue with GKE, and we need this resolved before we release:

Yellow: Issues which are blockers but are expected to resolve in 24 hours. Also undecided issues, with or without PRs, which look like they won't be considered 1.10 bugs.
The Daemonset Scheduling feature needs to be reverted, or more accurately neutralized by disabling the alpha gate, before release:
PVC protection workaround for 1.10.0, with a fix pending for 1.9. This shouldn't be a blocker anymore:

Green: Non-blocker issues (expected to remain broken for 1.10, generally need to add release note):
Flaky timeouts while waiting for RC pods to be running in density test

Tracking Issues
Status as of 11am PDT, March 21st. Happy Nowruz! "Zardi-ye man az to, sorkhi-ye to az man" ("my yellowness to you, your redness to me") seems particularly appropriate here. Overall status is "straw" (light yellow). At this point, everything is resolved or resolving in the next few hours except for #60589, which suffers from having to make a potentially painful tradeoff.

Red: Release blockers without a resolution timeline of less than 24 hours.
This issue has been traced to a commit which was also a bugfix. At this point, we need opinions from multiple SIG leads about what to do on a reversion:

Yellow: Issues which are blockers but are expected to resolve in 24 hours. Also undecided issues, with or without PRs, which look like they won't be considered 1.10 bugs.
Issue with subpaths which would prevent someone from upgrading from specific versions of Kubernetes; it has a patch just waiting to be cherry-picked:

Green: The fluentd scaler issue has been fixed sufficiently that it is not a blocker for 1.10. There are still effects of it which will need fixing in future point releases:
GKE tests are now running. Daemonset scheduling and PVC protection issues have been resolved. Important release note for PVC protection regarding downgrades.
Non-blocker issues (expected to remain broken for 1.10, generally need to add release note):
Flaky timeouts while waiting for RC pods to be running in density test
1.10 is out now, closing.
This issue is for status updates on 1.10 issues for release tracking.