Freeze releases and website changes, pending cache fixes? #1416
Comments
Major +1 here. The frequent releases and website updates cause a full cache purge in Cloudflare every single time, putting a massive load on the origin that it is currently unable to handle, leading to Node.js becoming essentially unavailable to download. Until time is put into reworking the origin (likely moving it primarily to R2, with a Worker that handles fallback to the origin server), and into reworking how cache purging happens for releases, I agree that a freeze of releases and website updates makes sense to ensure the Cloudflare cache is retained and Node.js is actually available for folks to download (there's no point releasing new versions if folks can't download them or read the docs for them).
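For illustration only, here is a minimal sketch of the "R2 with a Worker falling back to the origin" idea described above. The bucket binding name, origin URL, and key layout are assumptions, not the actual Node.js configuration.

```ts
// Hypothetical Cloudflare Worker: serve dist files from an R2 bucket,
// falling back to the origin server when the object is not in R2 yet.
// `DIST_BUCKET` is an assumed binding name; the R2Bucket type comes from
// @cloudflare/workers-types.

export interface Env {
  DIST_BUCKET: R2Bucket;
}

// Placeholder origin, not the real download server.
const ORIGIN = "https://origin.example.org";

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // e.g. "dist/v20.5.0/SHASUMS256.txt"
    const key = new URL(request.url).pathname.slice(1);

    // Try R2 first.
    const object = await env.DIST_BUCKET.get(key);
    if (object) {
      const headers = new Headers();
      object.writeHttpMetadata(headers);
      headers.set("etag", object.httpEtag);
      return new Response(object.body, { headers });
    }

    // Fall back to the origin for anything not (yet) mirrored to R2.
    const originUrl = new URL(key, ORIGIN).toString();
    return fetch(new Request(originUrl, request));
  },
};
```

With this shape, most download traffic never reaches the origin at all, and the origin only sees requests for objects that have not been mirrored yet.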
Major +1 here. A random, unordered mental dump:

A few (potential) suggestions:
Linking to the cache purge tracking issue nodejs/build#3410
I don't have a lot of context here, but it sounds like there's pain, and it's great to relieve pain, so I'm all for whatever needs to be done here.
There are no good options here. The best outcome would be to have somebody redesign these pipelines so they only purge the URLs that are needed, or possibly nothing at all (and use stale-while-revalidate semantics). Given that this requires a volunteer to lead that effort, or funds, possibly the least bad options would be to:
I'm not happy with any of these, but I don't think we can do much better right now.
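For context on the "only purge the URLs that are needed" option above, a rough sketch of a targeted purge using Cloudflare's purge-by-URL API. The zone ID, token, and URL list are placeholders, and the real promotion tooling may look nothing like this.

```ts
// Rough sketch: purge only the files a release actually touches, instead of
// purging the whole zone. CF_ZONE_ID and CF_API_TOKEN are placeholders.

const CF_ZONE_ID = process.env.CF_ZONE_ID!;
const CF_API_TOKEN = process.env.CF_API_TOKEN!;

// Hypothetical list: only the index files and the new release's checksums change.
const changedUrls = [
  "https://nodejs.org/dist/index.tab",
  "https://nodejs.org/dist/index.json",
  "https://nodejs.org/dist/latest-v20.x/SHASUMS256.txt",
];

async function purgeChangedUrls(): Promise<void> {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/purge_cache`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${CF_API_TOKEN}`,
        "Content-Type": "application/json",
      },
      // "files" purges individual URLs; sending { purge_everything: true }
      // instead is the full purge we are trying to avoid. Cloudflare limits
      // how many URLs can be purged per call, so batch if the list is long.
      body: JSON.stringify({ files: changedUrls }),
    },
  );
  if (!res.ok) throw new Error(`purge failed: ${res.status}`);
}

// Alternatively, skipping purges entirely and serving with something like
// "Cache-Control: max-age=300, stale-while-revalidate=86400" is the
// stale-while-revalidate option mentioned above.
purgeChangedUrls();
```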
I have raised this before at https://github.com/nodejs/build, but since we are discussing hosting static files, is there a reason why we are managing the infrastructure ourselves and not using a managed service for this (e.g. Amazon S3, Azure Blob Storage, GitHub Pages, Cloudflare Pages, etc.)? I think the solutions suggested above are fine for the short term, but if there is no other reason, we should probably consider a managed solution for the long term. If this makes sense, I am glad to lead such an effort.
@MoLow Mostly money. Node.js infrastructure consumes very little of the foundation's money. Moreover, all of this was put in place a long time ago, and there were fewer options at the time.
I think it's also related to the fact that Node.js was downloaded a lot less when this was put into place many years ago.
Hey @mcollina, just to mention that your proposed solutions will not solve the situation (I guess that's why you mentioned bad options?) but only reduce the problems. From what you've explained, they would already reduce them somewhat significantly, but I'd say it's just a patch, because the moment we do cache invalidations the issue happens again, since our servers are simply unable to handle the load.
We are having talks about adopting Cloudflare R2; they offered us the R2 service (similar to AWS S3) for free, covering all the traffic and needs we have. It is a path we're exploring!
A managed solution still requires someone to "manage" it, or at least maintain it. In the case of R2 we need to write Cloudflare Workers and do a lot of initial configuration just to mirror our current setup. FYI, a lot of discussion is happening on
I think it's better to let the people at the Node.js Build WG, who understand the situation completely, lead this initiative technically. What we need is an ack from the TSC about this issue and confirmation that we're able to dedicate resources to it. Not to mention, what @ljharb suggested would already be a temporary "workaround" to improve the user experience, by reducing website builds and release "promotions". We still need someone (or a bunch of people) to be able to do the long-term plan...
I'd be OK with not invalidating for nightly and canary releases, or possibly doing those less often. For the others, I don't think releases are frequent enough that we should slow down the Current and LTS lines.
As @ovflowd mentions, the key question is what we do in the mid to long term, in terms of "We still need someone (or a bunch of people) to be able to do the long-term plan...". It sounds like @MoLow, who is a member of the Build WG, has offered to lead work on the mid- to longer-term plan in #1416 (comment), and I think it would be great to start working on that. I also think that, in terms of keeping things up and running even after we have new/better infrastructure, we either need people who can drop everything else when needed to address problems with the downloads, OR we set the expectation that it's best effort and there is no SLA: the downloads may not be available at any point in time, and people should plan for that. On this front I've asked for help from the Foundation in the past on the build side, presented to the board, worked with Foundation staff on summaries of work, etc., but unfortunately that did not result in resources that would let us be more proactive. It may be a different time, and/or the situation may be more urgent now, so looking at that again might make sense.
I completely forgot @MoLow was on the build team, +1 for him to lead the initiative!
Thanks for bringing this up. Currently, there is an LTS release in flight that I'd like to get out because it has a lot of anticipated changes (nodejs/node#48694). I had planned to get it out around 1:00 UTC to accommodate a "low activity" time, but that doesn't look like it's going to happen. Instead, I'm just going to get this release out as soon as possible (hopefully in the next 12 hours), and then in the next release meeting we can discuss optimal time frames for promoting builds.
Thanks @danielleadams, I'll be monitoring our infra and will let you know if anything weird happens 👀
In terms of actual releases, we're not doing them that often (for example, the last non-security 18.x release prior to the one @danielleadams is working on was back in April), so I don't think freezing releases would actually gain much. The last actual release, for example, was 20.4.0 on 5 July, and we've had plenty of issues since then without a new release being put out. We are purging the Cloudflare cache perhaps three or more times a day for the nightly and v8-canary builds; as far as the current tooling/scripts are concerned, there is no difference in how those are treated vs. releases (so it's one thing to say that maybe they should not be, but another to do the remedial work). And while frequent cache purges are certainly not helping the situation, I'm not convinced that the problem is entirely related to the Cloudflare cache.
I think perhaps the wording here, regarding freezing of releases, was intended to also capture the release of nightly/canary builds, as those also cause cache purges. While I agree that cache purging itself is probably not the core issue (the origin just seems to be rather unhappy), avoiding purging the cache many times a day is definitely going to massively improve the situation, as Cloudflare will be able to actually serve content from its cache rather than having it repeatedly wiped and traffic forced onto the struggling origin.
☝️ exactly this!
Same here. If we can avoid purging caches for nightly/canary releases, since it might not be all that necessary, that'd be great!
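To make the "skip purging for nightly/canary" idea concrete, a hypothetical sketch of how promotion tooling could gate the purge on release type. The function names and release-type values are assumptions for illustration, not taken from the actual build scripts.

```ts
// Hypothetical gate in a promotion script: only purge the CDN cache when
// promoting actual releases, not nightly or v8-canary builds.
type ReleaseType = "release" | "rc" | "nightly" | "v8-canary";

// Placeholder for whatever purge call the real tooling makes today.
async function purgeCdnCache(): Promise<void> {
  /* ... */
}

async function maybePurge(releaseType: ReleaseType): Promise<void> {
  const shouldPurge = releaseType === "release" || releaseType === "rc";
  if (!shouldPurge) {
    console.log(`Skipping cache purge for ${releaseType} build`);
    return;
  }
  await purgeCdnCache();
}
```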
I think we chose not to freeze releases or changes to the website, so this can be closed. Unless there are objections to closing this in the next few days, I'll go ahead and do that.
+1, the website is no longer served out of NGINX and releases are now served from R2, AFAIK, so I think this is no longer a problem.
Every time a commit is pushed to the website, or a release is done, I'm told the Cloudflare cache of nodejs.org/dist is purged, which causes a lot of server churn as the cache is repopulated and causes both nodejs.org/dist and iojs.org/dist to break.
During this time, anyone trying to install node may encounter 5xx errors; anyone using `nvm` to do anything remote may encounter 5xx errors (nvm relies on both index.tab files to list available versions to install); and any CI based on dynamically building a matrix from `index.tab` is likely to encounter 5xx errors.
I would offer my opinion that "changes to the website" are likely never more important than "people's ability to install node", and "a new release of node" is, modulo security fixes, almost never more important than that ability either.
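For illustration, a sketch of the kind of consumer that breaks here: a CI helper that builds its test matrix by fetching index.tab, assuming the file's tab-separated layout with `version` and `lts` columns. The retry-free error handling is deliberately naive to show where a 5xx during a purge window surfaces.

```ts
// Illustrative only: build a version matrix from index.tab. A 5xx during a
// cache-purge window makes this (and tools like nvm that read the same file)
// fail outright.
async function fetchLtsVersions(): Promise<string[]> {
  const res = await fetch("https://nodejs.org/dist/index.tab");
  if (!res.ok) {
    // This is exactly where the reported breakage shows up.
    throw new Error(`index.tab fetch failed: ${res.status}`);
  }
  const text = await res.text();
  const [header, ...rows] = text.trim().split("\n");
  const cols = header.split("\t");
  const versionIdx = cols.indexOf("version");
  const ltsIdx = cols.indexOf("lts");
  return rows
    .map((row) => row.split("\t"))
    .filter((fields) => fields[ltsIdx] !== "-") // keep LTS lines only
    .map((fields) => fields[versionIdx]);
}

fetchLtsVersions().then((versions) => console.log(versions));
```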
Fixing the problem requires people with all of access, ability, and time, and one or more of those has been lacking for a while. To be clear, I'm not complaining about this fact: everyone involved in node is doing their best to volunteer (or wrangle from an employer) what time they can. However, I think it's worth considering ways to avoid breakage until such time as a fix can be implemented.
Additionally, this seems like very critical infrastructure work that perhaps @openjs-foundation could help with; cc @rginn and @bensternthal for thoughts on prioritizing this work (funding and/or person-hours) for DESTF?
I'd love to hear @nodejs/build, @nodejs/releasers, and @nodejs/tsc's thoughts on this.
Related: nodejs/nodejs.org#5302, nodejs/nodejs.org#4495, and many more