Revision to the nv22 timeline #166

Closed
rjan90 opened this issue Feb 16, 2024 · 4 comments

Comments

@rjan90
Contributor

rjan90 commented Feb 16, 2024

Background for the revision

During reviews of the migration code for network version 22, it became apparent that the complexities in the migration for Direct Data Onboarding need an extra revision to land with confidence and correctness.

More specifically, the difficulty lies in ensuring that cached migrations are handled correctly during re-orgs, which is complex in this migration. Unfortunately, relying solely on a non-cached migration, which is a lot more straightforward, isn't viable: benchmarks put the non-cached migration at around 10 minutes, which would not be acceptable.

Tentative proposal for revised timeline

The implementer teams are currently aligning on a new proposed timeline, which would revise the dates as follows:

  • Calibration upgrade: March 6th (from previously February 27th)
  • Mainnet upgrade: April 2nd (from previously March 26th)
    • Edited to April 2nd from April 3rd after some feedback.

Forest and Lotus have currently aligned that these dates sound okay (link to discussion in #fil-implementers thread), but we are awaiting a final okay from Venus. Please use the above timeline as a heads-up and guidance for now. I expect that we will land on a final proposal to revise the timeline no later than 2024-02-19 - 13:00:00Z.

@jennijuju
Member

jennijuju commented Feb 19, 2024

Thanks @rjan90!

  • 👍 on the updated timeline; the extra time is deemed necessary for a smooth upgrade. In discussion with Orjan on whether we could still make it by EOMarch, it was flagged that there is an Easter holiday at the end of March during which Lotus & Forest team members will be OOO. To make sure that implementers are on call to monitor the upgrade, I agree it is the right decision to postpone to the first week of April.
  • It is not ideal that the upgrade slips past the Q1 timeline. We should organize a retro to troubleshoot what caused the delay (across governance, implementation development, and the testing process) and how we could've mitigated it. @luckyparadise could you please help organize one?
  • re migration, 2 trade-offs imho should've been considered:
    • network/service uptime vs. consistent state post-migration: if the state migration with cache seems to be error-prone & might lead nodes to take longer to recover if the migration ends up in a bad state, it might be worth node operators considering the non-cached route even tho it takes longer.
      • avoid bad-state nodes that would increase service downtime
      • proactively schedule a service maintenance window and inform users
    • temporarily disabling features vs. a smoother migration: it seems like the challenge with cached migration is around state diffs under re-orgs. I think in this case clients could've considered disabling functionality that impacts deal state, mainly PSD, for a couple of hours before the upgrade, just to reduce the risk of migration state mismatch.
  • more on migration: It is great that the lotus and forest teams are working closely on testing client implementations before the calibration release, including ensuring that both clients produce the same post-migration state. It is worth calling out that the network snapshot service must have lotus & forest state validation check implemented before this upgrade to ensure a healthy chain snapshot that is delivered to the users post upgrade.
  • I am curious if the protocol change could've been made w/o such a heavy migration.

@lemmih

lemmih commented Feb 20, 2024

> It is worth calling out that the network snapshot service must have lotus & forest state validation check implemented before this upgrade to ensure a healthy chain snapshot that is delivered to the users post upgrade.

Could you elaborate on these state validation checks? Are these general sanity checks, or are they also NV22-specific?

@jennijuju
Member

> > It is worth calling out that the network snapshot service must have lotus & forest state validation check implemented before this upgrade to ensure a healthy chain snapshot that is delivered to the users post upgrade.
>
> Could you elaborate on these state validation checks? Are these general sanity checks, or are they also NV22-specific?

I think in general snapshot services should implement state checks across lotus and forest nodes before publishing a snapshot.
And I think it’s even more critical for post-upgrade snapshots, given that the chain could be more reorg-y and nodes are more prone to state mismatch after a heavy migration.
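
As a rough illustration of what such a cross-implementation check could look like, here is a minimal sketch in Python. It is not taken from any snapshot service, Lotus, or Forest: the names `StateRootFetcher`, `find_state_mismatches`, and `safe_to_publish` are hypothetical, and how each node's state root is actually fetched is deliberately left out. The only assumption is that each node can report a state root (as a string) for a given epoch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: given an epoch, return the state root (as a string)
# that one node reports for it. The fetching mechanism is out of scope.
StateRootFetcher = Callable[[int], str]


@dataclass
class Mismatch:
    epoch: int
    roots: Dict[str, str]  # node name -> state root reported at that epoch


def find_state_mismatches(
    fetchers: Dict[str, StateRootFetcher], epochs: List[int]
) -> List[Mismatch]:
    """Return every checked epoch at which the nodes disagree on the state root."""
    mismatches = []
    for epoch in epochs:
        roots = {name: fetch(epoch) for name, fetch in fetchers.items()}
        # More than one distinct root means at least two nodes disagree.
        if len(set(roots.values())) > 1:
            mismatches.append(Mismatch(epoch, roots))
    return mismatches


def safe_to_publish(
    fetchers: Dict[str, StateRootFetcher], epochs: List[int]
) -> bool:
    """Allow publishing only if all nodes agree at every checked epoch."""
    return not find_state_mismatches(fetchers, epochs)


if __name__ == "__main__":
    # Made-up roots for illustration only; real values would come from the nodes.
    lotus = {100: "root-a", 101: "root-b"}
    forest = {100: "root-a", 101: "root-x"}
    fetchers = {"lotus": lotus.__getitem__, "forest": forest.__getitem__}
    for m in find_state_mismatches(fetchers, [100, 101]):
        print(f"epoch {m.epoch}: {m.roots}")
```

Around an upgrade, the checked epochs would presumably include ones on both sides of the upgrade height, so that a divergent post-migration state blocks the snapshot from being published.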

@luckyparadise
Collaborator

Can I close this issue @rjan90 ? It appears we have a consensus on this matter now.
