Skip to content

Fix race on startup with HTTP server#5107

Open
SirTyson wants to merge 1 commit intostellar:masterfrom
SirTyson:fix-http-startup-race
Open

Fix race on startup with HTTP server#5107
SirTyson wants to merge 1 commit intostellar:masterfrom
SirTyson:fix-http-startup-race

Conversation

@SirTyson
Copy link
Contributor

Description

Resolves #5099

This PR adds a check to HTTP endpoints and exits early if core is not in the correct state to service the endpoint. I've audited the endpoints and divided them into three categories:

  • Ledger State Dependent Endpoints (i.e. anything with a LedgerHeader or BucketList dependency). Here, we check that LedgerManager is not in LM_BOOTING_STATE such that there is some valid LCL state to query. These endpoints include:

    • /tx
    • /sorobaninfo
    • /upgrades
    • /dumpproposedsettings
    • /self-check
    • /maintenance
    • /getledgerentryraw
    • /getledgerentry
    • /info
  • Herder Dependent Endpoints. These don't need ledger state, but require overlay and herder to be initialized:

    • /peers
    • /connect
    • /droppeer
    • /bans
    • /unban
    • /quorum
    • /scp
    • /stopsurvey
    • /getsurveyresult
    • /startsurveycollecting
    • /stopsurveycollecting
    • /surveytopologytimesliced
  • No dependencies. These include all test-only endpoints for simplicity, as well as diagnostic info that should probably always be live, such as /metrics and /ll.

I thought about just delaying starting the endpoint (like the comment initially mentioned), but given the current length of our startup time, we need endpoints like /metrics available at a minimum. It is also probably better from a consumer view for captive-core to reply with a "not ready" status instead of less graceful errors.

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@SirTyson SirTyson requested review from Copilot and marta-lokhova and removed request for Copilot January 23, 2026 22:31
@marta-lokhova
Copy link
Contributor

apologies for reviewing late; for this type of bug (which sadly it's not the first time we're running into), I'd love us to think about a robust solution of type do-it-once-and-forget. Right now we have to update every HTTP endpoint, so I worry whoever is adding a new HTTP endpoint later is going to forget to add the guard (I totally would). We also depend on the exact booting sequence - e.g. ideally CommandHandler doesn't care about the state of Herder - that sequence can change anyway (and it has in the past).

Maybe we can do something simpler?

  • add a bool to CommandHandler, mIsReady=false. Add setReady method to be called by LM when state is loaded.
  • in safeRouter, always check if mIsReady is set, and either proceed by calling regular functions, or return some generic "core is booting".

This is assuming consumers actually care about if the endpoint is reachable. To my knowledge, many search for things like "Synced!" in the info result, so from their point-of-view, core won't be up anyway.

{
mBucketListSnapshots[std::this_thread::get_id()] =
bucketSnapshotManager.copySearchableLiveBucketListSnapshot();
auto [live, hotArchive] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious - what's the motivation behind this specific change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

'tx' HTTP endpoint races with startup

3 participants