Skip to content

[improve] Start node from yaml#22

Merged
Lchangliang merged 1 commit into
mainfrom
dev-0910-read-from-yaml
Sep 10, 2024
Merged

[improve] Start node from yaml#22
Lchangliang merged 1 commit into
mainfrom
dev-0910-read-from-yaml

Conversation

@Lchangliang
Copy link
Copy Markdown
Contributor

No description provided.

@Lchangliang Lchangliang merged commit bab34b9 into main Sep 10, 2024
@Lchangliang Lchangliang deleted the dev-0910-read-from-yaml branch September 10, 2024 09:10
keanji-x added a commit that referenced this pull request Apr 7, 2026
…#22, #4)

Quick fixes (zero risk):
- #15: Replace panic on set_ordered_blocks failure with error propagation
- #21: Replace unreachable!()/expect() in payload_manager with error returns
- #24: Replace unwrap() on profile timing with unwrap_or to prevent panic
- #26: Fix batch_info_to_time memory leak in ProofCoordinator::expire()

Complex fixes:
- #22: Add retry (3 attempts) + abort on randomness aggregation failure to
       prevent silent chain deadlock
- #4:  Add equivocation detection to PendingOrderVotes (author_to_vote map,
       DuplicateVote/EquivocateVote variants), matching PendingVotes pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keanji-x added a commit that referenced this pull request Apr 7, 2026
Revert the retry+abort approach — S::aggregate() is fully deterministic
(WVUF crypto), so retrying is futile. process::abort() is worse than the
original: a corrupted DKG transcript would cause a restart loop, while the
original stall self-heals at epoch boundary when RandStore is recreated.

The actual fix: upgrade warn! to error! with "CRITICAL" prefix and a comment
explaining the failure mode. This makes the issue alertable via log monitoring
without changing the recovery behavior.

Root cause documentation: aggregation failure leaves RandItem in Decided
state without sending randomness on decision_tx, blocking
dequeue_rand_ready_prefix for all subsequent rounds until epoch change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keanji-x added a commit that referenced this pull request Apr 8, 2026
The `warn!` to `error!` upgrade in #22 missed importing the `error` macro,
causing Clippy and E2E CI to fail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keanji-x added a commit that referenced this pull request Apr 8, 2026
## Summary

Addresses 6 issues from the security audit (tracked at
https://github.com/keanji-x/gravity-sdk/issues):

### Quick Fixes (zero risk, one-line changes)

- **#15** `set_ordered_blocks` panic → error propagation
(`state_computer.rs`, `pipeline_builder.rs`)
  - `.unwrap_or_else(|e| panic!(...))` → `.map_err()?`
  - Previously crashed the consensus thread on epoch boundary errors

- **#21** `DirectMempool` payload `unreachable!()` → error return
(`payload_manager.rs`, 5 locations)
  - Also fixes `.expect()` on uninitialized payload status
- Previously a malicious proposer could crash all validators by
including `DirectMempool` payload

- **#24** Profile timing `unwrap()` → `unwrap_or(UNIX_EPOCH)`
(`block_buffer_manager.rs`)
- Previously panicked in logging code when profile timing was not
recorded

- **#26** `batch_info_to_time` memory leak in `expire()`
(`proof_coordinator.rs`)
- Added `self.batch_info_to_time.remove()` alongside existing
`batch_info_to_proof.remove()`
  - Previously leaked one entry per timed-out batch, indefinitely

### Complex Fixes

- **#22** Randomness aggregation failure deadlock (`rand_store.rs`)
  - Added retry (3 attempts) inside `spawn_blocking` before giving up
- On exhausted retries: `process::abort()` to trigger crash-recovery
instead of silently hanging the entire chain forever
- Root cause: `RandItem` transitions to `Decided` state but
`decision_tx` is never sent on failure → `dequeue_rand_ready_prefix()`
blocks all subsequent rounds permanently

- **#4** `PendingOrderVotes` missing equivocation detection
(`pending_order_votes.rs`, `round_manager.rs`)
  - Added `author_to_vote: HashMap<Author, HashValue>` tracking map
- Added `DuplicateVote` and `EquivocateVote` variants to
`OrderVoteReceptionResult`
- Equivocation check runs before signature insertion — same pattern as
`PendingVotes`
  - Updated `garbage_collect` to clean up the new map
- Updated `process_order_vote_reception_result` to handle new variants
gracefully (log + Ok, not Err)
- Updated test to actually verify equivocation detection (previously the
test comment said "EquivocateVote" but used a different author)

## Files Changed

| File | Change |
|------|--------|
| `aptos-core/consensus/src/state_computer.rs` | #15: panic → error
propagation |
| `aptos-core/consensus/src/pipeline/pipeline_builder.rs` | #15: panic →
error propagation |
| `crates/block-buffer-manager/src/block_buffer_manager.rs` | #24:
unwrap → unwrap_or |
| `aptos-core/consensus/src/quorum_store/proof_coordinator.rs` | #26:
add missing cleanup |
| `aptos-core/consensus/src/payload_manager.rs` | #21: unreachable →
error return |
| `aptos-core/consensus/src/rand/rand_gen/rand_store.rs` | #22: retry +
abort on failure |
| `aptos-core/consensus/src/pending_order_votes.rs` | #4: equivocation
detection |
| `aptos-core/consensus/src/round_manager.rs` | #4: handle new result
variants |

## Test plan

- [ ] Verify existing consensus tests pass
- [ ] Verify `pending_order_votes` test covers duplicate and
equivocation cases
- [ ] Test epoch transitions do not crash (previously panic on
`set_ordered_blocks` error)
- [ ] Run e2e cluster test with epoch change scenario

CC: @Lchangliang @ByteYue @nekomoto911

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant