Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(batch): Improve batch verifier async, correctness, and performance #4750

Merged
merged 20 commits into from
Jul 17, 2022

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Jul 5, 2022

Motivation

Zebra's tower-batch has some async performance issues, and contains some buggy and risky code.
We should replace it with more robust code.

Close #4738
Close #4740
Close #4729 (by using a channel that never lags)
Close #4728 (by removing the log entirely)

API Reference

tokio::spawn_blocking():
https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html

Solution

Improve tower-batch async fairness:

  • Prefer the batch timer if there are also new batch requests
  • Allow other tasks to run after each batch
  • Label each batch worker with the verifier's type in tokio-console

Improve error handling in tower-batch:

  • Check batch worker tasks for panics and task termination
  • Rename Handle to ErrorHandle, and fix up some docs

Improve channel limit handling in tower-batch:

  • Use a new channel for each batch
  • Use tokio's PollSemaphore instead of an outdated Semaphore impl

Improve channel limit handling in zebra-consensus:

  • Replace batch verifier broadcast channels with watch channels

Run all verifier cryptography on blocking CPU-bound threads

  • Run batch verification on blocking tokio threads
  • Run fallback verification on blocking tokio threads
  • Make flush and drop behaviour consistent for all verifiers
  • Switch batch verifier tests to the multi-threaded runtime

Related fixes:

  • Fix a bug in the v5_with_sapling_spends test (but it's still missing NU5 test vectors)

Review

This is blocking halo2 batch verification, so it would be good to get it in soon.
This might fix some bugs we're seeing during syncing.

Reviewer Checklist

  • Code implements Specs and Designs
  • CI passes

@teor2345 teor2345 added C-bug Category: This is a bug C-cleanup Category: This is a cleanup P-High 🔥 I-hang A Zebra component stops responding to requests I-integration-fail Continuous integration fails, including build and test failures A-diagnostics Area: Diagnosing issues or monitoring performance A-cryptography Area: Cryptography related labels Jul 5, 2022
@teor2345 teor2345 self-assigned this Jul 5, 2022
@codecov
Copy link

codecov bot commented Jul 5, 2022

Codecov Report

Merging #4750 (14d8e78) into main (61eeeb0) will increase coverage by 0.14%.
The diff coverage is 88.77%.

@@            Coverage Diff             @@
##             main    #4750      +/-   ##
==========================================
+ Coverage   78.86%   79.00%   +0.14%     
==========================================
  Files         306      305       -1     
  Lines       37664    37897     +233     
==========================================
+ Hits        29704    29941     +237     
+ Misses       7960     7956       -4     

@teor2345 teor2345 changed the title fix(batch): Improve error and channel limit handling in tower-batch fix(batch): Improve error and channel limit handling in tower-batch, run all verifier cryptography on blocking threads Jul 5, 2022
@teor2345 teor2345 changed the title fix(batch): Improve error and channel limit handling in tower-batch, run all verifier cryptography on blocking threads fix(batch): Improve batch verifier correctness and performance Jul 5, 2022
@teor2345 teor2345 changed the title fix(batch): Improve batch verifier correctness and performance fix(batch): Improve batch verifier async, correctness, and performance Jul 5, 2022
@teor2345 teor2345 changed the base branch from main to fix-redundant-halo2 July 5, 2022 23:59
@teor2345
Copy link
Contributor Author

teor2345 commented Jul 6, 2022

I'm running a full sync here, to see if these fixes help:
https://github.com/ZcashFoundation/zebra/actions/runs/2619584632

Base automatically changed from fix-redundant-halo2 to main July 6, 2022 14:11
@teor2345 teor2345 added P-High 🔥 and removed C-cleanup Category: This is a cleanup A-diagnostics Area: Diagnosing issues or monitoring performance P-Medium ⚡ labels Jul 7, 2022
@teor2345 teor2345 mentioned this pull request Jul 10, 2022
31 tasks
@teor2345

This comment was marked as outdated.

@teor2345 teor2345 requested review from conradoplg and removed request for a team July 13, 2022 21:58
@teor2345 teor2345 removed request for a team July 15, 2022 00:50
Copy link
Contributor

@oxarbitrage oxarbitrage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks ok but i am not approving yet in case you want to do #4776 on top of this branch first. Feel free to admin merge or let me know how you will want to handle it and ill approve where needed.

@teor2345 teor2345 merged commit 9b9cd55 into main Jul 17, 2022
@teor2345 teor2345 deleted the batch-fixes branch July 17, 2022 22:41
@teor2345
Copy link
Contributor Author

I admin-merged this separately from PR #4776, just in case we need to debug them as separate commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cryptography Area: Cryptography related C-bug Category: This is a bug I-hang A Zebra component stops responding to requests I-integration-fail Continuous integration fails, including build and test failures
Projects
None yet
2 participants