Skip to content

Conversation

@gacevicljubisa
Copy link
Member

@gacevicljubisa gacevicljubisa commented Nov 4, 2025

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Fix deadlock issue in BMT (Binary Merkle Tree) hasher that was causing TestJoinerRedundancyMultilevel to timeout after 10 minutes.

The issue was caused by:

  1. Unbuffered result channel blocking when multiple goroutines try to send
  2. Race condition where multiple goroutines could reach root and attempt to send result simultaneously
  3. With large files (16384 chunks), many concurrent goroutines increased the probability of deadlock

Solution:

  • Buffer result channel (size 1) to allow at least one send to complete
  • Use sync.Once to ensure only one result is sent even if multiple goroutines reach the root
  • Use non-blocking select when sending to result channel
  • Properly reset sync.Once in Reset() method for hasher reuse

Fixes test timeout in pkg/file/joiner/joiner_test.go:TestJoinerRedundancyMultilevel

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

Fix deadlock issue in BMT (Binary Merkle Tree) hasher that was causing
TestJoinerRedundancyMultilevel to timeout after 10 minutes.

The issue was caused by:
1. Unbuffered result channel blocking when multiple goroutines try to send
2. Race condition where multiple goroutines could reach root and attempt
   to send result simultaneously
3. With large files (16384 chunks), many concurrent goroutines increased
   the probability of deadlock

Solution:
- Buffer result channel (size 1) to allow at least one send to complete
- Use sync.Once to ensure only one result is sent even if multiple
  goroutines reach the root
- Use non-blocking select when sending to result channel
- Properly reset sync.Once in Reset() method for hasher reuse

Fixes test timeout in pkg/file/joiner/joiner_test.go:TestJoinerRedundancyMultilevel
@acud
Copy link
Contributor

acud commented Nov 20, 2025

@gacevicljubisa could you explain the deadlock to me? I read the PR description but I can't see how does the BMT hasher fit the problem description:

  1. If the BMT hasher hashes just one chunk, how does have a large file with multiple chunks affect the hasher? The hasher is called per chunk. The hashing of a joined file is handled in the hashing pipeline iirc
  2. in bmt.go, as far as I can see, only one goroutine is allowed to write into the result channel - L102 on master (since this is the only call to processSection with final==true). The only way that you may get a deadlock here is if Hash is called twice on the same data at the same time.
  3. In case this fix is indeed still necessary, I think that the sync.Once can be avoided here. If you make the channel have a buffer of 1, it means that the first write will always succeed. In which case, later attempts to write into the channel could just have a default case in a select (if the first write was successful - then there's something in the channel and we can return safely since one write has happened. if it was already read - we can write again, regardless if it would be read again). Also, since the BMT hashers are pooled, you must drain that channel before putting it back to the pool, otherwise if there was a value in the channel it may result in a corrupted bmt hash for a later chunk reusing that hasher.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants