Skip to content

Conversation

@gacevicljubisa
Copy link
Member

@gacevicljubisa gacevicljubisa commented Nov 13, 2025

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Fixes a race condition in pkg/file/joiner that caused CI test timeouts. Refactored decoderCache.GetOrCreate to prevent deadlocks by correctly managing mutex locks around getter.New calls.

Makefile: Filter linker warnings and disable test caching for test-ci-race

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

@gacevicljubisa gacevicljubisa requested a review from janos November 13, 2025 10:51
@gacevicljubisa gacevicljubisa marked this pull request as ready for review November 13, 2025 10:52
@gacevicljubisa gacevicljubisa changed the base branch from feat/soc-dispersed-v3 to master November 13, 2025 10:58
var eg errgroup.Group
j.readAtOffset(buffer, j.rootData, 0, j.span, off, 0, readLen, &bytesRead, j.rootParity, &eg)
eg, ectx := errgroup.WithContext(j.ctx)
j.readAtOffset(buffer, j.rootData, 0, j.span, off, 0, readLen, &bytesRead, j.rootParity, eg, ectx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ectx==ctx and so therefore it is not needed to add it to the signature, since readAtOffset can just select on the struct field

return ectx.Err()
default:
}
ch, err := g.Get(ectx, addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Get method already gets passed the context, shouldn't/wouldn't it better that it just checks whether the context expired before+after doing anything? this way you get better coverage of the whole codebase instead of having to dot every usage of the method elsewhere.

recovery := func() storage.Getter {
g.config.Logger.Debug("lazy-creating recovery decoder after fetch failed", "key", key)

if d, ok := g.getFromCache(key); ok && d != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since getter.New should return almost immediately, wouldn't it be better to just lock once and do both the get and the set, instead of locking twice, checking twice and potentially calling (and allocating) in vain to getter.New?
Bear in mind also that getter.New has side-effects as it starts a whole set of goroutines with go d.prefetch(). It would be good to avoid it if it's not necessary.

}

key := fingerprint(addrs)
g.mu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my 2-cents: since one mutex lock-unlock cycle turns into multiple, the chances of having race conditions increases. Since multiple callers may have now access to the same critical section at the same time, this might cause other problems elsewhere down the line (once some of the interleaving logic changes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants