Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relay rate limits #906

Merged
merged 15 commits into from
Nov 19, 2024
Merged

Conversation

cody-littley
Copy link
Contributor

Why are these changes needed?

Adds server side rate limiting to the relay.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
@cody-littley cody-littley self-assigned this Nov 18, 2024
Signed-off-by: Cody Littley <cody@eigenlabs.org>
relay/config.go Outdated
Comment on lines 81 to 83
// MaxKeysPerGetChunksRequest is the maximum number of keys that can be requested in a single GetChunks request.
// Default is 1024. // TODO should this be the max batch size? What is that?
MaxKeysPerGetChunksRequest int
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ian-shim what is the maximum number of keys we should permit in a single request?

Copy link
Contributor

@ian-shim ian-shim Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can start with something like 512? It will probably hit the grpc limit anyways if it goes beyond that.
Actually let's leave this with 1024 and talk about this offline.

@@ -177,7 +177,7 @@ func buildChunkStore(t *testing.T, logger logging.Logger) (chunkstore.ChunkReade

func randomBlob(t *testing.T) (*v2.BlobHeader, []byte) {

data := tu.RandomBytes(128)
data := tu.RandomBytes(225) // TODO talk to Ian about this
Copy link
Contributor Author

@cody-littley cody-littley Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ian-shim a blob size of 128 bytes results in 5 commitment length of 5, which the method to determine the length of a chunk says is illegal because it's not a power of 2. Can you explain this to me?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a requirement v2 started imposing - to force the blob length to be a power of 2. GetCommitments method should be refactored so that it automatically outputs a length in power of 2. @mooselumph is working on that.

Signed-off-by: Cody Littley <cody@eigenlabs.org>
@cody-littley cody-littley changed the title Relay rate limits2 Relay rate limits Nov 18, 2024
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
relay/config.go Outdated
MaxKeysPerGetChunksRequest int

// RateLimits contains configuration for rate limiting.
RateLimits limiter.Config
}

// DefaultConfig returns the default configuration for the relay Server.
func DefaultConfig() *Config {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we move this to test file? actual default values for optional fields can be set in flags.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved

relay/config.go Outdated
Comment on lines 81 to 83
// MaxKeysPerGetChunksRequest is the maximum number of keys that can be requested in a single GetChunks request.
// Default is 1024. // TODO should this be the max batch size? What is that?
MaxKeysPerGetChunksRequest int
Copy link
Contributor

@ian-shim ian-shim Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can start with something like 512? It will probably hit the grpc limit anyways if it goes beyond that.
Actually let's leave this with 1024 and talk about this offline.

return fmt.Errorf("global concurrent request limit exceeded for getBlob operations, try again later")
}

allowed := l.opLimiter.AllowN(now, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this different from opLimiter.Allow()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference is that the time is supplied from the outside context. This allows me to write unit tests that aren't wall clock dependent.

Comment on lines 53 to 57
countInFlight := l.operationsInFlight.Add(1)
if countInFlight > int64(l.config.MaxConcurrentGetBlobOps) {
l.operationsInFlight.Add(-1)
return fmt.Errorf("global concurrent request limit exceeded for getBlob operations, try again later")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this whole sequence be atomic? otherwise, can operationsInFlight go over MaxConcurrentGetBlobOps and potentially reject valid operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation where this logic may fail is if the request rate is so high that there are always MaxConcurrentGetBlobOps or more goroutines that happen to be inside this one block of code, which could in theory prevent any actual operations from being served.

Although such a scenario is unlikely, probably better to err on the side of safety. This now uses a mutex to guarantee atomicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing around with this a little, the code gets really ugy when I try and to low resolution locking. Methods in this class (as well as ChunkRateLimiter) are just doing basic arithmetic... let's just use a single mutext to protect the entire class for the sake of simplicity.

@@ -177,7 +177,7 @@ func buildChunkStore(t *testing.T, logger logging.Logger) (chunkstore.ChunkReade

func randomBlob(t *testing.T) (*v2.BlobHeader, []byte) {

data := tu.RandomBytes(128)
data := tu.RandomBytes(225) // TODO talk to Ian about this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a requirement v2 started imposing - to force the blob length to be a power of 2. GetCommitments method should be refactored so that it automatically outputs a length in power of 2. @mooselumph is working on that.

"too many chunk requests provided, max is %d", s.config.MaxKeysPerGetChunksRequest)
}

// Future work: client IDs will be fixed when authentication is implemented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: // TODO(cody-littley): ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

// Future work: client IDs will be fixed when authentication is implemented
clientID := fmt.Sprintf("%d", request.RequesterId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we allow unauthenticated traffic or all requests need to be signed (for MVP)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan was to only permit authenticated traffic for GetChunks(). Is there any use case for allowing unauthenticated traffic? I can add it if you think it would be helpful.


if req.GetByIndex() != nil {
key := v2.BlobKey(req.GetByIndex().GetBlobKey())
metadata = mMap[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if the key exists in the map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not be possible, since process will fail before this point if there is missing metadata. But it's a useful safety check, so I don't mind adding it.

relay/server.go Show resolved Hide resolved
relay/metadata_provider.go Show resolved Hide resolved
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Copy link
Contributor

@ian-shim ian-shim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

}

if l.globalOperationsInFlight >= l.config.MaxConcurrentGetChunkOps {
return fmt.Errorf("global concurrent request limit exceeded for GetChunks operations, try again later")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it would be useful to include the values (globalOperationsInFlight & MaxConcurrentGetChunkOps) in the error messages

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to this error message and all similar error messages

return fmt.Errorf("global rate limit exceeded for GetChunk bandwidth, try again later")
}

allowed = l.perClientBandwidthLimiter[requesterID].AllowN(now, bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know l.perClientBandwidthLimiter[requesterID] should exist because this method always gets called after BeginGetChunkOperation, but this library probably shouldn't assume that. Let's check if l.perClientBandwidthLimiter[requesterID] exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added sanity check

@@ -153,8 +155,15 @@ func (m *metadataProvider) fetchMetadata(key v2.BlobKey) (*blobMetadata, error)
}
}

blobSize := uint32(cert.BlobHeader.BlobCommitments.Length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad. I meant BlobMetadata. Otherwise, we can also multiply Length by BYTES_PER_SYMBOL, but this will overestimate the size

@cody-littley cody-littley merged commit afd5894 into Layr-Labs:master Nov 19, 2024
6 checks passed
@cody-littley cody-littley deleted the relay-rate-limits2 branch November 19, 2024 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants