use circular buffer in FiniteReplayProvider #23
Conversation
Uses `NewFiniteReplayProvider()` to instantiate the provider instead of a direct literal. This allows catching count misconfiguration at instantiation (typically application startup) instead of as a later panic. Synchronises access to the underlying buffer and the head and tail state using an `RWMutex`.

If you don't like the `NewFiniteReplayProvider`/error approach, it should be removed and made consistent with `ValidReplayProvider` instead. Likewise, I don't think that these changes should be released/merged without consistently switching to errors instead of panics.
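As a rough sketch of the constructor-with-error approach described above (the stand-in type and the exact signature are assumptions for illustration, not the library's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// FiniteReplayProvider is a minimal stand-in for the real type; only
// what's needed to illustrate the constructor is shown.
type FiniteReplayProvider struct {
	count int
	buf   []string // the real provider stores messages, not strings
}

// NewFiniteReplayProvider validates the buffer size up front, so a
// misconfiguration surfaces as an error at startup instead of as a
// panic deep into the program's runtime.
func NewFiniteReplayProvider(count int) (*FiniteReplayProvider, error) {
	if count < 1 {
		return nil, errors.New("count must be at least 1")
	}

	return &FiniteReplayProvider{count: count, buf: make([]string, 0, count)}, nil
}

func main() {
	if _, err := NewFiniteReplayProvider(0); err != nil {
		fmt.Println("caught at startup:", err)
	}
}
```

The caller decides what to do with the error – typically failing fast during application startup, where the misconfiguration is cheap to diagnose.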
Thank you for the PR!
This is done inside the provider, but replay providers are not required to be thread safe. There's no need to change that yourself, though. I wanted to have this code mainly as a reference for benchmarks and tests. Besides, the …

My take on this would be that these panics are caused by programming issues rather than normal execution, which means they are bugs. For example, dispatching a message without any topic.

There is a case to be made about errors related to topics – while in most scenarios I do think they would be provided statically, I can imagine some where they would be pulled from a data source. I'd say that even then, if the data source gives nothing back, it means the data is corrupted and that its errors were not handled. Though it would be prudent to not crash the whole program in these cases. Not panicking would allow for better error handling in unusual cases, allow for custom error logging and, most importantly, not crash the program. What do you think? Have you encountered in your usage a scenario where you'd have preferred errors to be returned instead?
Totally agree. The changes will be cohesive.
👍 Sounds good!
Buffer sizes are often something I would make configurable through, for example, environment variables. One could argue that it's still a bug as the programmer failed to properly validate the input 🤷 but it's definitely not a static one. Likewise, the topics could be derived from the data the message represents, so dynamic rather than static. That's definitely the case for me: https://github.com/ttab/elephant-repository/blob/main/repository/sse.go#L234-L239
I always prefer errors :) and never expect code to panic. If it does, it should be from programmer errors, like forgetting to nil-check some input and causing a nil panic – unforeseen errors – but almost never from writing code that explicitly calls `panic()`. Returning errors clearly communicates that a call can fail, and allows the caller to decide how they want to act when that happens. In my use-case I would probably just log it and move on with things; for somebody else that might be a show-stopper. Errors both clearly tell the caller that that's a decision they need to make, and make it easy for them to do so.
You're right here – the argument falls flat whenever there's a new layer introduced – be it some other code or an external configuration source. Otherwise pretty much every error happens because somebody didn't validate something somewhere 😃
That's a better definition of "programming errors" than what I had in mind. I think I'm convinced on this – I'll adapt the API to use errors and constructors. Constructors also simplify the library code, as no init logic would need to be done anymore. Given that this is a pretty big set of changes I'll probably roll them out across multiple versions (something like …). Thank you for your input and time!
Great! That sounds like a sensible & responsible plan 👍
I've decided to try and merge this PR – even though it is on master, the API consistency wouldn't matter, as I won't tag a new version. I'd also wish to have your merits and efforts recognized – by merging this you'd appear as a contributor.

It would be of tremendous help if you have the time and will to look over the comments I've left. If not, it's no biggie – these are just as well notes to myself and no time is wasted.
With these changes the following content should be added to `CHANGELOG.md`:

```markdown
## Unreleased

### Removed

- `FiniteReplayProvider.{Count, AutoIDs}` – use the constructor instead

### Added

- `NewFiniteReplayProvider` constructor

### Fixed

- `FiniteReplayProvider` doesn't leak memory anymore and respects the stored
  message count it was given. Previously, when a new message was put after the
  message count was reached and some other messages were removed, the total
  message count would grow unexpectedly and `FiniteReplayProvider` would store
  and replay more events than it was configured to.
```

It doesn't have to be copied verbatim – if you have a better description, I'm open to it!
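For reference, the head/tail bookkeeping that the fix describes can be sketched as a plain circular buffer (illustrative code, not the library's implementation):

```go
package main

import "fmt"

// ring is a minimal circular buffer mirroring the head/tail approach:
// tail is the next write slot, head is the oldest retained entry.
type ring struct {
	buf        []string
	head, tail int
	full       bool
}

func newRing(count int) *ring { return &ring{buf: make([]string, count)} }

// put overwrites the oldest entry once the buffer is full, so the
// stored message count never exceeds the configured capacity.
func (r *ring) put(v string) {
	r.buf[r.tail] = v
	r.tail = (r.tail + 1) % len(r.buf)

	if r.full {
		r.head = r.tail // oldest entry was just overwritten
	} else if r.tail == r.head {
		r.full = true
	}
}

// items returns the retained messages, oldest first.
func (r *ring) items() []string {
	if !r.full && r.head == r.tail {
		return nil
	}

	var out []string
	for i := r.head; ; i = (i + 1) % len(r.buf) {
		out = append(out, r.buf[i])
		if (i+1)%len(r.buf) == r.tail {
			break
		}
	}

	return out
}

func main() {
	r := newRing(3)
	for _, id := range []string{"4", "5", "6", "7"} {
		r.put(id)
	}

	fmt.Println(r.items()) // prints: [5 6 7]
}
```

Pushing a fourth event into a three-slot buffer evicts the oldest one, so the retained count never grows past the configured capacity – the behaviour the changelog entry describes.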
Co-authored-by: Teodor Maxim <57960185+tmaxmax@users.noreply.github.com>
`TestFiniteReplayProvider` had to be adjusted, as it assumed that the event with the ID 4 would still be present after pushing events {5,6,7} into a finite replay provider with a count of 3. `TestJoe_errors` is failing, but I'm a bit unsure as to what it's actually testing. Pushed a new commit.

Where is that test failing? I should be able to help with that.
The test verifies the behavior of `Joe` when the replay provider errors.

I think what happens here is that in the new implementation no replay happens once the client's last seen event has been evicted from the buffer. To fix this, the previous behavior of the internal buffer – which effectively kept the ID of the last removed event as a valid replay reference point – would have to be restored.
Yeah, that behaviour is a bit odd. Or, well, I guess it's there to provide playback for as long as it's actually possible, but it complicates the logic around replays a bit: we can replay as long as the last event is retained OR if it was the last event to get evicted. I wonder if it's worth that extra +1 of replay capability. It explains the behaviour in the test I changed – it didn't assume that more than 3 events were retained. I wonder if there's some convention around this behaviour (replay all vs. replay none) when it comes to implementations? And is there any way to communicate to the client that the last observed event is too old for replay? Either way, in the worst case, determining that we don't have the event retained would require a full loop through the buffer before we actually start sending stuff.
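The "retained OR last evicted" rule can be stated as a tiny helper (hypothetical names, just to make the tradeoff – and the worst-case full scan – concrete):

```go
package main

import "fmt"

// canReplay reports whether a client may resume from lastEventID:
// either that event is still retained, or it was the most recently
// evicted one – the "+1 of replay capability" discussed above.
// In the worst case this is a full scan of the buffer.
func canReplay(retained []string, lastRemovedID, lastEventID string) bool {
	if lastEventID == lastRemovedID {
		return true
	}

	for _, id := range retained {
		if id == lastEventID {
			return true
		}
	}

	return false
}

func main() {
	retained := []string{"5", "6", "7"} // count of 3; event 4 was evicted

	fmt.Println(canReplay(retained, "4", "4")) // true: last evicted event
	fmt.Println(canReplay(retained, "4", "3")) // false: too old to resume
}
```

Dropping the extra event means removing the `lastRemovedID` comparison, after which only IDs still present in the buffer allow a replay.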
Let's leave the code as it is for now. I'll do some research to see what other libraries – in and outside the Go ecosystem – implement. As a curiosity, given that you're using this replay provider, what was the behavior you would have expected, if you've thought of it before?

EDIT: Went through pretty much all libraries listed on Wikipedia and I could only find one single framework which handled replays (an SSE implementation for Django). This library saves events for 24 hours to some persistent storage... which is clearly out of scope for the core library. Let's drop that extra event for now – if there is demand for it, somebody is bound to open an issue 😃 It's also kind of arbitrary to save just one single event – there could be some old event IDs buffer (but at that point, just grow the provider's capacity) or a toggle like …
I've mostly been concerned about how to communicate to the client that the ID is too old to support resume. Found this issue: whatwg/html#8297. So maybe the best thing would be a hook or config option that allows sending a message with an application-specific "out of sync" event type.
Hm, this is indeed important. Perhaps there should be a specific kind of error for this. The API could look like:

```go
var ErrReplayInvalidID error

type Server struct {
	OnError func(r *http.Request, sub Subscription, err error) (newSub Subscription, retry bool)
}
```

With this hook I could also get rid of the …

(note to self: the sentinel errors are kind of piling up; there may be some potential in unifying them into a single error type)

What do you think? The solution tries to be as generic as possible because, given that this is not regulated by the spec, there is no one single way to do it. As discussed in the issue you've linked – thanks for finding that, btw! – some may prefer headers, some think messages are better and so on. I believe …
Looks good, and grouping the errors into a single type could allow some more convenient error checking. I'm quite partial to using error codes. It has the benefit of making the things you check (type and code) constants instead of variables, and linters can warn if you're doing a non-exhaustive switch:

```go
type ErrorCode string

const (
	ErrorCodeInvalidID ErrorCode = "invalid_id"
	ErrorCodeNoTopic   ErrorCode = "no_topic"
)

func NewError(code ErrorCode, text string) *Error {
	return &Error{
		text: text,
		Code: code,
	}
}

func NewErrorf(code ErrorCode, format string, a ...any) *Error {
	err := fmt.Errorf(format, a...)

	return &Error{
		text:  err.Error(),
		cause: errors.Unwrap(err),
		Code:  code,
	}
}

type Error struct {
	text  string
	cause error
	Code  ErrorCode
}

func (e *Error) Error() string {
	return e.text
}

func (e *Error) Unwrap() error {
	return e.cause
}

func IsErrorWithCode(err error, code ErrorCode) bool {
	var e *Error
	return errors.As(err, &e) && e.Code == code
}

func main() {
	err := NewError(ErrorCodeInvalidID, "dunno what that is")
	if IsErrorWithCode(err, ErrorCodeInvalidID) {
		println("is invalid ID")
	} else if err != nil {
		panic("waah!")
	}

	var sseErr *Error
	if !errors.As(err, &sseErr) {
		panic("unknown error!")
	}

	switch sseErr.Code {
	case ErrorCodeInvalidID:
		println("switch thinks this is an invalid id")
	case ErrorCodeNoTopic:
		println("switch thinks this is something that lacks a topic")
	}
}
```

...it opens up to both including root causes and adding fields carrying relevant information about the error, like the event that triggered the error.
Do you want me to make any more changes to the PR? For example, the last-deleted-ID feature is a bit up in the air at the moment. Should I add support for it, or do you want to merge & maybe work on removing it completely before the next release?
I'll remove that feature for the moment. If you have the resources, could you add the suggested test in the code review? |
Of course. I've never done any allocation checking in tests before, so it'll be a good exercise. Though I don't know if it'll explicitly catch memory leaks, as we only track allocations, not deallocations. An infinitely growing slice would definitely show up though, so 👍
We don't really have a way to track deallocs, given that those are the GC's responsibility and it decides when to do them. Besides, the replay provider is usually part of a long-running service, so the deallocation won't really ever happen. What is important to track is, as you said, that the slice doesn't grow.
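For the allocation-checking exercise, Go's standard library already provides a helper, `testing.AllocsPerRun`. A minimal sketch of such a check (the `put` function below is illustrative, not the provider's code):

```go
package main

import (
	"fmt"
	"testing"
)

// put writes into a fixed-size buffer at tail, circular-buffer style.
func put(buf []int, tail, v int) int {
	buf[tail] = v
	return (tail + 1) % len(buf)
}

func main() {
	buf := make([]int, 3)
	tail := 0

	// testing.AllocsPerRun reports the average number of heap
	// allocations per call to the function. A steady-state circular
	// buffer should not allocate at all on put, while an ever-growing
	// slice would show up as a non-zero figure here.
	allocs := testing.AllocsPerRun(1000, func() {
		tail = put(buf, tail, 42)
	})

	fmt.Println("allocs per put:", allocs) // prints: allocs per put: 0
}
```

As noted above, this tracks allocations rather than leaks, but an unbounded buffer would fail such a check immediately.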
Done! I can see why you chose to add a `lastRemovedID`. When working with small buffers in tests it feels strange that the first item can only be used as a reference point for replaying the items that follow it. But that's less of a thing with more realistic replay counts.
Just some very minor stuff – about three lines of code. Very good work!
Co-authored-by: Teodor Maxim <57960185+tmaxmax@users.noreply.github.com>
update test to check NewFiniteReplayProvider() count error case.
This is to fix CI.
That graph looks really good now! I assume the performance of the new version appears in the graph after 8 AM.

Yep! 😄 I fixed the linting and test errors.
**Codecov Report**

Attention: Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master      #23    +/-  ##
=========================================
- Coverage   93.10%   89.31%   -3.80%
=========================================
  Files          14       14
  Lines        1117     1160      +43
=========================================
- Hits         1040     1036       -4
- Misses         56       94      +38
- Partials       21       30       +9
```

☔ View full report in Codecov by Sentry.
Uses `NewFiniteReplayProvider()` to instantiate the provider instead of a direct literal. This allows catching count misconfiguration at instantiation (typically application startup) instead of as a later panic. In my opinion all panics should be replaced with returned errors; all uses of `panic` I looked at in the library are in the "expected errors" category. Keeping the panic behaviour in `Put()`, as changing that is very much out of scope.

Synchronises access to the underlying buffer and the head and tail state using an `RWMutex`. I haven't looked closely at the concurrency handling in the library as a whole, so if you're already serialising access to the replay providers this doesn't add any value. If not, a similar mutex should be added to `ValidReplayProvider`.