-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for dynamic queue sizes #1985
Conversation
Two extra comments:
$ go run -tags ui ./cmd/all-in-one/main.go --collector.dyn-queue-size-warmup 1000 --memory.max-traces 10000
$ go run ./cmd/tracegen/main.go -workers 10 -duration 5m There should be both a change recorded in the log and in the metric |
cmd/collector/app/span_processor.go
Outdated
@@ -130,6 +154,11 @@ func (sp *spanProcessor) saveSpan(span *model.Span) { | |||
sp.metrics.SaveLatency.Record(time.Since(startTime)) | |||
} | |||
|
|||
func (sp *spanProcessor) countSpan(span *model.Span) { | |||
atomic.AddUint64(&sp.bytesProcessed, uint64(span.Size())) | |||
atomic.AddUint64(&sp.spansProcessed, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to make sure these fields are properly aligned. Consider using uber-go/atomic instead, which we already depend on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I marked some discussions as resolved based on a code that I have locally. I'll push it tomorrow, but wanted to release the comments to other items before that.
cmd/collector/app/span_processor.go
Outdated
dynQueueSizeMemory: options.dynQueueSizeMemory, | ||
} | ||
|
||
processSpanFuncs := []ProcessSpan{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the order matter? I don't think it would, but it's worth asking...
Codecov Report
@@ Coverage Diff @@
## master #1985 +/- ##
=========================================
Coverage ? 97.43%
=========================================
Files ? 207
Lines ? 10314
Branches ? 0
=========================================
Hits ? 10049
Misses ? 222
Partials ? 43
Continue to review full report at Codecov.
|
Perhaps someone else could review this one? @vprithvi ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot escape the feeling that we're introducing a substantial amount of code for something that only provides an approximate solution to the problem "I want my queue to use at most N bytes". Would it really be much more difficult to write a manual queue (without using a bound channel) that has the precise upper bound on the total memory consumption? E.g. we start with a circular buffer of some configurable initial capacity, and just count how many bytes are stored in it using Span.Size(). If we need to store more spans while total memory limit is not reached, we double the buffer capacity. If we try to append a span that exceeds the cumulative memory size, then it's time to drop it. We use sync.Cond to notify consuming goroutines of new spans available.
cf #2013 |
@pavolloffay had the same suggestion last week. We could do some tests, but I think this would need synchronization in a few extra places than just around the two counters from this PR. In the end, it might be worth it, in the name of precision. My original idea though wasn't to provide something that consumes at most X bytes, but to provide something with a limit of approximately X bytes. Given that the queue would be empty most of the time, this is a good compromise between precision and performance. In any case, I see you opened a PR with your suggestion, and given that two maintainers are aligned on the idea that the linked list is a better approach, I'm closing this one. |
Reopening, as discussed in #2013 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR updated, addressing the last comments from @yurishkuro
@yurishkuro, do you think it's worth merging this, and later change the code to use the new queue once its ready? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only concern with merging it ahead of true memory bound queue is that it will introduce the warm-up flag which won't be applicable later. We already have queue length flag, maybe it can serve two purposes? Is is really important for the user to control the warm-up length?
@@ -58,9 +58,13 @@ type SpanProcessorMetrics struct { | |||
InQueueLatency metrics.Timer | |||
// SpansDropped measures the number of spans we discarded because the queue was full | |||
SpansDropped metrics.Counter | |||
// SpansBytes records how many bytes were processed | |||
SpansBytes metrics.Gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the comment sounds like this should be a counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like it, but I'm keeping track of this in a local var, and this metric reflects the value of that var.
Great question. We took the presence of the warmup value as the hint to activate this feature, and calculated a reasonable default for the queue size in MiB, so that people would only have to set the warmup flag. Now that we removed the default for the queue size in MiB, users have to set it anyway, so, we could activate the feature based on that instead. I'm OK in leaving the warmup flag out for now, and see if people complain about the fixed warmup value (= queue size). |
cmd/collector/app/span_processor.go
Outdated
} | ||
|
||
// resizing is a costly operation, we only perform it if we are at least 20% apart from the desired value | ||
if diff > 1.2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment, although consider moving 1.2
into a meaningfully named const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny, I find the current way more readable. If others agree with you, I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think raw number is easier to read, but the trade-off is that named constants at the top of the file hint at available parameterization, which is otherwise hard to spot.
cmd/collector/app/span_processor.go
Outdated
@@ -27,6 +29,9 @@ import ( | |||
"github.com/jaegertracing/jaeger/storage/spanstore" | |||
) | |||
|
|||
// if this proves to be too low, we can increase it | |||
const maxQueueSize = 1_000_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is still experimental, would it help to make this a "default" max queue size which is overridable via a collector option/configuration; much like java's JVM max heap size configurability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could allow customization of this value, but I'd do it only if users ask. The queue is supposed to be close to empty most of the time, and we need a hard cap only to ensure we don't get wild values by a bug somewhere.
PR updated, addressing the recent comments. |
Final test:
In another terminal:
After a while, this is shown in the collector's logs:
And here are the new metrics:
|
* Introduces the new flags: collector.dyn-queue-size-warmup and collector.dyn-queue-size-memory. * When the warm-up flag is set, the span processor starts a new background process to recalculate the queue capacity at regular intervals. * The span processor now keeps track of the number of spans it has seen, as well as the total bytes it processed. That information allows to derive the average spans size that the collector has processed. Along with the memory to allocate, we can come up with the ideal queue size. * New metrics: jaeger_collector_queue_capacity and jaeger_collector_spans_bytes. This commit shouldn't cause performance degradation for instances not using the new options, except for a small memory overhead due to the new span tracker (quantity/size) and the new metrics. Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Which problem is this PR solving?
Short description of the changes
collector.dyn-queue-size-warmup
andcollector.dyn-queue-size-memory
.jaeger_collector_queue_capacity
andjaeger_collector_spans_bytes
.This commit shouldn't cause performance degradation for instances not using the new options, except for a small memory overhead due to the new span tracker (quantity/size) and the new metrics.
Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de