Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for dynamic queue sizes #1985

Merged
merged 1 commit into from
Jan 23, 2020
Merged

Added support for dynamic queue sizes #1985

merged 1 commit into from
Jan 23, 2020

Conversation

jpkrohling
Copy link
Contributor

Which problem is this PR solving?

Short description of the changes

  • Introduces the new flags: collector.dyn-queue-size-warmup and collector.dyn-queue-size-memory.
  • When the warm-up flag is set, the span processor starts a new background process to recalculate the queue capacity at regular intervals.
  • The span processor now keeps track of the number of spans it has seen, as well as the total bytes it processed. That information allows to derive the average spans size that the collector has processed. Along with the memory to allocate, we can come up with the ideal queue size.
  • New metrics: jaeger_collector_queue_capacity and jaeger_collector_spans_bytes.

This commit shouldn't cause performance degradation for instances not using the new options, except for a small memory overhead due to the new span tracker (quantity/size) and the new metrics.

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de

@jpkrohling
Copy link
Contributor Author

Two extra comments:

  1. I removed CGO_DISABLED=0, as I need CGO to get the amount of available memory. I couldn't figure out why it has been set this way, and it seems to be introduced by Add standalone Dockerfile that runs agent, collector, and query #69 without further comment. If there's a reason to have CGO disabled, let me know and I'll try to come up with another way to obtain the amount of memory available (no promises, as I did some research already and the other options seem worse than the current approach)
  2. This can be tested with:
$ go run -tags ui ./cmd/all-in-one/main.go  --collector.dyn-queue-size-warmup 1000 --memory.max-traces 10000
$ go run ./cmd/tracegen/main.go -workers 10 -duration 5m

There should be both a change recorded in the log and in the metric jaeger_collector_queue_capacity.

Makefile Outdated Show resolved Hide resolved
cmd/collector/app/options.go Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
cmd/collector/app/options_test.go Outdated Show resolved Hide resolved
@@ -130,6 +154,11 @@ func (sp *spanProcessor) saveSpan(span *model.Span) {
sp.metrics.SaveLatency.Record(time.Since(startTime))
}

func (sp *spanProcessor) countSpan(span *model.Span) {
atomic.AddUint64(&sp.bytesProcessed, uint64(span.Size()))
atomic.AddUint64(&sp.spansProcessed, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to make sure these fields are properly aligned. Consider using uber-go/atomic instead, which we already depend on.

cmd/collector/app/span_processor.go Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor_test.go Outdated Show resolved Hide resolved
pkg/queue/bounded_queue.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I marked some discussions as resolved based on a code that I have locally. I'll push it tomorrow, but wanted to release the comments to other items before that.

cmd/collector/app/options_test.go Outdated Show resolved Hide resolved
cmd/collector/app/options.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
dynQueueSizeMemory: options.dynQueueSizeMemory,
}

processSpanFuncs := []ProcessSpan{}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the order matter? I don't think it would, but it's worth asking...

cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Show resolved Hide resolved
pkg/queue/bounded_queue.go Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jan 7, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f899363). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1985   +/-   ##
=========================================
  Coverage          ?   97.43%           
=========================================
  Files             ?      207           
  Lines             ?    10314           
  Branches          ?        0           
=========================================
  Hits              ?    10049           
  Misses            ?      222           
  Partials          ?       43
Impacted Files Coverage Δ
cmd/collector/app/builder/builder_flags.go 100% <100%> (ø)
cmd/collector/app/span_processor.go 100% <100%> (ø)
cmd/collector/app/metrics.go 100% <100%> (ø)
cmd/collector/app/builder/span_handler_builder.go 100% <100%> (ø)
cmd/collector/app/options.go 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f899363...012056c. Read the comment docs.

@jpkrohling jpkrohling changed the title Added support for dynamic queue sizes 🎅🎄 Added support for dynamic queue sizes Jan 7, 2020
@jpkrohling jpkrohling changed the title Added support for dynamic queue sizes WIP - Added support for dynamic queue sizes Jan 7, 2020
@jpkrohling
Copy link
Contributor Author

jpkrohling commented Jan 7, 2020

Marked as WIP, as it depends on #2007 and #2008.

@jpkrohling jpkrohling changed the title WIP - Added support for dynamic queue sizes Added support for dynamic queue sizes Jan 9, 2020
@jpkrohling jpkrohling requested a review from yurishkuro January 9, 2020 13:55
@jpkrohling
Copy link
Contributor Author

Perhaps someone else could review this one? @vprithvi ?

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot escape the feeling that we're introducing a substantial amount of code for something that only provides an approximate solution to the problem "I want my queue to use at most N bytes". Would it really be much more difficult to write a manual queue (without using a bound channel) that has the precise upper bound on the total memory consumption? E.g. we start with a circular buffer of some configurable initial capacity, and just count how many bytes are stored in it using Span.Size(). If we need to store more spans while total memory limit is not reached, we double the buffer capacity. If we try to append a span that exceeds the cumulative memory size, then it's time to drop it. We use sync.Cond to notify consuming goroutines of new spans available.

cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
@yurishkuro
Copy link
Member

cf #2013

@jpkrohling
Copy link
Contributor Author

E.g. we start with a circular buffer of some configurable initial capacity, and just count how many bytes are stored in it using Span.Size().

@pavolloffay had the same suggestion last week. We could do some tests, but I think this would need synchronization in a few extra places than just around the two counters from this PR. In the end, it might be worth it, in the name of precision. My original idea though wasn't to provide something that consumes at most X bytes, but to provide something with a limit of approximately X bytes. Given that the queue would be empty most of the time, this is a good compromise between precision and performance.

In any case, I see you opened a PR with your suggestion, and given that two maintainers are aligned on the idea that the linked list is a better approach, I'm closing this one.

@jpkrohling jpkrohling closed this Jan 13, 2020
@jpkrohling
Copy link
Contributor Author

Reopening, as discussed in #2013

@jpkrohling jpkrohling reopened this Jan 17, 2020
Copy link
Contributor Author

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR updated, addressing the last comments from @yurishkuro

cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
@jpkrohling
Copy link
Contributor Author

@yurishkuro, do you think it's worth merging this, and later change the code to use the new queue once its ready?

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only concern with merging it ahead of true memory bound queue is that it will introduce the warm-up flag which won't be applicable later. We already have queue length flag, maybe it can serve two purposes? Is is really important for the user to control the warm-up length?

@@ -58,9 +58,13 @@ type SpanProcessorMetrics struct {
InQueueLatency metrics.Timer
// SpansDropped measures the number of spans we discarded because the queue was full
SpansDropped metrics.Counter
// SpansBytes records how many bytes were processed
SpansBytes metrics.Gauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the comment sounds like this should be a counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like it, but I'm keeping track of this in a local var, and this metric reflects the value of that var.

cmd/collector/app/span_processor.go Outdated Show resolved Hide resolved
@jpkrohling
Copy link
Contributor Author

The only concern with merging it ahead of true memory bound queue is that it will introduce the warm-up flag which won't be applicable later. We already have queue length flag, maybe it can serve two purposes? Is is really important for the user to control the warm-up length?

Great question. We took the presence of the warmup value as the hint to activate this feature, and calculated a reasonable default for the queue size in MiB, so that people would only have to set the warmup flag. Now that we removed the default for the queue size in MiB, users have to set it anyway, so, we could activate the feature based on that instead.

I'm OK in leaving the warmup flag out for now, and see if people complain about the fixed warmup value (= queue size).

cmd/collector/app/builder/builder_flags.go Outdated Show resolved Hide resolved
}

// resizing is a costly operation, we only perform it if we are at least 20% apart from the desired value
if diff > 1.2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment, although consider moving 1.2 into a meaningfully named const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, I find the current way more readable. If others agree with you, I'll change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think raw number is easier to read, but the trade-off is that named constants at the top of the file hint at available parameterization, which is otherwise hard to spot.

@@ -27,6 +29,9 @@ import (
"github.com/jaegertracing/jaeger/storage/spanstore"
)

// if this proves to be too low, we can increase it
const maxQueueSize = 1_000_000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is still experimental, would it help to make this a "default" max queue size which is overridable via a collector option/configuration; much like java's JVM max heap size configurability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could allow customization of this value, but I'd do it only if users ask. The queue is supposed to be close to empty most of the time, and we need a hard cap only to ensure we don't get wild values by a bug somewhere.

cmd/collector/app/span_processor_test.go Show resolved Hide resolved
cmd/collector/app/span_processor_test.go Outdated Show resolved Hide resolved
@jpkrohling
Copy link
Contributor Author

PR updated, addressing the recent comments.

cmd/collector/app/builder/builder_flags.go Outdated Show resolved Hide resolved
@jpkrohling
Copy link
Contributor Author

Final test:

$ go run -tags ui ./cmd/all-in-one/main.go --collector.queue-size-memory 100 --memory.max-traces 10000
...
...
{"level":"info","ts":1579770160.554552,"caller":"app/span_processor.go:130","msg":"Dynamically adjusting the queue size at runtime.","memory-mib":100,"queue-size-warmup":2000}
...
...

In another terminal:

$ go run ./cmd/tracegen/main.go -workers 10 -duration 5m

After a while, this is shown in the collector's logs:

{"level":"info","ts":1579770220.5560262,"caller":"app/span_processor.go:272","msg":"Resizing the internal span queue","new-size":335008,"average-span-size-bytes":313}

And here are the new metrics:

# HELP jaeger_collector_queue_capacity queue-capacity
# TYPE jaeger_collector_queue_capacity gauge
jaeger_collector_queue_capacity{host="caju"} 335008

# HELP jaeger_collector_spans_bytes spans.bytes
# TYPE jaeger_collector_spans_bytes gauge
jaeger_collector_spans_bytes{host="caju"} 1.687953831e+09

* Introduces the new flags: collector.dyn-queue-size-warmup and collector.dyn-queue-size-memory.
* When the warm-up flag is set, the span processor starts a new background process to recalculate the queue capacity at regular intervals.
* The span processor now keeps track of the number of spans it has seen, as well as the total bytes it processed. That information allows to derive the average spans size that the collector has processed. Along with the memory to allocate, we can come up with the ideal queue size.
* New metrics: jaeger_collector_queue_capacity and jaeger_collector_spans_bytes.

This commit shouldn't cause performance degradation for instances not using the new options, except for a small memory overhead due to the new span tracker (quantity/size) and the new metrics.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory-sensitive collector queue size
3 participants