
Does buffered Channel need to pre-allocate all memory upfront? #11572

Open
oprypin opened this issue Dec 11, 2021 · 26 comments

Comments

@oprypin
Member

oprypin commented Dec 11, 2021

Talking about this:

crystal/src/channel.cr

Lines 171 to 180 in a9ee750

def initialize(@capacity = 0)
@closed = false
@senders = Crystal::PointerLinkedList(Sender(T)).new
@receivers = Crystal::PointerLinkedList(Receiver(T)).new
if capacity > 0
@queue = Deque(T).new(capacity)
end
end

Channel's implementation is backed by a Deque, which can certainly grow on demand. However, that ability to grow is never used: the Deque is allocated with the buffer's maximal capacity upfront.

I understand that channels are usually meant to only have a small buffer, but maybe I consciously want a larger buffer of 10000. And then what if I want to create 1000 channels of that size, knowing that, the way my application works, only very few of them will ever reach that size (maybe I even have a way to ensure that)? I don't want to end up with tens of megabytes of mostly unused memory anyway.

So I propose to let the users choose to grow the channel on demand. In terms of implementation, as far as I can see, the only thing that needs to happen is changing this:

-     @queue = Deque(T).new(capacity) 
+     @queue = Deque(T).new

because everything else checks the channel's capacity explicitly against the current .size.
So mainly I think it's a pity that such a potentially useful feature is locked away behind a one-line change.

Who knows, maybe in terms of performance it's not even that helpful to always pre-allocate the buffer. Channels probably tend to live long whereas the growth is one-time.
So maybe the default could even be to not pre-allocate the buffer. But mainly I'm suggesting a configurable parameter - either a boolean or an integer to be passed to the constructor.
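A minimal sketch of what such a parameter could look like (the preallocate flag is hypothetical, just to show the shape of the API):

class Channel(T)
  # Hypothetical flag: when false, the buffer starts empty and grows on demand
  # instead of being allocated at full capacity up front.
  def initialize(@capacity = 0, preallocate = true)
    @closed = false
    @senders = Crystal::PointerLinkedList(Sender(T)).new
    @receivers = Crystal::PointerLinkedList(Receiver(T)).new
    if capacity > 0
      @queue = preallocate ? Deque(T).new(capacity) : Deque(T).new
    end
  end
end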

Again, maybe you'll lament the fact that I even mention this, but I don't see why Channel.new(Int32::MAX) shouldn't be possible. It's not like people usually intentionally choose an upper bound on an array's size, so why force it for channels?

@yxhuvud
Contributor

yxhuvud commented Dec 11, 2021

I think this is a fundamentally misguided proposal. It would make the performance and expected latency of properly used channels worse, while at the same time being a big honeypot that lets developers inexperienced in building robust networked systems mess up their designs with regard to back-pressure and system stability. Unbounded channels just move the underlying problem from a relatively visible place (blocking writes to the channel) to a relatively invisible place (a backlog that slowly grows until the system's latency curve looks like a hockey stick and it crashes). The suggestion doesn't solve the underlying problem, which is that the channel consumer doesn't process the data fast enough.
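For illustration, a minimal sketch of the back-pressure that a bounded channel gives you for free: the producer blocks as soon as the consumer falls behind, instead of a backlog growing silently.

ch = Channel(Int32).new(capacity: 4)

spawn do
  100.times do |i|
    ch.send i # blocks once 4 items sit unconsumed: the producer is throttled here
  end
  ch.close
end

while value = ch.receive?
  sleep 10.milliseconds # deliberately slow consumer; no unbounded backlog can build up
end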

maybe I consciously want a larger buffer of 10000. And then what if I want to create 1000 channels of that size,

Channels probably tend to live long.

These claims are conflicting. If you do this setup with long-lived channels, then you will get behavior that looks like a long-term memory leak, as the memory allocated for the channels will grow over time and never be released. Channels are a means for synchronization and communication, not a place to bloat memory usage in ways you have no good tools for inspecting.

If there is need for a threadsafe queue then I suggest someone actually builds that instead of making channels harder to reason about and making them worse for their primary purpose.

but I don't see why Channel.new(Int32::MAX) shouldn't be possible.

It is possible. It just forces you to pay up front for what you ask for. Lazy allocation works against what channels are good for: it makes a memory-usage problem under overload even harder to deal with than it already is.

Suggested further reading:
https://www.tedinski.com/2019/03/05/backpressure.html
https://dzone.com/articles/applying-back-pressure-when
https://ferd.ca/queues-don-t-fix-overload.html

@wyhaines
Contributor

@oprypin, there is also a behavioral interrelationship with this line:

https://github.com/crystal-lang/crystal/blob/master/src/channel.cr#L258

You can do a really simple implementation that monkey patches the existing library to eliminate an explicit, preallocated limit by doing this:

# Same as Channel(T)'s constructor, except the Deque starts empty and grows on
# demand; capacity defaults to Int32::MAX so sends effectively never block.
class NBC(T) < Channel(T)
  def initialize(@capacity = Int32::MAX)
    @closed = false

    @senders = Crystal::PointerLinkedList(Sender(T)).new
    @receivers = Crystal::PointerLinkedList(Receiver(T)).new

    @queue = Deque(T).new
  end
end

Without a set capacity to trigger the condition in the previously mentioned line, though, the nonblocking behavior doesn't work as expected.

My implementation is based very closely on the stdlib Channel, with some of the code paths for the blocking behavior changed or eliminated, but the above works just fine to test the concept. I wrote that code long enough ago, though, that I don't remember why I went to the trouble of refactoring some of the other bits instead of just going with the above.
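A quick sanity check of the monkey patch above (a sketch; nothing else is consuming the channel here): sends never block, because the Deque grows on demand while its size stays far below Int32::MAX.

ch = NBC(Int32).new

100_000.times { |i| ch.send i } # no send ever blocks; the queue grows as needed
puts ch.receive # => 0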

@jgaskins
Contributor

jgaskins commented Jun 9, 2022

I created #12098 for this exact same purpose because I missed this issue when searching. I'm encountering this issue because the design of NATS requires that each subscription have a channel so that you aren't handling messages inline. That channel needs to be deep enough to handle pretty intense spikes in throughput because the NATS server will disconnect you if you block subsequent messages for more than a few seconds.

The Go client for NATS sets the channel capacity at 512k by default, but that would be way too big for the current implementation of Crystal channels, so I have it set at 64k for my Crystal NATS client, and even that allocates 3-4MB per subscription.

@oprypin's mention of 1000 channels, 10k-deep, isn't unrealistic here. 1000 is on the high end but is not an unreasonable amount of NATS subscriptions in a coarse-grained service. To illustrate the impact that this pre-allocation has for this use case, this is what 1000 NATS subscribers looks like before any messages are sent or received (code is in #12098):

Screenshot of Activity Monitor showing 4.34GB of memory used

Without pre-allocating the full channel queue capacity, we can use 99.5% less memory:

Screenshot of Activity Monitor showing 23 MB of memory used

Most of the remaining memory usage is due to spawning a fiber for each of those 1000 subscriptions. It's notable that the Go client for NATS does not exhibit this memory bloat behavior, so Go channels presumably do not allocate eagerly.

maybe I consciously want a larger buffer of 10000. And then what if I want to create 1000 channels of that size,

Channels probably tend to live long.

These claims are conflicting. If you do this setup with long lived channels, then you will get a behavior that looks like a long term memory leak, as the memory allocated for the channels will grow over time and never be released.

@yxhuvud This is only true if every channel eventually needs to make full use of its capacity. In the case of NATS (and, presumably, most usage of a high quantity of high-capacity channels), the channels need to be that deep to handle exceptional throughput, not the common case, so the vast majority of that space goes unused most of the time. In fact, if you send to an empty channel on which another fiber is currently blocked on receive, the value is handed directly to that fiber and bypasses the queue entirely. If a particular channel is consumed faster than it's written to (in my use case, a low-volume subscription), this suggested change may mean the buffer for that channel is never allocated at all.
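As a sketch of that direct-handoff path (with lazy allocation the 10k-slot buffer below would never even be touched):

ch = Channel(Int32).new(capacity: 10_000)

spawn { ch.send 1 } # runs once the main fiber blocks below

# The main fiber parks as a waiting receiver; the send above hands the value
# over directly, so the buffered queue is bypassed entirely.
puts ch.receive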

Regarding memory growth, even if memory usage remains at the high-water mark (as opposed to being elastic) for a few channels, that's acceptable. It will almost certainly not be all channels, and the next time you deploy (or if you perform periodic restarts) memory usage will go right back down to the typical queue depth anyway.

this would make the performance and expected latency of properly used channels worse

@yxhuvud How much worse? Do you have a benchmark? It's important to understand the magnitudes of the tradeoff being discussed.

Also, what do you mean by "properly"?

@jgaskins
Contributor

jgaskins commented Jun 9, 2022

Another tradeoff worth noting: if we don't pre-allocate the queue, the current strategy of doubling the capacity whenever the size reaches it can end up well beyond the desired channel size unless that size is a power of 2. Probably not a huge deal for smaller or short-lived channels, but for larger, long-lived ones, this could cause even more memory bloat.

We might be able to let callers customize the queue growth, though, by providing a callback proc:

class Deque(T)
  @increase_capacity : Int32 -> Int32 = ->(capacity : Int32) { capacity * 2 }

  # Tell the Deque how to increase its capacity by passing in a block that
  # receives the current capacity and returns the new capacity.
  #
  # NOTE: The return value must be greater than the current capacity.
  def increase_capacity(&@increase_capacity : Int32 -> Int32) : self
    self
  end
end

class Channel(T)
  def initialize(capacity = 0)
    # ...
    if capacity > 0
      @queue = Deque(T).new.increase_capacity { |c| {capacity, c * 2}.min }
    end
  end
end
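For example, plain doubling from a small power-of-two buffer toward a desired capacity of 10,000 overshoots to 16,384, whereas the hypothetical hook above could cap growth at the channel's capacity:

# Double the buffer as usual, but never grow past the configured capacity.
queue = Deque(Int32).new.increase_capacity { |current| {10_000, current * 2}.min }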

@yxhuvud
Contributor

yxhuvud commented Jun 9, 2022

@jgaskins You are reading that code wrong. That 512k count is for something else. The default channel size they use for subscribers is specified at https://github.com/nats-io/nats.go/blob/69f0e65fe4fd1843bf799b36b2278222637753a2/nats.go#L61, and it is 64k. That is still quite a lot, but far more reasonable to set up on a per-client basis.

By properly I mean not overallocating. That is, either a few big channels or lots of small ones, not lots of big ones while relying on hope to avoid running out of memory.

@jgaskins
Contributor

jgaskins commented Jun 9, 2022

@jgaskins You are reading that code wrong. That 512k count is for something else. The default channel size for subscribers that they use is … 64k.

Looks like you're right, I got confused. I actually chose 64k in the Crystal client based on what was in the Go client, so I was surprised when I saw 512k in there when I looked yesterday. All things considered, though, that's not an important detail. As I mentioned above, allocating that many Go channels doesn't exhibit the same kind of memory bloat that the equivalent Crystal code does. Allocating 1000 64k-length channels in a Go app uses 10MB of RAM, but well over 3GB in Crystal. Code is included so feel free to check my work:

Go: Screenshot of the memory usage of a Go app that allocates 1000 channels with capacity of 64k NATS messages each, showing 10.2MB consumed
Crystal: Screenshot of the memory usage of a Crystal app that allocates 1000 channels with capacity of 64k NATS messages each, showing 3.36GB consumed
Go code
package main

import (
	"bufio"
	nats "github.com/nats-io/nats.go"
	"os"
)

func main() {
	var channels [1000]chan *nats.Msg

	for i := 0; i < 1000; i++ {
		channels[i] = make(chan *nats.Msg, 64*1024)
	}

	bufio.NewReader(os.Stdin).ReadString('\n')
}
Crystal code
require "nats"

channels = Array.new(1_000) do
  Channel(NATS::Message).new(64_000)
end

gets

With that in mind, I'm curious whether you still feel that lazily allocating that memory is bad for Crystal when it works fine for Go?

By properly I mean by not overallocating. That is, either a few big channels or lots of small ones.

Avoiding overallocating is the reason this issue was opened. We're going about it differently but we have the same goal.

Not by having lots of big ones and relying on hope for not running out of memory.

This is unnecessarily dismissive. Any time you parse JSON arrays eagerly (e.g. not with JSON::PullParser#read_array) from external input, you rely on hope that you don't run OOM. Of the entire stdlib, Channel is probably the least likely data structure to run us OOM because it blocks on a maximum capacity.
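For reference, the streaming alternative looks roughly like this; no array-sized buffer is ever allocated up front:

require "json"

pull = JSON::PullParser.new(%([1, 2, 3]))
pull.read_array do
  puts pull.read_int # each element is handled as it is parsed
end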

@straight-shoota
Member

straight-shoota commented Jun 9, 2022

I see no evidence of Go employing lazy allocation. It's a pure assumption based on circumstantial observation.

Looking at the Go implementation of makechan it appears to allocate the full size on initialization (see https://go.dev/src/runtime/chan.go#L107, mem is essentially elem.size * size). So that looks like it behaves very similar to the Crystal implementation.

I suspect the observed difference in memory usage comes from the fact that in your Crystal example the channel buffers Message structs, while the Go example uses pointers to Msg structs. Not sure if that explains the entire difference, but it clearly shows that the examples are not equivalent.
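A quick way to check that per-element difference (type names taken from the examples in this thread):

require "nats"

puts sizeof(Pointer(Void)) # per-slot cost in the Go example: one pointer (8 bytes on 64-bit)
puts sizeof(NATS::Message) # per-slot cost in the Crystal example: a full Message struct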

@yxhuvud
Contributor

yxhuvud commented Jun 9, 2022

When doing

chans = Array(Channel(Void*)).new(1000) { Channel(Void*).new(capacity: 64_000) }

sleep 100

then it allocates roughly 208 MB of resident memory for me. But when increasing the channel count, it goes up:
2000: 990 MB
3000: 1.4 GB
4000: 1.9 GB
5000: 2.4 GB

This memory usage doesn't make much sense to me. There might be something weird going on, but I doubt it is as simple as preallocating the buffer vs not doing that.

@straight-shoota
Member

straight-shoota commented Jun 9, 2022

@yxhuvud Using this example, I see very consistent heap size growth proportional to the number of channels.
1000: 515MiB
2000: 0.99GiB
4000: 1.96GiB
Perhaps your measurement for 1000 channels was flawed?

@jgaskins
Contributor

jgaskins commented Jun 9, 2022

I see no evidence for Go employing lazy loading. It's a pure assumption based on circumstantial observation.

@straight-shoota Who's making assumptions? The Go implementation uses 10MB of RAM. If you allocate 1000 buffers to hold 64k elements you've allocated 64M elements. My only assumption was that you can't fit 64M objects into 10MB or less. If they're pointers, that comes out to 500MB (the Go client uses 64*1024 instead of a flat 64000).
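The arithmetic behind that, for reference:

slots = 1000_i64 * 64 * 1024       # 65_536_000 buffered slots across all channels
puts slots * sizeof(Pointer(Void)) # => 524_288_000 bytes, i.e. ~500 MB if each slot is a pointer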

If you're interested in swapping structs and references in either implementation, I've made it convenient for you to check by putting the code into the post.

Looking at the Go implementation of makechan it appears to allocate the full size on initialization (see https://go.dev/src/runtime/chan.go#L107, mem is essentially elem.size * size). So that looks like it behaves very similar to the Crystal implementation.

I read the makechan implementation before posting #12098 and I thought the same thing you did based on my reading of that code. As much as possible, though, I try to run code to check my understanding of it. Claims of performance and/or memory usage aren't useful without seeing it run.

@straight-shoota
Member

straight-shoota commented Jun 10, 2022

I added the following code to the Go test to read the allocated heap memory:

var stats runtime.MemStats
runtime.ReadMemStats(&stats)
fmt.Printf("alloc: %d\n", stats.Alloc)

It reports a heap size of 524 MB for me. That matches the expectation and documentation for make(chan t, size).
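The closest Crystal-side counterpart to that measurement would be the GC statistics, e.g.:

# Run after allocating the channels (and before `gets`) to read the GC heap size.
stats = GC.stats
puts "heap_size: #{stats.heap_size}, total allocated: #{stats.total_bytes}"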

@jgaskins
Contributor

On my machine it reports 2.7MB (alloc: 2874712) if I use pointers in the channels, 12MB (alloc: 12833624) for structs.

@beta-ziliani
Member

I don't know how Go does it, but if your app takes seconds to process a message and might see a large spike, waiting for the allocator sounds like a bad idea. That said, I don't oppose a more flexible API, but I'd prefer the default to stay as it is now.

BTW, I made some tests of my own to check the seemingly contradictory measurements we saw here:

  1. With --release, I see the same measurements as @straight-shoota (perfectly linear in the number of channels, around half a GB per 1000 channels).
  2. Go takes a handful of MB for me, like @jgaskins says.
  3. If capacity is set to 0 (i.e. no pre-allocation), then Crystal takes less than 2MB (for 1000 channels).

@yxhuvud
Contributor

yxhuvud commented Jun 13, 2022

Could it be that the reason Go underallocates (at least on some systems) is that it doesn't zero the memory allocated to the buffers it uses? On Linux, that can make a big difference, as by default the OS won't actually back allocated pages that have never been written to. So it could be a macOS/Linux difference.

@straight-shoota
Member

Yes, I'm suspecting something like that as well. However, there are some oddities.
I would expect MemStats.Alloc to show the total amount that Go's memory manager has allocated from the OS, regardless of whether the OS actually backed the memory pages. So the ridiculously low readings there seem odd. It could explain low memory usage metrics reported by the OS, though.

My report of 524 MB (#11572 (comment)) is on Linux.
I suspect @beta-ziliani was on macOS. @jgaskins, what did you use?

@jgaskins
Contributor

macOS. I hadn’t considered the possibility of the OS doing alloc-on-write, but that could indeed explain the differences we were seeing in memory usage at the OS level. I also agree that it’s surprising that Go’s allocator didn’t track it.

@jgaskins
Contributor

if your app requires seconds to process, and might have a large spike, waiting for the allocator sounds like a bad idea

@beta-ziliani Time spent in the allocator is a rounding error compared to processing time, and is a fixed cost spread out over time. Don't get me wrong, I'm all for making things ridiculously, unbelievably, and sometimes unnecessarily fast. 😂 But that performance comes at a cost here.

Since discussions of performance aren't useful without benchmarks, I wrote one that measures the amount of time it takes to populate a lazy channel vs an eager channel from empty. This is the worst-case scenario for the lazy-allocated channel (nearly everything here works against the proposed solution) and does not show the cost amortized over time: after this initial cost, the two perform identically. Note that this code uses Benchmark.measure rather than Benchmark.ips so that it measures only the time it takes to populate the channel, which should show the largest disparity between them. The GC.realloc calls from resizing the buffer still happen inside the Benchmark.measure block.

Benchmark code
require "benchmark"

measurements = {} of String => Array(Benchmark::BM::Tms)
iterations = 1_000
channel_depth = 64_000

Array.new(17) { |i| 2 ** i }.each do |channel_depth|
  pp channel_depth: channel_depth
  {LazyBufferChannel(Message), ::Channel(Message)}.each do |type|
    measurements[type.name] = Array(Benchmark::BM::Tms).new(iterations)

    iterations.times do
      channel = type.new(65536)
      GC.collect
      measurement = Benchmark.measure do
        channel_depth.times do
          channel.send Message.new(
            subject: "test subject",
            body: "hello".to_slice,
          )
        end
      end

      measurements[type.name] << measurement
    end
  end

  stats = measurements.transform_values do |m|
    m.sort_by!(&.total)
    {
      avg: m.sum(&.total).seconds / m.size,
      p50: "#{m[m.size // 2].total.humanize}s",
      p90: "#{m[(m.size * 0.9).to_i].total.humanize}s",
      p99: "#{m[(m.size * 0.99).to_i].total.humanize}s",
      max: "#{m[-1].total.humanize}s",
    }
  end

  pp stats
end

class LazyBufferChannel(T) < ::Channel(T)
  def initialize(@capacity = 0)
    @closed = false

    @senders = Crystal::PointerLinkedList(Sender(T)).new
    @receivers = Crystal::PointerLinkedList(Receiver(T)).new

    if capacity > 0
      @queue = Deque(T).new
    end
  end
end

struct Message
  getter subject : String
  getter body : Bytes
  getter reply_to : String?
  getter headers : Headers?

  alias Headers = Hash(String, String)

  def initialize(@subject, @body, @reply_to = nil, @headers = nil)
  end
end
Results
{channel_depth: 1}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000016763,
   p50: "13.0µs",
   p90: "32.0µs",
   p99: "76.0µs",
   max: "112µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000007571,
   p50: "4.0µs",
   p90: "18.0µs",
   p99: "35.0µs",
   max: "62.0µs"}}
{channel_depth: 2}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000009396,
   p50: "5.0µs",
   p90: "22.0µs",
   p99: "47.0µs",
   max: "93.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000007171,
   p50: "3.0µs",
   p90: "18.0µs",
   p99: "36.0µs",
   max: "46.0µs"}}
{channel_depth: 4}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000008975,
   p50: "5.0µs",
   p90: "22.0µs",
   p99: "45.0µs",
   max: "78.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000006967,
   p50: "3.0µs",
   p90: "18.0µs",
   p99: "38.0µs",
   max: "53.0µs"}}
{channel_depth: 8}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000010295,
   p50: "4.0µs",
   p90: "27.0µs",
   p99: "50.0µs",
   max: "76.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000007220,
   p50: "3.0µs",
   p90: "19.0µs",
   p99: "38.0µs",
   max: "45.0µs"}}
{channel_depth: 16}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000011313,
   p50: "6.0µs",
   p90: "28.0µs",
   p99: "46.0µs",
   max: "69.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000007498,
   p50: "3.0µs",
   p90: "19.0µs",
   p99: "34.0µs",
   max: "47.0µs"}}
{channel_depth: 32}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000012855,
   p50: "6.0µs",
   p90: "31.0µs",
   p99: "56.0µs",
   max: "74.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000007805,
   p50: "3.0µs",
   p90: "20.0µs",
   p99: "36.0µs",
   max: "46.0µs"}}
{channel_depth: 64}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000014481,
   p50: "7.0µs",
   p90: "35.0µs",
   p99: "55.0µs",
   max: "87.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000008797,
   p50: "3.0µs",
   p90: "24.0µs",
   p99: "42.0µs",
   max: "49.0µs"}}
{channel_depth: 128}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000016847,
   p50: "7.0µs",
   p90: "43.0µs",
   p99: "59.0µs",
   max: "70.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000011465,
   p50: "6.0µs",
   p90: "28.0µs",
   p99: "49.0µs",
   max: "60.0µs"}}
{channel_depth: 256}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000020807,
   p50: "8.0µs",
   p90: "49.0µs",
   p99: "71.0µs",
   max: "83.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000015415,
   p50: "7.0µs",
   p90: "39.0µs",
   p99: "56.0µs",
   max: "72.0µs"}}
{channel_depth: 512}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000026696,
   p50: "12.0µs",
   p90: "58.0µs",
   p99: "74.0µs",
   max: "92.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000019087,
   p50: "7.0µs",
   p90: "45.0µs",
   p99: "69.0µs",
   max: "82.0µs"}}
{channel_depth: 1024}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000035357,
   p50: "21.0µs",
   p90: "67.0µs",
   p99: "84.0µs",
   max: "93.0µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000025538,
   p50: "11.0µs",
   p90: "57.0µs",
   p99: "78.0µs",
   max: "98.0µs"}}
{channel_depth: 2048}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000046796,
   p50: "32.0µs",
   p90: "78.0µs",
   p99: "96.0µs",
   max: "115µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000035632,
   p50: "20.0µs",
   p90: "69.0µs",
   p99: "86.0µs",
   max: "113µs"}}
{channel_depth: 4096}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000070418,
   p50: "55.0µs",
   p90: "103µs",
   p99: "121µs",
   max: "139µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000051720,
   p50: "37.0µs",
   p90: "85.0µs",
   p99: "112µs",
   max: "124µs"}}
{channel_depth: 8192}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000118550,
   p50: "105µs",
   p90: "151µs",
   p99: "172µs",
   max: "197µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000084101,
   p50: "70.0µs",
   p90: "117µs",
   p99: "134µs",
   max: "178µs"}}
{channel_depth: 16384}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000216305,
   p50: "205µs",
   p90: "248µs",
   p99: "270µs",
   max: "282µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000153760,
   p50: "139µs",
   p90: "187µs",
   p99: "207µs",
   max: "263µs"}}
{channel_depth: 32768}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000414191,
   p50: "407µs",
   p90: "448µs",
   p99: "470µs",
   max: "576µs"},
 "Channel(Message)" =>
  {avg: 00:00:00.000291276,
   p50: "275µs",
   p90: "325µs",
   p99: "350µs",
   max: "467µs"}}
{channel_depth: 65536}
{"LazyBufferChannel(Message)" =>
  {avg: 00:00:00.000822434,
   p50: "820µs",
   p90: "874µs",
   p99: "949µs",
   max: "1.01ms"},
 "Channel(Message)" =>
  {avg: 00:00:00.000561991,
   p50: "547µs",
   p90: "594µs",
   p99: "614µs",
   max: "867µs"}}

@jgaskins
Contributor

It's also worth noting that it's deemed acceptable to wait on the allocator for things like String.build and calling from_json on every string, array, and hash, which allocate buffers exactly as @oprypin and I are suggesting for channel queues. And we have to pay that cost pretty frequently. For example, every single Crystal web framework that I've looked at (except Armature) spins up a fresh string buffer to write the response to, which is allocated exactly like this lazy Channel idea.
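For comparison, String.build starts with a small buffer and reallocates as content is written, which is exactly the allocation pattern proposed here for the channel queue:

# The builder's internal buffer grows on demand rather than being sized up
# front for the final string.
response = String.build do |io|
  10_000.times { |i| io << "chunk #{i}\n" }
end
puts response.bytesize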

@straight-shoota
Member

straight-shoota commented Jun 15, 2022

I'm still a bit confused about the observed behaviour in Golang (when lazy allocation kicks in and when it doesn't).

But I can easily recreate it in Crystal by allocating the channel buffer with mmap instead of malloc. mmap reserves the address space but does not allocate all the physical memory (only the first page or so); the rest is allocated when it's actually accessed. We already use the same mechanism for fiber stacks in Crystal, which grow as they are used.

without patch:
Maximum resident set size (kbytes): 512228
with patch:
Maximum resident set size (kbytes): 4668
patch
diff --git i/src/channel.cr w/src/channel.cr
index 04a7b8b67..b7c3a2577 100644
--- i/src/channel.cr
+++ w/src/channel.cr
@@ -175,7 +175,9 @@ class Channel(T)
     @receivers = Crystal::PointerLinkedList(Receiver(T)).new

     if capacity > 0
-      @queue = Deque(T).new(capacity)
+      byte_size = capacity * sizeof(T)
+      buffer = LibC.mmap(nil, byte_size, LibC::PROT_READ | LibC::PROT_WRITE, LibC::MAP_PRIVATE | LibC::MAP_ANON, -1, 0)
+      @queue = Deque(T).new(buffer.as(Pointer(T)), capacity, 0)
     end
   end

diff --git i/src/deque.cr w/src/deque.cr
index 214e55044..65df4ad2a 100644
--- i/src/deque.cr
+++ w/src/deque.cr
@@ -52,6 +52,9 @@ class Deque(T)
     end
   end

+  def initialize(@buffer : Pointer(T), @capacity, @size)
+  end
+
   # Creates a new `Deque` of the given size filled with the same value in each position.
   #
   # ```

This patch is just a PoC.
It works in Golang without any special handling at the call site because big allocations are already implemented lazily in the runtime's custom memory management.

I think we should probably consider a similar approach for Crystal where large allocations (in Golang > 32KiB) use mmap and thus can be lazily allocated.
This would trivially solve this issue.
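A rough sketch of what that split could look like at an allocation site (the 32 KiB threshold is borrowed from Golang; GC integration is deliberately ignored here):

# Hypothetical helper, illustration only: large buffers come from mmap (backed
# lazily by the OS), small ones from the GC heap as today. Not GC-aware.
def allocate_buffer(type : T.class, capacity : Int32) : Pointer(T) forall T
  byte_size = capacity.to_u64 * sizeof(T)
  if byte_size > 32 * 1024
    LibC.mmap(nil, byte_size, LibC::PROT_READ | LibC::PROT_WRITE,
      LibC::MAP_PRIVATE | LibC::MAP_ANON, -1, 0).as(Pointer(T))
  else
    Pointer(T).malloc(capacity)
  end
end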

@yxhuvud
Contributor

yxhuvud commented Jun 15, 2022

@straight-shoota Hmm, is the memory collected by the GC if allocated that way?

I guess Boehm allocation zeroes the memory, as that would be necessary when memory is reused instead of allocated from scratch.

EDIT: Looking at the code in Boehm, the BZERO happens under certain conditions that I don't really understand. So perhaps it doesn't zero it.

@straight-shoota
Member

Hmm, is the memory collected by the GC if allocated that way?

No. That's why it's just a PoC 😆 Making this production ready requires integration with the garbage collector. And probably some kind of memory management on our end.

@carlhoerberg
Contributor

Boehm can be configured to allocate with mmap instead of sbrk; would that have a similar effect?

@carlhoerberg
Contributor

Update: no, it didn't have any effect.

@straight-shoota
Member

BDWGC zeros out allocated memory, hence it accesses all of it.

@yxhuvud
Contributor

yxhuvud commented Jun 16, 2022

I guess it simplifies the GC a lot to assume that all allocated memory should be looked at, as otherwise it would have to keep track of what has been written to. I wonder how Go solves that.

@ysbaddaden
Contributor

we should probably consider a similar approach for Crystal where large allocations (in Golang > 32KiB) use mmap

This is basically taking over the "Large Objects" space of the GC.

Hmm, is the memory collected by the GC if allocated that way?

It would if we report it as a root in the GC.before_collect callback, which means we need to keep a list of them, and their current size (so we only read what's significant). This is exactly what we do for Fiber stacks.
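A minimal sketch of that bookkeeping, with a hypothetical GC.push_range standing in for the GC's actual root-pushing call:

# Sketch only: track each mmap-allocated buffer's live range and report it to
# the GC before every collection, the same way fiber stacks are handled.
LARGE_BUFFERS = [] of {Pointer(Void), Pointer(Void)} # {from, to} of each buffer

GC.before_collect do
  LARGE_BUFFERS.each do |(from, to)|
    GC.push_range(from, to) # hypothetical; not an existing API
  end
end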
