[WIP]: Fix array growth thresholding #32035
Conversation
Instead of capping array growth at a constant increment (currently 1% of physical RAM), which leads to quadratic total growth cost, we simply lower the _rate_ of growth when we hit a threshold, maintaining the amortized linear growth cost while trading off somewhat more memory usage for much less CPU time spent copying. This commit also increases the triggering threshold from 1% of physical RAM to 30% of physical RAM.
This should be the fastest possible option, I think, at the expense of larger memory use. On my machine, compared with master, this brings the time to `push!` 2^30 elements one at a time onto a 2^30-size array down from 1000s to 18s. The results of that calculation are below.

Benchmark:

```julia
for s in (2^27, 2^30,)
    vs = samerand(s)
    g["push_single_large!", s] = @benchmarkable push!(x, $(samerand())) setup=(x = copy($vs))
    g["push_multiple_large!", s] = @benchmarkable perf_push_multiple!(x, $vs) setup=(x = copy($vs))
end
```

Results on `master`:

```
("push_single_large!", 134217728) => Trial(768.667 ms)
("push_multiple_large!", 134217728) => Trial(6.169 s)
("push_single_large!", 1073741824) => Trial(28.277 s)
("push_multiple_large!", 1073741824) => Trial(1069.542 s)
```

Results after this commit:

```
("push_single_large!", 134217728) => Trial(766.012 ms)
("push_multiple_large!", 134217728) => Trial(2.211 s)
("push_single_large!", 1073741824) => Trial(20.854 s)
("push_multiple_large!", 1073741824) => Trial(18.894 s)
```

(Note though that the very large numbers fluctuate wildly, since the process takes more virtual memory than I have physical memory, and is swapping out to disk.)
- The growth rule has a small cutoff at 10 elements, because for small sizes 1.5x growth triggers very often, and with integer truncation it can stall entirely. (For example, 1*1.5 == 1, 2*1.5 == 3, 3*1.5 == 4.)
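A minimal Julia sketch of that rule (illustration only; the actual change is in the C runtime's `array.c`, and the function name here is made up):

```julia
# Minimal sketch of the growth rule in this PR: double very small arrays,
# grow by 1.5x (truncating, as the C code does) after that.
grow_capacity(curlen) = curlen <= 10 ? curlen * 2 : floor(Int, curlen * 1.5)

# Starting from a capacity of 1, this yields 2, 4, 8, 16, 24, 36, 54, 81, 121, ...
```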
It seems to me there should still be some kind of limit; 1% of RAM is just really stingy. |
In case you didn't see it, I've summarized my latest thoughts in this comment: My main concern is that there should never be a constant-size growth increment; it should always be a scaling growth factor. As long as we do that, I'm happy. I think we could consider lowering that growth factor based on the array size (to address your suggestion), but doing so will still (I think) have implications for the time complexity. And given the benefits of virtual memory, it doesn't seem to actually be so terrible to let arrays grow larger than physical RAM anyway. Interested to hear your thoughts! :) |
Ok, that's a good argument. I'm ok with just doing 1.5x growth for now. I'd also really like to use realloc again. Large allocations are highly likely to be well aligned, so the try-and-check approach seems safe enough to me. |
Isn't the golden ratio supposed to be the optimal growth factor? |
Yes, the golden ratio is optimal, and I suppose we don't mind using floating point, so we could use it. I wonder why it's not often used in practice though. Maybe you want some wiggle room in case the new size doesn't exactly fit for implementation reasons, or in case part of the old space has been allocated to something else? |
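If it helps, here is a rough illustration of the usual argument (a toy model in Julia, assuming an idealized allocator that could hand contiguous freed blocks back to the same array, which real allocators and Julia's GC only approximate):

```julia
# Count how many growth steps it takes before the next block could fit into the
# space freed by all earlier reallocations (idealized model, not Julia's allocator).
function first_reusable_step(factor; start=16.0, steps=64)
    live, freed = start, 0.0
    for n in 1:steps
        request = factor * live        # size of the next, bigger block
        request <= freed && return n   # it would fit into previously freed space
        freed += live                  # the old block is freed after the copy
        live = request
    end
    return nothing                     # never fits within `steps` growths
end

first_reusable_step(1.5)   # a small, finite number of steps
first_reusable_step(2.0)   # nothing: each new block exceeds all previous blocks combined
```

Factors below the golden ratio eventually let the freed space catch up; at 2x it never does.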
Ref #16305 |
Ah yes. That discussion also reminds me that we should grow faster (e.g. 2x) up to some not-too-big, not-too-small threshold like 1000. Under that size wasting memory doesn't matter as much, but doing less reallocation will probably bring measurable speedups. |
Yes, fwiw, we discussed during JuliaCon using some other growth function like |
@JeffBezanson The main reason the golden ratio is typically not used is that arrays are not that frequently grown, and if anything gets allocated in between it would go badly. IIRC, python uses 1.125 because they looked at a bunch of real world code and found that repeated pushes aren't that frequent. |
I agree that it's probably rare to do lots of unpredictable pushes to an array over its lifetime, but the case I worry about is where |
How about using powers of two until the array size is an entire OS page? That should help avoid fragmentation since powers of two are easier to pack, and it makes the growth for small arrays higher, addressing what Jeff is concerned about. Once the array size is ≥ a page, one can use any growth factor at all since it will always be in terms of whole pages. |
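A hedged sketch of that heuristic in Julia (the function name and the hard-coded 4096-byte page size are assumptions for illustration, not anything in this PR):

```julia
const ASSUMED_PAGE_SIZE = 4096  # typical page size; the real thing could query the OS

# Hypothetical capacity policy: powers of two while the array is smaller than a
# page, then any factor (1.5x here), since the underlying allocation is made in
# whole pages at that point anyway.
function page_aware_capacity(curlen::Int, elsz::Int)
    if curlen * elsz < ASSUMED_PAGE_SIZE
        return max(4, curlen * 2)       # stay on powers of two below a page
    else
        return (curlen * 3) >> 1        # 1.5x once we are at least a page
    end
end
```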
Makes sense. Thanks for the explanation! :)
@vtjnash Yeah, I was really excited about that at the time too. But after thinking about it more, I now think it's a bad idea, per the reasons I wrote in #28588 (comment), section "Changing growth factor based on physical RAM": basically, I think that if you shrink (even slightly) the size you grow by after each growth, you will end up with an amortized insertion time bigger than O(1).
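To make that concrete, a toy model in Julia (a sketch, not the runtime's code; `decay` is a made-up knob standing in for "shrink the growth factor after every reallocation"):

```julia
# Toy model: count element copies caused by resizing while pushing n elements,
# for a fixed growth factor versus a factor that decays toward 1 after each
# reallocation.
function avg_copies_per_push(n; factor=1.5, decay=1.0)
    cap, len, copied, f = 4, 0, 0, factor
    while len < n
        if len == cap                          # out of room: reallocate and copy
            copied += len
            cap = max(cap + 1, floor(Int, cap * f))
            f = 1 + (f - 1) * decay            # optionally shrink the factor
        end
        len += 1
    end
    return copied / n
end

avg_copies_per_push(10^7)              # ≈ small constant (~2–3): amortized O(1) per push
avg_copies_per_push(10^7; decay=0.9)   # grows with n: amortized cost is no longer O(1)
```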
This seems like a reasonable idea to me! I'll think about it more. Also, lemme add @tveldhui for some more input. He's thought about this a bit as well. |
Julia is designed to be high performance. To my mind, doubling the array size is most consistent with high performance: it reduces copies per element in situations where realloc doesn't just do MREMAP.

The physical memory limit seems a red herring. There aren't many people who are going to accidentally create one single array that approaches the size of physical memory. Most people working on big interesting problems have many data structures of varying sizes.

But consider the case where someone does have an interesting problem that requires one huge vector. If you picture a log-scale graph of problem sizes, the ones right around physical memory size are a razor-thin transition. If your problem fits in memory, you want to minimize copying for performance reasons. As you approach physical memory size you've got swap and overcommit to fall back on. If you're tackling problems bigger than main memory and determined to use a single Array/Vector, you're going to be relying on NVMe swap or somesuch.

From what I can tell, the main motivation for picking a smaller growth factor than 2x, or having one that trails off, is to allow very large arrays to get slightly closer to the physical RAM limit without OOMing on machines that don't have any swap available. This seems like a very rare use-case to me, and it doesn't make sense to penalize 99.9% of use cases by choosing a lower-performance resizing heuristic to help a tiny fraction of users tackle marginally bigger problems before they OOM. |
Ah, yeah, that's a good point. My main motivation for suggesting 1.5x instead of 2x was to avoid this packing problem leaving holes in memory, but I guess that matters less if we tackle this at the same time as fixing realloc. I'll try to add the realloc fix to this PR, and then maybe get some graphs of memory use and CPU use for different growth factors? |
@tveldhui the counter argument would be that frequent array growth is rare in performance critical work. (And if it matters, you can always manually set capacity). As a result the place memory efficiency matters most is large numbers of small arrays, which to me implies a fairly small growth factor consistently. |
Oscar, for many small vectors I would feel even more strongly about wanting a larger growth factor.

There's obviously a CPU performance vs. memory tradeoff: a smaller growth rate would grow more often (more CPU expensive) in exchange for wasting less memory from overallocating.

But from my perspective, the memory gains aren't that significant from a smaller growth rate. In the 2x growth scenario, if you assume the absolute worst case, every vector would be exactly twice as big as it should be. So in the absolute best case from a memory perspective, you only have a constant factor improvement to make by using a smaller rate.

But you pay the CPU cost _every time_ you grow the array, so the potential cost there is large. And smaller arrays necessarily grow more often, so with many small arrays a lot of your CPU would be spent in the allocator.

So for many small arrays, I would favor a larger growth rate even more strongly, since I wouldn't want to be regrowing them all the time every few inserts.

Does that seem reasonable?
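For concreteness, a rough back-of-the-envelope sketch of that trade-off (a toy calculation in Julia, not a benchmark of the real allocator):

```julia
# For growth factor g and n pushes: reallocations ≈ log_g(n), amortized copies
# ≈ g/(g-1) per push, worst-case over-allocation ≈ (g-1)*n elements.
n = 10^8
for growth in (1.5, 2.0)
    println("factor $growth: ~", ceil(Int, log(growth, n)), " reallocations, ~",
            round(growth / (growth - 1); digits=1), " copies per push, ",
            "worst-case over-allocation ≈ ", round(Int, (growth - 1) * n), " elements")
end
```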
|
I doubt that's directly translatable. In a slow language like Python you don't want to create arrays element-by-element because the interpreter cost kills you---you do everything you can to create arrays all at once. Julia doesn't have that overhead, so there is less disincentive to use `push!`. |
We're implementing an ultra high performance database engine/machine learning system in Julia. I work on database queries. (And for the record I am in love with Julia for this, we are outperforming the very best database engines single-core.) As you evaluate a query you're often doing vector push! to accumulate the results. With meticulous sampling and hairy algorithms you can approximate the expected result size, but that doesn't help you set an appropriate size for the vector of results - if your estimate is slightly under you still have to resize. Or you can go for an upper bound on the result size, which is computationally intensive and can be off by orders of magnitude (e.g. you reserve 100 times more space than needed). Database theory is a sticky wicket. So I disagree very much that frequent array growth is rare and that you can predetermine capacity. Most of our cpu time is spent in queries where we cannot predict in advance how many results will be produced, so a fast push! is critical. |
return alen + inc + a->offset + (jl_arr_xtralloc_limit / es);
} | ||
return newlen; | ||
return curlen <= 10 ? curlen * 2 : curlen * 1.5; |
Not that it matters much, but in #16305 I used `((curlen*3)>>1)` to avoid floating-point conversion here.
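For reference, the two forms agree for non-negative lengths (ignoring overflow of `curlen*3`); a quick Julia check, just as an illustration:

```julia
# The shift form matches the truncated floating-point form for non-negative lengths.
all(n -> (n * 3) >> 1 == floor(Int, n * 1.5), 0:10^6)   # true
```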
return alen + inc + a->offset + (jl_arr_xtralloc_limit / es);
} | ||
return newlen; | ||
return curlen <= 10 ? curlen * 2 : curlen * 1.5; |
Should the threshold be increased to 1000 as per Jeff's comment?
Seems like it should be more like `curlen * elsz <= 4096`, e.g. about the typical page size (or I guess we could get the actual page size with `sysconf`).
@NHDaly any interest in picking this up again? Almost all the arguments are in favor of growing the arrays by a factor less than 2x, and based on this and the previous PR a 1.5x growth factor was deemed a reasonable compromise. |
I hate to be that guy, but I notice that none of the arguments presented here are supported by either memory usage or run time benchmarks... Benchmarks could include building Julia, running the test cases, or some sparse matrix algebra that might use |
I think this is closed via #40453. |
To address problem (1) in #28588 (comment): Fixing the quadratic total insertion time complexity.
In this PR, currently I just always grow at a rate of 1.5x (after the number of elements is greater than 10). Happy to explore other options!
Closes #8269.