
[WIP]: Fix array growth thresholding #32035

Conversation

@NHDaly (Member) commented May 15, 2019

This addresses problem (1) in #28588 (comment): fixing the quadratic total insertion time complexity.

Currently this PR simply always grows at a rate of 1.5x (once the number of elements is greater than 10). Happy to explore other options!

Closes #8269.

NHDaly added 4 commits May 6, 2019 19:05
Instead of capping array growth to a constant increment (currently 1% of
physical RAM), which leads to quadratic total insertion complexity, we simply
lower the _rate_ of growth when we hit a threshold, maintaining amortized
linear total insertion cost while trading slightly more CPU usage for lower
memory usage.

This commit also increases the triggering threshold from 1% of physical
RAM to 30% of physical RAM.
This should be the fastest possible option, I think, at the expense of
larger memory use.

On my machine, compared with master, this brings the time to push! 2^30
elements one at a time onto a 2^30-size array down from 1000s to 18s.

The results of that calculation are below:

Benchmark:
```julia
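# Note: `g`, `samerand`, and `perf_push_multiple!` are assumed to come from the
# surrounding benchmark suite (BaseBenchmarks.jl); this snippet is not self-contained.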
for s in (2^27, 2^30,)
    vs = samerand(s)
    g["push_single_large!", s] = @benchmarkable push!(x, $(samerand())) setup=(x = copy($vs))
    g["push_multiple_large!", s] = @benchmarkable perf_push_multiple!(x, $vs) setup=(x = copy($vs))
end
```
Results on `master`:
```
  ("push_single_large!",    134217728) => Trial(768.667 ms)
  ("push_multiple_large!",  134217728) => Trial(6.169 s)
  ("push_single_large!",   1073741824) => Trial(28.277 s)
  ("push_multiple_large!", 1073741824) => Trial(1069.542 s)
```
Results after this commit:
```
  ("push_single_large!",    134217728) => Trial(766.012 ms)
  ("push_multiple_large!",  134217728) => Trial(2.211 s)
  ("push_single_large!",   1073741824) => Trial(20.854 s)
  ("push_multiple_large!", 1073741824) => Trial(18.894 s)
```

(Note though that the very large numbers fluctuate wildly, since the
process takes more virtual memory than I have physical memory, and is
swapping out to disk.)
- Has a small cutoff at 10 elements, because for small sizes growing by 1.5x
reallocates very often. (With integer truncation, 1*1.5 == 1, 2*1.5 == 3, 3*1.5 == 4.)
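
For reference, a small Julia transcription of the capacity rule these commits describe (2x up to the 10-element cutoff, 1.5x afterwards). The function name is made up and the real change lives in Julia's C runtime; this is illustration only:

```julia
# Illustrative transcription of the growth rule; not the actual implementation.
function new_capacity(curlen::Int)
    # Below the small-array cutoff, 1.5x truncates to tiny (or zero) increments,
    # so double instead.
    return curlen <= 10 ? curlen * 2 : floor(Int, curlen * 1.5)
end

# Capacities visited while repeatedly growing from a single element:
caps = [1]
while caps[end] < 100
    push!(caps, new_capacity(caps[end]))
end
# caps == [1, 2, 4, 8, 16, 24, 36, 54, 81, 121]
```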
@NHDaly NHDaly changed the title Nhdaly/array grow end memory growth threshold [WIP]: Fix array growth thresholding May 15, 2019
@KristofferC KristofferC added the needs nanosoldier run This PR should have benchmarks run on it label May 15, 2019
@JeffBezanson (Member):

It seems to me there should still be some kind of limit; 1% of RAM is just really stingy.

@NHDaly (Member, Author) commented May 15, 2019

In case you didn't see it, I've summarized my latest thoughts in this comment:
#28588 (comment)

My main concern is that there should never be a constant-size growth increment; it should always be a scaling growth factor. As long as we do that, I'm happy.

I think we could consider lowering that growth factor based on the array size (to address your suggestion), but doing so will still (I think) have implications for the time complexity. And given the benefits of virtual memory, it doesn't seem to actually be so terrible to let arrays grow larger than physical RAM anyway. Interested to hear your thoughts! :)
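
To make the quadratic-vs-linear point concrete, here is a small illustrative Julia simulation (not from the PR; the increment and factor are arbitrary) counting how many elements get copied during reallocations when pushing n elements one at a time:

```julia
# Count elements copied across all reallocations for a given capacity-growth rule.
function copies_to_push(n, grow)
    cap, copied = 1, 0
    for len in 1:n
        if len > cap
            copied += cap                # a reallocation copies the live elements
            cap = max(grow(cap), len)
        end
    end
    return copied
end

n = 10^6
copies_to_push(n, cap -> cap + 4096)            # constant increment: ~n^2 / (2*4096) copies
copies_to_push(n, cap -> floor(Int, 1.5 * cap)) # constant factor: ~3n copies (geometric sum)
```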

@JeffBezanson (Member):

Ok, that's a good argument. I'm ok with just doing 1.5x growth for now. I'd also really like to use realloc again. Large allocations are highly likely to be well aligned, so the try-and-check approach seems safe enough to me.

@StefanKarpinski (Member):

Isn't the golden ratio supposed to be the optimal growth factor?

@JeffBezanson (Member):

Yes, the golden ratio is optimal, and I suppose we don't mind using floating point, so we could use it. I wonder why it's not often used in practice though. Maybe you want some wiggle room in case the new size doesn't exactly fit for implementation reasons, or in case part of the old space has been allocated to something else?
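
A quick illustrative check of that folklore argument (assumptions: the freed blocks are contiguous and coalescible, and the still-live buffer can't be reused until after the copy); nothing here is from the PR:

```julia
# With growth factor r, buffers have sizes 1, r, r^2, ...; the k-th buffer can be
# placed into the space freed by buffers 0..k-2 once r^k <= (r^(k-1) - 1)/(r - 1),
# which eventually holds iff r is below the golden ratio (~1.618).
function new_buffer_fits_in_freed_space(r; steps = 64)
    freed, live = 0.0, 1.0
    for _ in 1:steps
        newsize = live * r
        newsize <= freed && return true   # reuse of freed space becomes possible
        freed += live                     # the old buffer is freed after the copy
        live = newsize
    end
    return false
end

new_buffer_fits_in_freed_space(1.5)  # true
new_buffer_fits_in_freed_space(2.0)  # false
```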

@KristofferC (Member):

Ref #16305

@JeffBezanson (Member):

Ah yes. That discussion also reminds me that we should grow faster (e.g. 2x) up to some not-too-big, not-too-small threshold like 1000. Under that size wasting memory doesn't matter as much, but doing less reallocation will probably bring measurable speedups.

@vtjnash (Member) commented May 17, 2019

Yes, fwiw, we discussed during JuliaCon using some other growth function like `n + k*sqrt(n)` or `n + k*log(n)`, because they can be made to exhibit the behavior you describe but with smoother cut-off points. It's then straightforward to pick the cross-over point such that it is strictly superior to the current exponential growth strategy for all finite array sizes.
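
For a rough feel for those alternatives (illustrative only; k = 4, the starting size, and the target are arbitrary choices, not from the discussion), one can compare reallocation counts and total copy volume:

```julia
# Count reallocations and copied elements while growing a capacity from `start`
# up to `target` under a given growth rule. Illustration only.
function realloc_stats(grow; start = 16, target = 10^8)
    cap, reallocs, copied = start, 0, 0
    while cap < target
        copied += cap
        cap = grow(cap)
        reallocs += 1
    end
    return (reallocs = reallocs, copied = copied)
end

realloc_stats(n -> n + 4 * isqrt(n))     # thousands of reallocations; copies ~ target^(3/2)
realloc_stats(n -> floor(Int, 1.5 * n))  # a few dozen reallocations; copies ~ 3 * target
```

The sub-exponential rules waste less memory near the top end but do asymptotically more copying per element, which is the trade-off debated below.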

@oscardssmith (Member):

@JeffBezanson The main reason the golden ratio is typically not used is that arrays are not that frequently grown, and if anything gets allocated in between it would go badly. IIRC, python uses 1.125 because they looked at a bunch of real world code and found that repeated pushes aren't that frequent.

@JeffBezanson (Member):

I agree that it's probably rare to do lots of unpredictable pushes to an array over its lifetime, but the case I worry about is where push! is used to initially populate an array. In that case growing too slowly to the final size can hurt performance quite a bit. 1.125 seems a bit too small for that case.

@StefanKarpinski (Member) commented May 22, 2019

How about using powers of two until the array size is an entire OS page? That should help avoid fragmentation since powers of two are easier to pack, and it makes the growth for small dicts higher, addressing what Jeff is concerned about. Once the array size is ≥ a page, one can use any growth factor at all since it will always be in terms of whole pages.
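
A sketch of that policy in Julia (illustration only; the 4096-byte page, the 1.5x factor above a page, and the function name are assumptions, not part of the proposal):

```julia
const PAGE = 4096  # bytes; a typical, but not universal, OS page size

# Hypothetical capacity rule: powers of two while the array is smaller than a
# page, then a fixed factor once growth happens in whole pages anyway.
function grow_capacity(curlen::Int, elsize::Int)
    if curlen * elsize < PAGE
        return nextpow(2, curlen + 1)
    else
        return floor(Int, curlen * 1.5)
    end
end

grow_capacity(100, 8)    # 128   (800 bytes < 4096, so next power of two)
grow_capacity(1000, 8)   # 1500  (8000 bytes >= 4096, so 1.5x)
```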

@NHDaly (Member, Author) commented May 22, 2019

@JeffBezanson The main reason the golden ratio is typically not used is that arrays are not that frequently grown, and if anything gets allocated in between it would go badly. IIRC, python uses 1.125 because they looked at a bunch of real world code and found that repeated pushes aren't that frequent.

Makes sense. Thanks for the explanation! :)

Yes, fwiw, we discussed during JuliaCon using some other growth function like `n + k*sqrt(n)` or `n + k*log(n)`, because they can be made to exhibit the behavior you describe but with smoother cut-off points. It's then straightforward to pick the cross-over point such that it is strictly superior to the current exponential growth strategy for all finite array sizes.

@vtjnash yeah, I was really excited about that at the time too. But after thinking about it more, I now think it's a bad idea, per the reasons I wrote in #28588 (comment), section "Changing growth factor based on physical RAM": basically I think that if you shrink (even slightly) the size you grow by after each growth, you will end up with an amortized insertion time bigger than O(1), probably O(n). I think you need to keep the growth rate constant in order to keep the amortized insertion time constant.

How about using powers of two until the array size is an entire OS page? That should help avoid fragmentation since powers of two are easier to pack, and it makes the growth for small dicts higher, addressing what Jeff is concerned about. Once the array size is ≥ a page, one can use any growth factor at all since it will always be in terms of whole pages.

This seems like a reasonable idea to me! I'll think about it more.


Also, lemme add @tveldhui for some more input. He's thought about this a bit as well.

@tveldhui:

Julia is designed to be high performance. To my mind, doubling the array size is most consistent with high performance: it reduces copies per element in situations where realloc doesn't just do MREMAP.

The physical memory limit seems a red herring. There aren't many people who are going to accidentally create one single array that approaches the size of physical memory. Most people working on big interesting problems have many data structures of varying sizes.

But consider the case where someone does have an interesting problem that requires one huge vector. If you picture a log-scale graph of problem sizes, the ones right around physical memory size are a razor-thin transition. If your problem fits in memory, you want to minimize copying for performance reasons. As you approach physical memory size you've got swap and overcommit to fall back on. If you're tackling problems bigger than main memory and determined to use a single Array/Vector, you're going to be relying on NVMe swap or somesuch.

From what I can tell, the main motivation for picking a growth factor smaller than 2x, or one that trails off, is to let VERY LARGE arrays get slightly closer to the physical RAM limit without OOMing on machines that don't have any swap available. This seems like a very rare use case to me, and it doesn't make sense to penalize 99.9% of use cases by choosing a lower-performance resizing heuristic to help a tiny fraction of users tackle marginally bigger problems before they OOM.

@NHDaly (Member, Author) commented May 22, 2019

Ah, yeah, that's a good point. My main motivation for suggesting 1.5x instead of 2x was to avoid this packing problem leaving holes in memory. But I guess if we tackle this at the same time as fixing realloc to actually do realloc, then the packing problem might not even be a real problem anymore, in which case I'm also down for just constant 2x growth.

I'll try to add the realloc fix to this PR, and then maybe get some graphs of memory use and CPU use for different growth factors?

@oscardssmith (Member):

@tveldhui the counterargument would be that frequent array growth is rare in performance-critical work. (And if it matters, you can always manually set capacity.) As a result, the place where memory efficiency matters most is large numbers of small arrays, which to me implies a fairly small growth factor consistently.
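
(In Julia, manually setting capacity is spelled `sizehint!`; a minimal example:)

```julia
# Reserve capacity up front when the final size is roughly known, so the
# repeated push! calls below shouldn't trigger intermediate reallocations.
results = Float64[]
sizehint!(results, 10_000)
for x in 1:10_000
    push!(results, sqrt(x))
end
```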

@NHDaly (Member, Author) commented May 22, 2019 via email

@timholy (Member) commented May 22, 2019

@oscardssmith

IIRC, python uses 1.125 because they looked at a bunch of real world code and found that repeated pushes aren't that frequent.

I doubt that's directly translatable. In a slow language like Python you don't want to create arrays element-by-element because the interpreter cost kills you---you do everything you can to create arrays all at once. Julia doesn't have that overhead, so there is less disincentive to use push!. FWIW, there are currently 619 uses of push! in base/ alone; I didn't take the time to see how many of them occur inside a loop, but I bet it's not small.

@tveldhui:

@tveldhui the counterargument would be that frequent array growth is rare in performance-critical work. (And if it matters, you can always manually set capacity.) As a result, the place where memory efficiency matters most is large numbers of small arrays, which to me implies a fairly small growth factor consistently.

We're implementing an ultra high performance database engine/machine learning system in Julia. I work on database queries. (And for the record I am in love with Julia for this, we are outperforming the very best database engines single-core.) As you evaluate a query you're often doing vector push! to accumulate the results. With meticulous sampling and hairy algorithms you can approximate the expected result size, but that doesn't help you set an appropriate size for the vector of results - if your estimate is slightly under you still have to resize. Or you can go for an upper bound on the result size, which is computationally intensive and can be off by orders of magnitude (e.g. you reserve 100 times more space than needed). Database theory is a sticky wicket.

So I disagree very much that frequent array growth is rare and that you can predetermine capacity. Most of our cpu time is spent in queries where we cannot predict in advance how many results will be produced, so a fast push! is critical.

Review thread on the diff hunk introducing the new growth rule:

```c
return alen + inc + a->offset + (jl_arr_xtralloc_limit / es);
}
return newlen;
return curlen <= 10 ? curlen * 2 : curlen * 1.5;
```

Review comment (Member):

Not that it matters much, but in #16305 I used `((curlen*3)>>1)` to avoid floating-point conversion here.
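
That integer expression computes the same floor(1.5x) growth; a quick Julia check (illustration only):

```julia
# (curlen * 3) >> 1 == floor(1.5 * curlen) in integer arithmetic, as long as
# curlen * 3 does not overflow.
all(((c * 3) >> 1) == floor(Int, 1.5 * c) for c in 1:10_000)  # true
```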

Review comment (Contributor) on the same hunk:

Should the threshold be increased to 1000 as per Jeff's comment?

@stevengj (Member) commented Sep 17, 2020:

Seems like it should be more like `curlen * elsz <= 4096`, e.g. about the typical page size (or I guess we could get the actual page size with `sysconf`).
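
For what it's worth, a sketch of that check in Julia (illustration only; the predicate name is made up, and `Mmap.PAGESIZE` is just one way to query the actual page size):

```julia
using Mmap  # Mmap.PAGESIZE reports the system page size (4096 on most platforms)

# Hypothetical predicate for the threshold being discussed: the whole array
# still fits within a single OS page.
below_one_page(curlen, elsz) = curlen * elsz <= Mmap.PAGESIZE

below_one_page(100, 8)    # true on a 4096-byte page (800 bytes)
below_one_page(1000, 8)   # false on a 4096-byte page (8000 bytes)
```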

@musm musm added the arrays label Dec 15, 2020
@musm (Contributor) commented Dec 15, 2020

@NHDaly any interest in picking this up again? Almost all the arguments are in favor of growing the arrays by a factor less than 2x, and based on this and the previous PR a 1.5x growth factor was deemed a reasonable compromise.

@eschnett (Contributor):

I hate to be that guy, but I notice that none of the arguments presented here are supported by either memory usage or run time benchmarks...

Benchmarks could include building Julia, running the test cases, or some sparse matrix algebra that might use push! to populate sparse matrices.

@KristofferC (Member):

I think this is closed via #40453.

@KristofferC KristofferC closed this Jun 9, 2021
Labels: arrays, needs nanosoldier run

Successfully merging this pull request may close these issues: Grow arrays by a factor less than 2?