[SPARK-4808] Removing minimum number of elements read before spill check #4420
Conversation
Can one of the admins verify this patch?
This is following up @andrewor14's comments on #3656. It makes the threshold and frequency configurable rather than completely removing them. Please let me know if I should add documentation for these configurations as well!
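For readers skimming the thread, here is a rough sketch of what "making the threshold and frequency configurable" could look like. The config key names, defaults, and class name below are illustrative assumptions only, not the keys actually proposed in this PR:

```scala
import org.apache.spark.SparkConf

// Hypothetical helper; config keys and defaults are illustrative only.
class SpillCheckConfig(conf: SparkConf) {
  // Minimum number of elements to read before the first spill check.
  val minElementsBeforeSpillCheck: Long =
    conf.getLong("spark.shuffle.spill.minElementsBeforeCheck", 1000L)

  // After that, check for spilling once every this many elements.
  val spillCheckFrequency: Long =
    conf.getLong("spark.shuffle.spill.checkFrequency", 32L)
}
```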
Thanks @mingyukim. On second thought, I'm not so sure we should make these configurable, since doing so pushes the responsibility onto users to experiment and find the optimal values for these very advanced tuning parameters themselves. I think we should explore other alternatives, like using the memory size as an additional heuristic, instead of introducing highly advanced configs that we can't remove in the future for backward-compatibility reasons.
Can you elaborate on the "memory size as an additional heuristic" idea? This is consistently causing OOMs in one of our workflows, which is exactly what spilling to disk is supposed to handle. I'm happy to work on it on my end if you have suggestions; I have a few ideas off the top of my head.
ok to test
Test build #27095 has started for PR 4420 at commit
Test build #27095 has finished for PR 4420 at commit
Test PASSed.
@mingyukim we can't just remove the check because this is a hot code path. Otherwise we'll try to access a synchronized block (inside the shuffle memory manager) on every single element, which is expensive.
To be clear, I don't mean to spill on every element; I only mean to spill when the collection actually needs more memory than it has reserved. I agree with you that spill shouldn't be done too frequently. However, the number of elements is the wrong proxy for controlling the frequency of spills, because you never know how large each element will be. The number of bytes is a better proxy, so the Spillable data structure would accumulate at least N bytes' worth of elements (as opposed to at least M elements) before spilling again. You can do this by acquiring at least N bytes from shuffleMemoryManager at a time. What do you think?
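To make the byte-based throttling idea above concrete, here is a minimal, self-contained sketch. The MemoryManager trait, the class name, and the 100 MB chunk size are stand-ins for illustration, not Spark's actual ShuffleMemoryManager or Spillable APIs:

```scala
// Stand-in for a memory manager whose grant method guards a synchronized block.
trait MemoryManager {
  /** Try to reserve up to `numBytes`; returns the number of bytes actually granted. */
  def tryToAcquire(numBytes: Long): Long
}

// Hypothetical Spillable-like helper that reserves memory in large chunks, so it
// neither hits the memory manager on every element nor produces tiny spill files.
class ChunkedSpillCheck(memoryManager: MemoryManager,
                        chunkBytes: Long = 100L * 1024 * 1024) {
  private var reservedBytes = 0L

  /** Returns true if the caller should spill its in-memory contents to disk. */
  def maybeSpill(estimatedSizeBytes: Long): Boolean = {
    if (estimatedSizeBytes <= reservedBytes) {
      false
    } else {
      // Ask for a whole chunk rather than just the shortfall: spills then
      // happen at most once per `chunkBytes` of growth.
      reservedBytes += memoryManager.tryToAcquire(chunkBytes)
      val shouldSpill = estimatedSizeBytes > reservedBytes
      if (shouldSpill) {
        reservedBytes = 0L // caller is expected to release the reservation after spilling
      }
      shouldSpill
    }
  }
}
```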
@andrewor14 what do you think about the comments from @mingyukim? I also want to note that this should not need to be set to another value in the majority of cases. It should only need to be tweaked in the convoluted case where individual RDD items are large. I wouldn't expect it to be set unless the user is hitting out-of-memory errors, and if we state this explicitly in the documentation, users should be deterred from abusing the setting. I do agree that a more robust approach to determining when to spill is ideal, if we can have that.
I wonder if we could improve these heuristics instead of adding user flags (for many users it might be hard to figure out how to set these). The heuristic of skipping the first 1000 entries assumes that the entries are small. I actually don't see what we gain by skipping those entries, since we still perform the sampling during this time (the sampling is the only really expensive part of these heuristics). @andrewor14, do you remember why this is there? Also, maybe for the first 32 elements we could check every element, then fall back to checking every 32 elements. This would handle the case where there are some extremely large objects.
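A minimal sketch of the gating condition described in the comment above; the function name is an illustrative assumption and this is not the code actually in Spillable:

```scala
// Check on every one of the first 32 elements (to catch streams of huge
// objects early), then only on every 32nd element afterwards.
def shouldCheckForSpill(elementsRead: Long): Boolean =
  elementsRead <= 32 || elementsRead % 32 == 0
```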
If every single object is large, though, then after we've spilled on the 32nd object there could still be an OOM before we check for spilling again, right? I don't know if there's a completely OOM-proof solution aside from checking for spilling on every single element.
@pwendell, I mentioned this above, but at a high level, wouldn't it be better to control the frequency of spills by how much memory you acquire from the shuffle memory manager at a time than by how often you check whether a spill is needed? We can test out your proposal in the specific case we have if that's a less risky option.
@mccheah @mingyukim yeah, there isn't an OOM-proof solution at all, because these are all heuristics. Even checking every element is not OOM-proof, since memory estimation is itself a heuristic that involves sampling. My only concern with exposing knobs here is that users will expect us to support them going forward, even though we may want to refactor this in the future in a way where those knobs don't make sense anymore. It's reasonable that users would consider it a regression if their tuning of those knobs stopped working. So if possible, it would be good to adjust our heuristics to meet a wider range of use cases, and then if we keep hearing about more issues we can expose knobs. We can't have them meet every possible use case, since they are heuristics, but in this case I was wondering if we could have a strict improvement to the heuristics. @andrewor14 can you comment on whether this is indeed a strict improvement? One of the main benefits of the new data frames API is that we will be able to have precise control over memory usage in a way that can avoid OOMs entirely. But for the current Spark API we are using this more ad-hoc memory estimation along with some heuristics. I'm not 100% against exposing knobs either, but I'd be interested to see whether some simple improvements fix your use case.
Thanks for the response. To be clear, I understand the hesitation with exposing knobs. My proposal was to throttle the frequency of spills by how much memory is acquired from the shuffle memory manager at a time (e.g. if you ask for 100 MB at a time, you won't have spill files smaller than 100 MB), but I understand that this will also need some tuning depending on the executor heap size. That said, I'm checking whether your simpler proposal of effectively dropping the 1000-element minimum works for our use case.
@mingyukim @mccheah I think the original motivation for the 1000-element minimum was to avoid the overhead of spill checks for small collections that will never need to spill. However, simply removing the threshold won't solve the general case. There could be a burst of large items in the middle of the stream, in which case it will still buffer too many (32) of them in memory. One potential fix for this is to check every N bytes as opposed to every X elements. This is cheap because we already have an estimate of the map size before and after an element is inserted.
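Here is a minimal, self-contained sketch of the "check every N bytes" idea, based on how much the collection's estimated size has grown since the last check. The class name and the byte interval are assumptions for illustration; this is not Spark's actual implementation (that work was later tracked separately, see the JIRA filed below):

```scala
// Hypothetical sketch: trigger a spill check whenever the collection's
// estimated size has grown by at least `checkIntervalBytes` since the
// last check, instead of counting elements.
class BytesBasedSpillCheck(checkIntervalBytes: Long = 32L * 1024 * 1024) {
  private var sizeAtLastCheck = 0L

  /** Call after each insert with the collection's current estimated size in bytes. */
  def shouldCheckForSpill(estimatedSizeBytes: Long): Boolean = {
    if (estimatedSizeBytes - sizeAtLastCheck >= checkIntervalBytes) {
      sizeAtLastCheck = estimatedSizeBytes
      true
    } else {
      false
    }
  }
}
```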
In case my previous message was too long, here's the TL;DR: if you remove the 1000-element minimum, a burst of large items can still buffer up to 32 of them in memory before the next check, whereas checking every N bytes handles that case as well.
It sounds to me like checking every N bytes is the most robust approach. We can implement this immediately if that's the right fix. I think this is what @mingyukim suggested. How does that sound, @mingyukim @pwendell @andrewor14?
Great. Checking every N bytes is very similar to what I was suggesting. We'd be happy to do that now. What would you use for the value of N? Would you make it configurable?
I would not make it configurable, for the same reason. I don't have a great sense of what the value should be off the top of my head. It would be good to deduce it from empirical data collected from real workloads with varying requirements (a few large items, many large items, mostly small items, very few small items, etc.). By the way, if you intend to do that, please also change the description and the title of the PR.
Okay, we need this soon on our side, so we will do the short-term fix of removing the sample memory threshold and not make anything configurable. We will open another Spark ticket to do this more robustly by counting bytes, and push another pull request for that as a more permanent solution.
Force-pushed from 84afd10 to 6e2509f
We found that if we only start checking for spilling after reading 1000 elements in a Spillable, we can run out of memory when there are fewer than 1000 elements but each of those elements is very large. There is no real need to check for spilling only after 1000 elements have been read. It is still necessary, however, to mitigate the cost of entering a synchronized block. In practice, it turns out that checking for spilling only once every 32 items is sufficient, without needing the minimum-elements threshold.
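In terms of the gating condition, the change described above amounts to something like the following sketch (names are illustrative, not the exact identifiers in Spillable):

```scala
// Before: skip all spill checks until 1000 elements have been read,
// then check only on every 32nd element.
def shouldCheckBefore(elementsRead: Long): Boolean =
  elementsRead > 1000 && elementsRead % 32 == 0

// After: drop the 1000-element minimum and keep the every-32-elements check,
// so very large early elements can still trigger a spill.
def shouldCheckAfter(elementsRead: Long): Boolean =
  elementsRead % 32 == 0
```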
Test build #27759 has started for PR 4420 at commit
Jenkins, test this please.
Test build #27760 has started for PR 4420 at commit
Ok, LGTM. I will merge this once tests pass. Could you file another JIRA for the byte-based tracking?
Filed https://issues.apache.org/jira/browse/SPARK-5915 for the follow-up. Thanks, Patrick and Andrew, for the discussions!
Test build #27759 has finished for PR 4420 at commit
Test PASSed.
In the general case, Spillable's heuristic of checking for memory stress on every 32nd item after 1000 items are read is good enough. In general, we do not want to enact the spilling checks until later in the job; checking for disk spilling too early can produce an unacceptable performance impact in trivial cases.

However, there are non-trivial cases, particularly if each serialized object is large, where checking for the necessity to spill too late would allow the memory to overflow. Consider if every item is 1.5 MB in size and the heap size is 1000 MB. Then clearly, if we only try to spill the in-memory contents to disk after 1000 items are read, we would have already accumulated 1500 MB of data in memory and overflowed the heap.

Patch #3656 attempted to circumvent this by checking the need to spill on every single item read, but that would cause unacceptable performance in the general case. However, the convoluted cases above should not be forced to refactor their data into smaller items. Therefore it makes sense for the memory spilling thresholds to be configurable.

Author: mcheah <mcheah@palantir.com>

Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits:

6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check
Ok, I merged this into master, 1.3, and 1.2. Thanks, guys.
Test build #27760 has finished for PR 4420 at commit
Test PASSed.
Hi all, I will revert this in branch-1.3 for now because it was technically merged after the first RC, and people may have already run performance tests against it. Per @pwendell, it's safer to backport this later than to potentially introduce a regression. I will mark this as such on the JIRA.
That's fine with us. I'm not entirely familiar with the branching model here, but does that mean this will be merged into branch-1.3 after 1.3.0 is released, and branch-1.3 will later be merged into master?
Actually, I discussed this with @pwendell further offline. We should avoid changing the performance between 1.3.0 and future 1.3.x releases in such a subtle way, as it may cause regressions in a few cases. Therefore it's best to just merge this change in 1.4.0 and revert the changes in all older branches. I will do so and update the JIRA accordingly.