
[SPARK-4808] Removing minimum number of elements read before spill check #4420


Closed
wants to merge 1 commit

Conversation

mingyukim
Contributor

In the general case, Spillable's heuristic of checking for memory stress
on every 32nd item after 1000 items have been read is good enough. In
general, we do not want to start the spilling checks until later in the
job; checking for disk spilling too early can have an unacceptable
performance impact in trivial cases.

However, there are non-trivial cases, particularly when each serialized
object is large, where checking for the necessity to spill too late
would allow the memory to overflow. Consider a case where every item is
1.5 MB in size and the heap size is 1000 MB. Clearly, if we only try to
spill the in-memory contents to disk after 1000 items have been read, we
would already have accumulated 1500 MB in RAM and overflowed the heap.

Patch #3656 attempted to circumvent this by checking the need to spill
on every single item read, but that would cause unacceptable performance
in the general case. However, these convoluted cases should not force
jobs to be refactored to shrink their data items. Therefore it makes
sense for the memory spilling thresholds to be configurable.
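The schedule described above can be sketched as follows. This is a language-agnostic illustration with hypothetical names, not the actual Spillable source; it shows why, with 1.5 MB items, the first check fires only at the 1024th element, well past a 1000 MB heap:

```python
# Sketch of the current check schedule (hypothetical names, not Spark source).
TRACK_MEMORY_THRESHOLD = 1000  # minimum elements read before any check
CHECK_FREQUENCY = 32           # then check only on every 32nd element

def should_check(elements_read: int) -> bool:
    return (elements_read > TRACK_MEMORY_THRESHOLD
            and elements_read % CHECK_FREQUENCY == 0)

def memory_at_first_check(item_bytes: int) -> int:
    """Memory accumulated by the time the very first spill check fires."""
    n = 1
    while not should_check(n):
        n += 1
    return n * item_bytes
```

With `item_bytes = 1_500_000`, `memory_at_first_check` returns 1024 × 1.5 MB = 1536 MB, exceeding the 1000 MB heap before any check runs.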

@AmplabJenkins

Can one of the admins verify this patch?

@mingyukim
Contributor Author

This is following up @andrewor14's comments on #3656. It makes the threshold and frequency configurable rather than completely removing them. Please let me know if I should add documentation for these configurations as well!

@andrewor14
Contributor

Thanks @mingyukim. On second thought, I'm not so sure we should make these configurable, since doing so pushes the responsibility onto users to experiment with finding the optimal values for these super advanced tuning parameters themselves. I think we should explore other alternatives, like using the memory size as an additional heuristic, instead of introducing highly advanced configs that we can't remove in the future for backward compatibility reasons.

@mingyukim
Contributor Author

Can you elaborate on the "memory size as an additional heuristic" idea? This is consistently causing OOMs in one of our workflows, which is exactly what spilling to disk is supposed to handle. I'm happy to work on it on my end if you have suggestions.

A few ideas off the top of my head are,

  • Have a threshold on {currentMemory - myMemoryThreshold} value so it tries to spill if the difference gets big enough.
  • In fact, why not remove the entire threshold check, just as originally suggested in [SPARK-4808] Remove Spillable minimum threshold and sampling #3656? You can tweak how often spills happen by setting a minimum on the amount of memory you request from ShuffleMemoryManager at a time. Then you're guaranteed that the spill files are not too small. Still getting too many files? That's unavoidable: your shuffle is really big, so you'd have to spill a lot; otherwise, your JVM will OOM. Basically, I don't think trackMemoryThreshold and trackMemoryFrequency are the right way to control spill frequency or spill file size, since they can lead to OOMs when each element is large.
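The first idea in the list above might look like the following. This is a rough sketch with hypothetical names (`current_memory`, `my_memory_threshold`, `max_overshoot` are illustrative, not Spark's API): spill once the gap between memory used and memory granted grows past a fixed bound.

```python
# Sketch of the overshoot-based trigger (hypothetical names, not Spark's API).
def should_spill(current_memory: int, my_memory_threshold: int,
                 max_overshoot: int) -> bool:
    # Spill when usage exceeds the granted reservation by more than the bound.
    return current_memory - my_memory_threshold > max_overshoot
```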

@srowen
Member

srowen commented Feb 9, 2015

ok to test

@SparkQA

SparkQA commented Feb 9, 2015

Test build #27095 has started for PR 4420 at commit 84afd10.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 9, 2015

Test build #27095 has finished for PR 4420 at commit 84afd10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27095/

@andrewor14
Contributor

@mingyukim we can't just remove the check because this is a hot code path. Otherwise we'd enter a synchronized block inside tryToAcquire on every single element (not partition), and that's really expensive. I think a threshold that looks something like currentMemory - myMemoryThreshold might make sense as a simple fix.

@mingyukim
Contributor Author

To be clear, I don't mean to spill on every element. I mean to do it only when currentMemory > myMemoryThreshold. When currentMemory > myMemoryThreshold, you have two options: 1) don't try spilling and wait until you have accumulated enough elements since the last spill, which is what it's doing now, or 2) try spilling in order to avoid potential OOMs. Since spill is a feature that prevents executors from OOMing, I believe 2 is the right way.

I agree with you that spills shouldn't happen too frequently. However, the number of elements is the wrong proxy for controlling the frequency of spills, because you never know how large each element will be. The number of bytes is a better proxy: have the Spillable data structure accumulate at least N bytes worth of elements (as opposed to at least M elements) before spilling again. You can do this by acquiring at least N bytes from shuffleMemoryManager at a time.

What do you think?
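The bytes-based throttling proposed in this comment could be sketched like this. All names here are illustrative; `try_to_acquire` stands in for ShuffleMemoryManager's acquisition call, and this is not the actual Spark implementation:

```python
# Sketch: always acquire at least `min_acquire` bytes per request from the
# memory manager, so consecutive spills are spaced at least that far apart.
class BytesThrottleSketch:
    def __init__(self, min_acquire, try_to_acquire):
        self.min_acquire = min_acquire      # minimum bytes per acquisition
        self.try_to_acquire = try_to_acquire  # stand-in for the memory manager
        self.granted = 0                    # bytes reserved so far
        self.used = 0                       # estimated bytes held in memory

    def add_element(self, size):
        """Record an inserted element; return True if the caller should spill."""
        self.used += size
        if self.used <= self.granted:
            return False
        # Over our reservation: ask for more, but never less than min_acquire.
        request = max(self.min_acquire, self.used - self.granted)
        self.granted += self.try_to_acquire(request)
        if self.used > self.granted:  # manager couldn't grant enough: spill
            self.used = 0             # (releasing the reservation is omitted)
            self.granted = 0
            return True
        return False
```

Because each acquisition is at least `min_acquire` bytes, a spill can only be triggered after at least that much memory has been consumed since the last one, bounding the minimum spill-file size.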

@mccheah
Contributor

mccheah commented Feb 16, 2015

@andrewor14 what do you think about the comments from @mingyukim ?

I also want to note that this should not need to be set to another value in a majority of cases. This should only need to be tweaked in the convoluted case where individual RDD items are large. I wouldn't expect this to be set unless the user is hitting out-of-memory errors. If we explicitly state this in the documentation then users should be deterred from abusing the setting.

I do agree that a more robust approach to determining when to spill is ideal, if we can have that.

@pwendell
Contributor

I wonder if maybe we could improve these heuristics instead of adding user flags (for many users it might be hard to figure out how to set these). The heuristic of skipping the first 1000 entries assumes that the entries are small - I actually don't see what we gain by skipping those entries, since we still perform the sampling during this time (the sampling is the only really expensive part of these heuristics). @andrewor14, do you remember why this is there?

Also, maybe for the first 32 elements we could check every element, then fall back to checking every 32 elements. This would handle the case where there are some extremely large objects.
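The hybrid schedule suggested above might look like this (illustrative sketch only, not Spark code): check on every one of the first 32 elements, then fall back to every 32nd element, so a run of very large objects at the start is caught immediately.

```python
# Sketch of the hybrid schedule: check every element for the first 32 reads,
# then only every 32nd element thereafter.
def should_check(elements_read: int) -> bool:
    return elements_read <= 32 or elements_read % 32 == 0
```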

@mccheah
Contributor

mccheah commented Feb 18, 2015

If every single object is large, though, then even after we've spilled on the 32nd object, there could still be an OOM before we check for spilling again, right? I don't know if there's a completely OOM-proof solution, aside from checking for spilling on every single element.

@mingyukim
Contributor Author

@pwendell, I mentioned above, but at a high level, wouldn't it be better to control the frequency of spills by how much memory you acquire from the shuffle memory manager at a time than by how often you check if spill is needed?

We can test out your proposal in the specific case we have if that's a less risky option.

@pwendell
Contributor

@mccheah @mingyukim yeah, there isn't an OOM-proof solution at all, because these are all heuristics. Even checking every element is not OOM-proof, since memory estimation is itself a heuristic that involves sampling. My only concern with exposing knobs here is that users will expect us to support them going forward, even though we may want to refactor this in the future in a way where those knobs no longer make sense. It's reasonable that users would consider it a regression if their tuning of those knobs stopped working.

So if possible, it would be good to adjust our heuristics to meet a wider range of use cases and then if we keep hearing more issues we can expose knobs. We can't have them meet every possible use case, since they are heuristics, but in this case I was wondering if we could have a strict improvement to the heuristics. @andrewor14 can you comment on whether this is indeed a strict improvement?

One of the main benefits of the new data frames API is that we will be able to have precise control over memory usage in a way that can avoid OOM's ever. But for the current Spark API we are using this more ad-hoc memory estimation along with some heuristics.

I'm not 100% against exposing knobs either, but I'd be interested if some simple improvements fix your use case.

@mingyukim
Contributor Author

Thanks for the response. To be clear, I understand the hesitation with exposing knobs. So, my proposal was to throttle the frequency of spills by how much memory is acquired from the shuffle memory manager at a time (e.g. if you ask 100MB at a time, you won't have spill files smaller than 100MB), but I understand that this will also need some tuning depending on the executor heap size.

That said, I'm trying to check if your simpler proposal of effectively setting trackMemoryThreshold=0 will fix the particular workflow we have. If that fixes our problem and @andrewor14 says this is good to go, I'm fine with this as a fix for now.

@andrewor14
Contributor

@mingyukim @mccheah
Just to clarify, the reason we have the every N elements check is not to avoid spilling every element (unlikely), but to avoid entering the synchronized block every element. The latter is potentially expensive if we have multiple threads contending for the shuffle memory manager's attention at the same time.

I think the original motivation for trackMemoryThreshold is to optimize the case with very small partitions. In such cases we currently don't even try to enter the synchronized block at all, the potential cost of which could be a non-trivial fraction of the relatively short aggregation time. However, this is all conjecture and I don't believe that this optimization is crucial for such workloads, so I don't see a particular reason to keep this threshold if removing it means that we can handle large items.

However, simply removing the threshold won't solve the general case. There could be a burst of large items in the middle of the stream, in which case it will still buffer too many (32) of them in memory. One potential fix for this is to check every N bytes as opposed to every X elements. This is cheap because we already have an estimate of the map size before and after an element is inserted.
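The every-N-bytes check described in the last paragraph might look like the following sketch (hypothetical names, not actual Spark code): trigger a spill check whenever the estimated collection size has grown by a fixed number of bytes since the last check, using the size estimate that already exists.

```python
# Sketch: check every `interval_bytes` of estimated growth instead of every
# X elements, so a burst of large items triggers a check promptly.
class ByteIntervalCheck:
    def __init__(self, interval_bytes):
        self.interval_bytes = interval_bytes
        self.size_at_last_check = 0

    def should_check(self, estimated_size):
        """Call with the current size estimate after each insert."""
        if estimated_size - self.size_at_last_check >= self.interval_bytes:
            self.size_at_last_check = estimated_size
            return True
        return False
```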

@andrewor14
Contributor

In case my previous message was too long, here's the TL;DR:

If you remove the trackMemoryThreshold here, I will merge this patch. However, for now we need to keep some notion of spacing out the checks (i.e. every 32 items) for performance reasons. In a separate patch, it may be worthwhile to check every N bytes instead.

@mccheah
Contributor

mccheah commented Feb 19, 2015

It sounds to me that checking every N bytes is the most robust approach. We can implement this immediately if that's the right fix. I think this is what @mingyukim suggested. How does that sound, @mingyukim @pwendell @andrewor14?

@mingyukim
Contributor Author

Great. Checking every N bytes is very similar to what I was suggesting. We'd be happy to do that now. What would you use for the value N? Would you make it configurable?

@andrewor14
Contributor

I would not make it configurable, for the same reason. I don't have a great sense of what the value should be off the top of my head. It would be good to deduce it from empirical data collected from real workloads of varying requirements (a few large items, many large items, mostly small items, very few small items, etc.). By the way, if you intend to do that, please also update the title and description of this PR.

@mccheah
Contributor

mccheah commented Feb 20, 2015

Okay we need this soon on our side, so we will do the short term fix of removing the sample memory threshold and not make anything configurable. We will open another Spark ticket to do this more robustly by counting bytes, and push another pull request for that as a more permanent solution.

@mccheah mccheah force-pushed the memory-spill-configurable branch from 84afd10 to 6e2509f Compare February 20, 2015 00:45
We found that if we only start checking for spilling after reading 1000
elements in a Spillable, we would run out of memory when there are fewer
than 1000 elements but each of those elements is very large.

There is no real need to check for spilling only after 1000 elements
have been read. It is still necessary, however, to mitigate the cost of
entering a synchronized block. In practice, it turns out that checking
for spilling only on every 32nd item is sufficient, without the
minimum-elements threshold.
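The net effect of the commit can be sketched as a before/after of the check condition (names illustrative, not the actual source): the 1000-element minimum is dropped, while the every-32-elements spacing is kept.

```python
# Before: no check until 1000 elements have been read, then every 32nd.
def should_check_before(n: int) -> bool:
    return n > 1000 and n % 32 == 0

# After: check every 32nd element from the start.
def should_check_after(n: int) -> bool:
    return n % 32 == 0
```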
@mingyukim mingyukim changed the title [SPARK-4808] Configurable spillable memory threshold + sampling rate [SPARK-4808] Removing minimum number of elements read before spill check Feb 20, 2015
@SparkQA

SparkQA commented Feb 20, 2015

Test build #27759 has started for PR 4420 at commit 6e2509f.

  • This patch merges cleanly.

@mccheah
Contributor

mccheah commented Feb 20, 2015

Jenkins, test this please.

@SparkQA

SparkQA commented Feb 20, 2015

Test build #27760 has started for PR 4420 at commit 6e2509f.

  • This patch merges cleanly.

@andrewor14
Contributor

Ok, LGTM. I will merge this once tests pass. Could you file another JIRA for the tracking based on bytes thing?

@mingyukim
Contributor Author

Filed https://issues.apache.org/jira/browse/SPARK-5915 for the follow-up.

Thanks, Patrick and Andrew, for the discussions!

@mingyukim mingyukim closed this Feb 20, 2015
@mingyukim mingyukim reopened this Feb 20, 2015
@SparkQA

SparkQA commented Feb 20, 2015

Test build #27759 has finished for PR 4420 at commit 6e2509f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27759/

asfgit pushed a commit that referenced this pull request Feb 20, 2015
Author: mcheah <mcheah@palantir.com>

Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits:

6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check
@asfgit asfgit closed this in 3be92cd Feb 20, 2015
asfgit pushed a commit that referenced this pull request Feb 20, 2015
(cherry picked from commit 3be92cd)
Signed-off-by: Andrew Or <andrew@databricks.com>
@andrewor14
Contributor

Ok, I merged this into master, 1.3, and 1.2. Thanks guys.

@SparkQA

SparkQA commented Feb 20, 2015

Test build #27760 has finished for PR 4420 at commit 6e2509f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27760/

@andrewor14
Contributor

Hi all, I will revert this in branch-1.3 for now, because it was technically merged after the first RC and people may have already run performance tests on it. Per @pwendell, it's safer to just backport this later than to potentially introduce a regression. I will mark this as such on the JIRA.

@mingyukim
Contributor Author

That's fine with us. I'm not entirely familiar with the branching model here, but does that mean this will be merged into branch-1.3 after 1.3.0 is released, and branch-1.3 will later be merged into master?

@andrewor14
Contributor

Actually, I discussed this with @pwendell more offline. We should avoid changing the performance between 1.3.0 and future 1.3.x releases in such a subtle way, as it may cause regressions in a few cases. Therefore it's best to just merge this change in 1.4.0 and revert the changes in all older branches. I will do so and update the JIRA accordingly.

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Feb 24, 2015
mingyukim pushed a commit to palantir/spark that referenced this pull request Mar 6, 2015
mccheah added a commit to palantir/spark that referenced this pull request Mar 27, 2015
mccheah added a commit to palantir/spark that referenced this pull request May 15, 2015