
[SPARK-10983] Unified memory manager #9084


Closed
andrewor14 wants to merge 20 commits from andrewor14:unified-memory-manager

Conversation

andrewor14
Contributor

This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:

  • spark.memory.fraction (default 0.75): fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
  • spark.memory.storageFraction (default 0.5): size of the storage region within the space set aside by spark.memory.fraction. Cached data may only be evicted if total storage exceeds this region.
  • spark.memory.useLegacyMode (default false): whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.

For a detailed description of the design, see SPARK-10000. This patch builds on top of the MemoryManager interface introduced in #9000.
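As a rough illustration of how these settings fit together (a sketch only; the 4 GB heap is an arbitrary example, and the arithmetic simply mirrors the description above):

```scala
import org.apache.spark.SparkConf

// Sketch only: how the configs above carve up an example 4 GB executor heap.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.75")        // execution + storage share
  .set("spark.memory.storageFraction", "0.5")  // protected storage region
  .set("spark.memory.useLegacyMode", "false")  // use the unified manager

val heapBytes = 4L * 1024 * 1024 * 1024
// Memory available to execution + storage combined.
val unifiedBytes = (heapBytes * conf.get("spark.memory.fraction").toDouble).toLong
// Storage below this watermark is never evicted by execution.
val storageRegionBytes =
  (unifiedBytes * conf.get("spark.memory.storageFraction").toDouble).toLong
```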

Andrew Or added 15 commits October 9, 2015 13:32
As of this commit, acquiring execution memory may cause cached
blocks to be evicted. There are still a couple of TODOs, notably
figuring out what the `maxExecutionMemory` and `maxStorageMemory`
should actually be. They may no longer be fixed values, since
either side can now borrow from the other.
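A minimal sketch of the borrowing behavior this commit introduces (all names here are illustrative stand-ins, not the actual Spark internals):

```scala
// Sketch of unified borrowing; names are stand-ins for the real internals.
class UnifiedSketch(maxMemory: Long, storageRegionSize: Long) {
  private var executionMemoryUsed = 0L
  private var storageMemoryUsed = 0L

  // Stand-in for MemoryStore eviction: drop cached blocks worth `bytes`.
  private def evictBlocksToFreeSpace(bytes: Long): Unit = {
    storageMemoryUsed -= math.min(bytes, storageMemoryUsed)
  }

  def acquireExecutionMemory(numBytes: Long): Long = synchronized {
    val free = maxMemory - executionMemoryUsed - storageMemoryUsed
    if (free < numBytes) {
      // Execution may evict storage, but only above the protected region.
      val reclaimable = math.max(0L, storageMemoryUsed - storageRegionSize)
      evictBlocksToFreeSpace(math.min(reclaimable, numBytes - free))
    }
    val granted =
      math.min(numBytes, maxMemory - executionMemoryUsed - storageMemoryUsed)
    executionMemoryUsed += granted
    granted // may be less than requested; the caller then spills
  }
}
```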
Without this, we cannot avoid both deadlocks and race conditions,
because the individual components (ShuffleMemoryManager,
MemoryStore, and MemoryManager) each have their own locks.

This commit allows us to simplify several unintuitive control
flows that were introduced to avoid acquiring locks in different
orders. Since we have only one lock now, these code blocks can
be significantly simplified.

A foreseeable downside is that parallelism is reduced, but memory
acquisitions and releases don't actually occur that frequently, so
performance should not suffer noticeably. Further investigation
should follow.
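A sketch of the single-lock discipline described above (simplified; the real classes carry much more state): every component synchronizes on the one shared monitor rather than its own lock, so there is no lock ordering to get wrong.

```scala
// Sketch: one monitor guards all memory bookkeeping, so components can
// never acquire locks in conflicting orders.
class SingleLockSketch(totalBytes: Long) {
  private var available = totalBytes

  def acquire(numBytes: Long): Unit = synchronized {
    while (available < numBytes) {
      wait() // blocks on this single shared monitor
    }
    available -= numBytes
  }

  def release(numBytes: Long): Unit = synchronized {
    available += numBytes
    notifyAll() // wakes any task blocked in acquire(), same monitor
  }
}
```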
Previously, ShuffleMemoryManager would allow each task to acquire
up to 1/N of the entire storage + execution region. What we want
is more like 1/N of the space not occupied by storage, since the
"max" now varies over time.
This happened because we were still calling `this.notifyAll()`
in ShuffleMemoryManager when we were holding the `memoryManager`
lock. Easy fix.
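Presumably the fix looks something like the following sketch; calling `notifyAll()` on an object whose monitor the thread does not hold throws `IllegalMonitorStateException`:

```scala
// Sketch of the fix: notify on the monitor we actually hold.
class ShuffleMemoryManagerSketch(memoryManager: AnyRef) {
  def releaseMemory(): Unit = memoryManager.synchronized {
    // Bug: `this.notifyAll()` here throws IllegalMonitorStateException,
    // because the thread holds memoryManager's monitor, not its own.
    memoryManager.notifyAll() // fix: wake waiters on the shared monitor
  }
}
```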
Tests are passing in this commit, but there are follow-ups that
need to be done (see TODOs added in this commit). More tests will
be added in the future.
As of this commit, all *MemoryManagerSuites are documented and
pass their tests.
TaskContext was not stubbed correctly in ShuffleMemoryManagerSuite.
In particular, this patch added code that updates the active
TaskMetrics, but this was not part of the mocked TaskContext object.
@SparkQA

SparkQA commented Oct 13, 2015

Test build #43602 has finished for PR 9084 at commit 01ff533.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Deprecation message for memory fraction configs used in the old memory management model
private val deprecatedMemoryFractionMessage =
  "As of Spark 1.6, execution and storage memory management are unified. " +
  "All memory fractions used in the old model are now deprecated and no longer read."
Contributor

The old configurations will still be respected in legacy mode, so this is slightly ambiguous / confusing. Is there an easy way to avoid the warning if the legacy mode configuration is turned on? If not, I suppose we could just expand the deprecation message to mention this corner case, perhaps by just appending a "(unless spark.XX.YY is enabled)" at the end.

Contributor Author

I can print a different warning if legacy mode is on.
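A sketch of what that might look like (hypothetical helper, not the merged code; `logWarning` comes from Spark 1.x's `Logging` trait):

```scala
import org.apache.spark.{Logging, SparkConf}

// Hypothetical helper: warn differently depending on legacy mode.
object MemoryConfigDeprecation extends Logging {
  def warnAboutMemoryFractions(conf: SparkConf): Unit = {
    if (conf.getBoolean("spark.memory.useLegacyMode", defaultValue = false)) {
      logWarning("spark.memory.useLegacyMode is enabled; the old memory " +
        "fraction configs are still read, but are deprecated.")
    } else {
      logWarning("As of Spark 1.6, execution and storage memory management " +
        "are unified. All memory fractions used in the old model are now " +
        "deprecated and no longer read.")
    }
  }
}
```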

@SparkQA

SparkQA commented Oct 13, 2015

Test build #43607 has finished for PR 9084 at commit 059ae0d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -72,46 +92,62 @@ private[spark] abstract class MemoryManager {
def acquireUnrollMemory(
Contributor

Given that acquireUnrollMemory appears to act as a synonym for acquireStorageMemory in the current implementation, it might be worth adding a brief comment above this method to explain that it exists to give us the flexibility to account for unroll memory differently in the future.

Contributor Author

Actually, it's more than a synonym: in StaticMemoryManager it's required to preserve the existing behavior, where unrolling doesn't evict all the blocks.
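A sketch of the relationship being discussed (simplified signatures, not the real `MemoryManager` API):

```scala
// In the unified manager, unroll memory is accounted for as storage memory
// today; keeping a separate method preserves the flexibility to account
// for it differently later.
trait MemoryManagerSketch {
  def acquireStorageMemory(blockId: String, numBytes: Long): Boolean

  // Effectively a synonym here, but StaticMemoryManager would override this
  // to preserve the old behavior where unrolling doesn't evict all blocks.
  def acquireUnrollMemory(blockId: String, numBytes: Long): Boolean =
    acquireStorageMemory(blockId, numBytes)
}
```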

@JoshRosen
Contributor

It looks like the test failures are occurring in the external sorter suites. Some of these tests exercise spilling-related logic (such as ensuring that spill files are cleaned up), and to induce spilling they configured the legacy memory fractions to be very small. To fix them, we can either enable legacy mode for those tests only or modify the tests to set the new configurations. Ideally these sorter unit tests wouldn't rely on memory management behavior to test spill cleanup, but they were written a long time ago. We could refactor them, but I'd prefer a band-aid solution for now.

Many of these tests relied on setting a very low shuffle memory
fraction. Since that config is now deprecated and no longer read,
the tests fail because nothing spills. The fraction-based approach
was quite brittle anyway, as it depends on the heap size of the
test JVM. Instead, we should explicitly set a limit on the memory
used.
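A sketch of that explicit limit (the exact config key here is an assumption, not necessarily what the fix uses):

```scala
import org.apache.spark.SparkConf

// Sketch: cap the memory the manager sees so that small test inputs spill
// regardless of the test JVM's heap size. The key name is an assumption.
val testConf = new SparkConf()
  .set("spark.testing.memory", (512 * 1024).toString) // 512 KB cap
```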
Andrew Or added 4 commits October 13, 2015 00:11
Unfortunately, not all affected tests can be easily rewritten.
Instead, this commit adds TODOs in places where the tests may
not actually do anything, so we can fix them in the future
(SPARK-11078).
@SparkQA

SparkQA commented Oct 13, 2015

Test build #43622 has finished for PR 9084 at commit 24a391c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2015

Test build #43631 has finished for PR 9084 at commit face129.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 13, 2015

Test build #43646 has finished for PR 9084 at commit face129.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2015

Test build #1890 has finished for PR 9084 at commit face129.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2015

Test build #1888 has finished for PR 9084 at commit face129.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2015

Test build #1889 has finished for PR 9084 at commit face129.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2015

Test build #1891 has finished for PR 9084 at commit face129.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

LGTM. I'm going to merge this now. There are some follow-up tasks in terms of refactoring some old spilling tests, but those have been filed and we'll do them soon.

@asfgit asfgit closed this in b3ffac5 Oct 13, 2015
@andrewor14 andrewor14 deleted the unified-memory-manager branch October 13, 2015 20:53
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Oct 14, 2015

Author: Andrew Or <andrew@databricks.com>

Closes apache#9084 from andrewor14/unified-memory-manager.
asfgit pushed a commit that referenced this pull request Oct 15, 2015
#9084 uncovered that many tests that test spilling don't actually spill. This is a follow-up patch to fix that to ensure our unit tests actually catch potential bugs in spilling. The size of this patch is inflated by the refactoring of `ExternalSorterSuite`, which had a lot of duplicate code and logic.

Author: Andrew Or <andrew@databricks.com>

Closes #9124 from andrewor14/spilling-tests.
@@ -40,6 +41,10 @@ private[spark] abstract class MemoryManager {
_memoryStore
}

// Amount of execution/storage memory in use, accesses must be synchronized on `this`
protected var _executionMemoryUsed: Long = 0
Why is this field protected?

Member

@pzz2011 you're making several comments on old PRs. Generally people won't see them, and this isn't the place for discussion anyway. If you can formulate a specific question beyond "why is the code this way?", ask on user@.
