[SPARK-27468][core] Track correct storage level and mem/disk usage for RDDs.
Two things are being fixed here. The first, explicitly explained in the
referenced bug, is the storage level tracked for RDDs and their partitions.
Previously, the RDD-level storage level would change depending on the status
reported by executors for the blocks they were storing, and individual blocks
would reflect that same value. That is wrong because different blocks may be
stored differently on different executors.
So now the RDD tracks the user-provided storage level, while the individual
partitions reflect the current storage level of that particular block,
including the current number of replicas.
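The split described above can be sketched roughly as follows. This is a minimal model, not the actual `AppStatusListener` code; the type names `LiveRDD` and `LiveRDDPartition` echo the listener's live entities, but the `StorageLevel` case class and method names here are simplified stand-ins:

```scala
// Hypothetical simplified model: the RDD keeps the level the user requested,
// while each partition tracks the level actually observed for its block,
// including the current replica count.
case class StorageLevel(useMemory: Boolean, useDisk: Boolean, replication: Int)

class LiveRDDPartition(var storageLevel: StorageLevel)

class LiveRDD(val requestedLevel: StorageLevel) {
  private val partitions =
    scala.collection.mutable.HashMap[Int, LiveRDDPartition]()

  // A block update only changes the partition's observed level; the
  // RDD-level value stays fixed at what the user asked for.
  def onBlockUpdate(partitionId: Int, observed: StorageLevel): Unit =
    partitions.getOrElseUpdate(partitionId, new LiveRDDPartition(observed))
      .storageLevel = observed

  def partitionLevel(partitionId: Int): Option[StorageLevel] =
    partitions.get(partitionId).map(_.storageLevel)
}
```

Even if an executor reports a block spilled to disk with a single replica, the RDD still advertises the user-provided level.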
The second fix is in the accounting of usage: block managers report the
current mem and disk used by the block, not the change from before. So
the status listener needs to track the previous usage of the blocks so
that it can accurately calculate the changes. This requires a bit more
memory in the driver, but tests show the overhead is small (a few MB for
a 100k-partition RDD with all blocks cached).
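The delta accounting can be sketched like this. It is an assumed shape, not the patch's actual code: the listener-side tracker remembers the last absolute usage reported per block and applies only the difference to the RDD totals:

```scala
// Hypothetical sketch of delta accounting: executors report a block's
// current (absolute) mem/disk usage, so we keep the previously seen values
// and add only the difference to the running totals.
class RddUsageTracker {
  // Last reported usage per block; unseen blocks default to 0.
  private val prevMem  = scala.collection.mutable.HashMap[Int, Long]().withDefaultValue(0L)
  private val prevDisk = scala.collection.mutable.HashMap[Int, Long]().withDefaultValue(0L)
  var totalMem  = 0L
  var totalDisk = 0L

  // Convert the absolute report into a delta before applying it.
  def update(blockId: Int, mem: Long, disk: Long): Unit = {
    totalMem  += mem  - prevMem(blockId)
    totalDisk += disk - prevDisk(blockId)
    prevMem(blockId)  = mem
    prevDisk(blockId) = disk
  }
}
```

Without the `prevMem`/`prevDisk` bookkeeping, a block reported twice (say at 100 bytes and then 150 bytes) would be double-counted as 250 instead of 150.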
Some internal accounting was changed to save some memory, given the extra
usage incurred by the above tracking.
For reference, mem usage comparison (captured using jvisualvm) for 100k entries:
- Scala HashMap[String, LiveRDDBlock]: 17MB
- Scala HashMap[Int, LiveRDDBlock]: 11MB
- OpenHashMap[Int, LiveRDDBlock]: 6MB
- OpenHashMap[String, LiveRDDBlock]: 14MB
So using an OpenHashMap when you have primitive keys saves a lot of space.
When the keys are non-primitive, the savings don't add up to much, so maps
that need string keys were left untouched.
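The saving comes from avoiding boxing: a generic `HashMap[Int, _]` wraps every key in a `java.lang.Integer` object, while a map specialized on primitive keys stores them unboxed. Spark's internal `OpenHashMap` isn't on the standard classpath, so the sketch below uses the stdlib's `mutable.LongMap` (which is specialized on `Long` keys) to illustrate the same principle:

```scala
import scala.collection.mutable

// Generic map: every Long key is boxed into a java.lang.Long on insert.
val boxed = mutable.HashMap[Long, String]()

// Specialized map: keys are kept as raw longs in a primitive array,
// which is the same trick Spark's OpenHashMap plays for Int keys.
val unboxed = mutable.LongMap[String]()

for (i <- 0L until 1000L) {
  boxed(i) = s"v$i"
  unboxed(i) = s"v$i"
}
```

Behavior is identical; only the per-entry memory footprint differs, which is why the patch switched only the primitive-keyed maps.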
The unit tests were also changed to reflect the actual behavior of the
block manager when sending update events to the driver.