MetadataCleaner - fine control cleanup documentation #89

Closed · wants to merge 3 commits
92 changes: 82 additions & 10 deletions docs/configuration.md
@@ -353,16 +353,6 @@ Apart from these, the following properties are also available, and may be useful
Port for the driver to listen on.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl</td>
<td>(infinite)</td>
<td>
Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
</td>
</tr>
<tr>
<td>spark.streaming.blockInterval</td>
<td>200</td>
@@ -487,6 +477,88 @@ Apart from these, the following properties are also available, and may be useful
</tr>
</table>


The following properties can be used to schedule metadata cleanup at different granularities. Each component-specific property below takes precedence over the global `spark.cleaner.ttl`. Set these tuning parameters with care, and only where required: triggering a metadata cleanup in the middle of a job can cause a large amount of unnecessary re-computation. An illustrative configuration sketch follows the table below.

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td>spark.cleaner.ttl</td>
<td>(infinite)</td>
<td>
Duration (seconds) for which Spark will remember any metadata (stages generated, tasks generated, etc.).
Periodic cleanups ensure that metadata older than this duration is forgotten. This is
useful when running Spark for many hours or days (for example, running 24/7 in the case of Spark Streaming
applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.MAP_OUTPUT_TRACKER</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Cleans up the map that tracks, for each shuffle Id, the corresponding mapper information (the input block manager Id and the output result size).
</td>
Contributor: you might want to add that this takes precedence over spark.cleaner.ttl

Contributor: same for rest ...
</tr>
<tr>
<td>spark.cleaner.ttl.SHUFFLE_MAP_TASK</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Clears the cache used for shuffle map tasks (tasks in the earlier stages of a job): a map from stageId to the serialised byte array of the task.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.RESULT_TASK</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Clears the cache used to store result tasks (tasks in the final stage of a job): a map from stageId to the serialised byte array of the task.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.SPARK_CONTEXT</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Cleans up old persistent (cached) RDDs.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.HTTP_BROADCAST</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Cleans up broadcast files whose timestamps are older than the configured TTL.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.DAG_SCHEDULER</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Clears entries in the maps kept by the DAG scheduler (such as stageIdToStage, pendingTasks, and stageIdToJobIds) whose timestamps are older than the configured TTL.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.BLOCK_MANAGER</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Clears old non-broadcast blocks from memory.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.BROADCAST_VARS</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Clears old broadcast blocks from memory.
</td>
</tr>
<tr>
<td>spark.cleaner.ttl.SHUFFLE_BLOCK_MANAGER</td>
<td>spark.cleaner.ttl, with a min. value of 10 secs</td>
<td>
Deletes old shuffle files from disk (the physical files created by shuffle operations such as reduce tasks).
</td>
</tr>
</table>
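
As a rough illustration, the sketch below sets a global TTL and two component-specific overrides through JVM system properties, the mechanism this guide uses for other `spark.*` settings. The package name, master URL, application name, and TTL values are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.SparkContext  // package name assumed; adjust to your Spark version

object CleanerTtlExample {
  def main(args: Array[String]) {
    // Global fallback: forget metadata older than one hour (3600 s).
    System.setProperty("spark.cleaner.ttl", "3600")
    // Component-specific TTLs take precedence over spark.cleaner.ttl:
    // clean shuffle map-output metadata more aggressively (10 minutes)...
    System.setProperty("spark.cleaner.ttl.MAP_OUTPUT_TRACKER", "600")
    // ...but keep shuffle files on disk for the full hour.
    System.setProperty("spark.cleaner.ttl.SHUFFLE_BLOCK_MANAGER", "3600")

    // Properties must be set before the SparkContext is created.
    val sc = new SparkContext("local[2]", "CleanerTtlExample")
    // ... run the (possibly long-lived) streaming or batch job ...
    sc.stop()
  }
}
```

Because cleanup is time-based, pick TTLs comfortably longer than the longest interval over which a job may revisit old stages or cached RDDs; otherwise cleared metadata forces re-computation.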

## Viewing Spark Properties

The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
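
For a quick programmatic check, the snippet below (a sketch assuming Spark properties are plain JVM system properties, as above) lists every `spark.*` property set in the current JVM, roughly mirroring what the "Environment" tab reports:

```scala
import scala.collection.JavaConverters._

// Collect and sort all spark.* system properties in this JVM.
val sparkProps = System.getProperties.asScala
  .filter { case (key, _) => key.startsWith("spark.") }
  .toSeq
  .sortBy(_._1)

sparkProps.foreach { case (key, value) => println(key + "=" + value) }
```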