@pranavdev022 pranavdev022 commented Oct 17, 2025

What changes were proposed in this pull request?

This PR optimizes memory management for cached local relations when cloning Spark sessions by implementing reference counting instead of data replication.

Current behavior:

  • When a session is cloned, cached local relation data stored in the block manager is replicated.
  • Each clone creates a duplicate copy of the data with a new block ID.
  • This causes unnecessary memory pressure.

Proposed changes:

  • Implement reference counting for cached local relations during session cloning
  • Retain the same block ID and data reference when cloning sessions, incrementing a reference count instead of copying the data
  • Add a hash-to-blockId mapping in ArtifactManager for efficient block lookup
  • Clean up blocks from block manager memory when the reference count reaches zero
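The scheme above can be sketched as a small, self-contained model. This is only an illustration of the reference-counting idea, not the actual ArtifactManager code: the names `BlockRegistry`, `register`, `retain`, and `release` are hypothetical, and the real implementation would operate on Spark's BlockManager and its block IDs.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch of reference-counted cached blocks shared across
// cloned sessions. Names are illustrative, not the real ArtifactManager API.
object BlockRegistry {
  // hash of the local relation data -> block ID (efficient lookup on clone)
  private val hashToBlockId = new ConcurrentHashMap[String, String]()
  // block ID -> number of sessions currently referencing it
  private val refCounts = new ConcurrentHashMap[String, AtomicInteger]()

  /** Register a newly cached block under its data hash with a ref count of 1. */
  def register(dataHash: String, blockId: String): Unit = {
    hashToBlockId.put(dataHash, blockId)
    refCounts.put(blockId, new AtomicInteger(1))
  }

  /** On session clone: reuse the existing block and bump its ref count
    * instead of copying the data under a new block ID. */
  def retain(dataHash: String): Option[String] =
    Option(hashToBlockId.get(dataHash)).map { blockId =>
      refCounts.get(blockId).incrementAndGet()
      blockId
    }

  /** On session close: drop one reference; free the block when it hits zero. */
  def release(dataHash: String): Unit = {
    val blockId = hashToBlockId.get(dataHash)
    if (blockId != null && refCounts.get(blockId).decrementAndGet() == 0) {
      refCounts.remove(blockId)
      hashToBlockId.remove(dataHash)
      // the real code would remove the block from block manager memory here
    }
  }

  /** Current reference count for a block (0 if unknown or already freed). */
  def refCount(blockId: String): Int =
    Option(refCounts.get(blockId)).map(_.get()).getOrElse(0)
}
```

Under this model, cloning a session calls `retain` (no data copy) and closing a clone calls `release`; the block is evicted only when the last referencing session goes away. A production version would also need to make the retain/release pair atomic with respect to concurrent clones.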

Why are the changes needed?

Cloning sessions is a common operation in Spark applications (e.g., for creating isolated execution contexts). The current approach of duplicating cached data can significantly increase memory footprint, especially when:

  • Sessions are cloned frequently
  • Cached relations contain large datasets
  • Multiple clones exist simultaneously

This optimization reduces memory pressure and improves performance by avoiding unnecessary data copies.

Does this PR introduce any user-facing change?

No. This is an internal optimization that improves memory efficiency without changing user-facing APIs or behavior.

How was this patch tested?

  • Added unit tests to verify the reference-counting logic.
  • Verified that existing unit tests for ArtifactManager and session cloning still pass.

Was this patch authored or co-authored using generative AI tooling?

No
