Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Aug 3, 2025

Which issue does this PR close?

Rationale for this change

Currently, the DataFusion CLI has limited visibility into query memory usage. Users cannot easily track peak memory, cumulative allocations, or operator-level memory breakdowns. This PR introduces memory profiling features, providing actionable insights for debugging, tuning, and capacity planning. It also strengthens error reporting with clearer context about memory consumers.

What changes are included in this PR?

  • CLI Enhancements:

    • Added new command \memory_profiling to toggle profiling during CLI sessions.
    • Automatic memory usage report after queries when profiling is enabled.
    • Support for top-N memory consumer tracking.
  • Core Enhancements:

    • Introduced TrackedPool trait for memory pools with consumer tracking.
    • Added ConsumerMemoryMetrics and format_metrics for reporting.
    • Enhanced MemoryReservation to track peak usage and provide human-readable summaries.
    • Implemented operator-level categorization for reporting (e.g., Sorting, Aggregation).
  • Examples and Documentation:

    • Added memory_profiling.rs example demonstrating programmatic usage.
    • Updated CLI user guide with new memory profiling section.
  • Tests and Snapshots:

    • Added CLI integration tests for memory profiling enable/disable behavior.
    • Updated test snapshots for dynamic memory outputs.
    • Adjusted existing tests to accept both backtrace and planning error outputs.

Are these changes tested?

Yes:

  • Unit tests for TrackConsumersPool and memory metrics.
  • Integration tests for CLI commands (cli_memory_auto_report, cli_memory_disable_stops_report).
  • Example added to validate manual profiling.

Are there any user-facing changes?

Yes:

  • New CLI command: \memory_profiling (toggle profiling on/off).
  • New runtime option --top-memory-consumers N (default now 5, previously 3).
  • Enhanced error messages with memory consumer metrics.
  • New example memory_profiling.rs.
  • Updated CLI usage documentation.

No breaking API changes; all new features are additive.

@github-actions github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate physical-plan Changes to the physical-plan crate labels Aug 3, 2025
kosiew added 22 commits August 3, 2025 19:01
…y profiling and tracker fields in SessionStateBuilder
…blank lines

- Fixed the spelling of "Apche" to "Apache" in the license comment.
- Removed unnecessary blank lines in `no_grouping.rs`, `symmetric_hash_join.rs`, and `sort.rs`.
@kosiew kosiew requested a review from alamb August 21, 2025 11:04
@kosiew kosiew changed the title Add Memory Profiling Support to DataFusion CLI Add memory profiling support to DataFusion CLI and memory pool metrics Aug 21, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kosiew -- I am sorry for the slow reviews, but I am really having trouble reviewing this PR -- I don't know why there are more unrelated changes that showed up here the changes to the CLI testing

Are you by any chance using a AI coding tool? If so, perhaps you can spend some time reviewing the PRs a bit more yourself and spit them into smaller PRs for review?

Some of the other changes in this PR are nice (like the change to tracked pool, potentially) but they seem unrelated to printing the contents of the memory

I am sorry for another round of feedback (this is like the 3rd time and there are already 280 commits 🤯 )

In general, I think we would have an easier time reviewing PRs (and get them merged faster) if you could keep them smaller and more focused

let lower_stdout = stdout.to_lowercase();
let lower_stderr = stderr.to_lowercase();

let has_planning_error = lower_stdout.contains("failed to coerce arguments")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand any of these changes -- why would we accept arrow cast errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image 🤦 This seemed to be from tests related to another branch related to cast_column for struct support. I am very sorry that I did not notice the unrelated change creeping into this PR.

/// Returns a string with peak usage, cumulative allocations, and totals per
/// operator category. The caller is responsible for printing the returned
/// string if desired.
pub fn format_metrics(metrics: &[ConsumerMemoryMetrics]) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this formatting stuff should be put in the CLI code for now (not the main repo)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved

) -> Result<DataFrame, DataFusionError>;

/// Return true if memory profiling is enabled.
fn memory_profiling(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the memory profiling flag would best be stored on PrintOptions, similarly to the quiet mode flag (that suppresses execution time printing). Then you would not need to introduce so much new code and a new trait

https://github.com/apache/datafusion/blob/df45d186d34f2ac131d64e4a068d9f39b35e99c7/datafusion-cli/src/print_options.rs#L73-L72

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require print_options to be mut because we can
toggle memory_profiling
It was moved out of print_options after a comment that print_options should not be mut for memory profiling.

@@ -1,9 +1,9 @@
---
source: tests/cli_integration.rs
source: datafusion-cli/tests/cli_integration.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change these files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These unrelated changes have been cleaned up.

@kosiew kosiew marked this pull request as draft August 25, 2025 12:32
@kosiew kosiew changed the title Add memory profiling support to DataFusion CLI and memory pool metrics DRAFT Add memory profiling support to DataFusion CLI and memory pool metrics Aug 25, 2025
@github-actions github-actions bot removed the core Core DataFusion crate label Aug 25, 2025
kosiew added 6 commits August 25, 2025 23:20
… documentation

Updated the default value for the `--top-memory-consumers` option from 3 to 5 in the CLI usage documentation to reflect the correct configuration.
…acking

Added detailed metrics collection for memory consumption in the memory profiling example. This includes peak memory and cumulative allocation bytes for each operator or execution stage, along with a summarized output by operator category.
@kosiew kosiew changed the title DRAFT Add memory profiling support to DataFusion CLI and memory pool metrics Add memory profiling support to DataFusion CLI and memory pool metrics Aug 26, 2025
@kosiew kosiew marked this pull request as ready for review August 26, 2025 07:40
@kosiew kosiew requested a review from alamb August 26, 2025 07:42
kosiew and others added 5 commits August 31, 2025 16:32
Resolved conflicts in memory pool implementation:
- Kept cumulative memory tracking feature in TrackedConsumer
- Updated test snapshots to match current error message format
- All conflicts resolved in favor of HEAD branch (memory-16904)
@github-actions
Copy link

github-actions bot commented Nov 6, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation execution Related to the execution crate Stale PR has not had any activity for some time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants