Skip to content

Conversation

@BlakeOrth
Copy link
Contributor

Which issue does this PR close?

This does not fully close, but is an incremental building block component for:

The full context of how this code is likely to progress can be seen in the POC for this effort:

Rationale for this change

This change allows a user of datafusion-cli to enable or disable object store profiling, either from the command line or from an internal command while in a cli session.

Note that since no actual instrumentation is implemented for the InstrumentedObjectStores the output for the user at this point is generally not interesting (yet). This PR implements a tapas-sized functional unit of work towards #17207

What changes are included in this PR?

  • Adds a CLI option and command to datafusion-cli to enable or disabled object store profiling
  • Integrates the command with the instrumented object stores to allow the user input to change the mode of the instrumented stores
  • Adds tests to exercise the expected behavior of the commands
  • Adds user docs for the commands/CLI options
  • Updates visibility of InstrumentedObjectStore now that it needs to be interacted with outside of its module

Are these changes tested?

Yes. New Unit tests have been added for all meaningful changes.

Example functional output from datafusion-cli

$ ../../target/debug/datafusion-cli --object-store-profiling enabled
DataFusion CLI v50.1.0
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.170 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Enabled, inner: HttpStore
> \object_store_profiling disabled
ObjectStore Profile mode set to Disabled
> CREATE EXTERNAL TABLE hits2
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet';
0 row(s) fetched.
Elapsed 4.615 seconds.

> \object_store_profiling enabled
ObjectStore Profile mode set to Enabled
> CREATE EXTERNAL TABLE hits3
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_3.parquet';
0 row(s) fetched.
Elapsed 3.349 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Enabled, inner: HttpStore
Instrumented Object Store: instrument_mode: Enabled, inner: HttpStore
Instrumented Object Store: instrument_mode: Enabled, inner: HttpStore

Are there any user-facing changes?

Yes. Unfortunately this work seems to necessitate adding a new public item to the datafusion-cli::print_options::PrintOptions type which is exposed as a public type. Other obvious user-facing changes are the addition of the new command and CLI option to datafusion-cli, which have been documented in the CLI user guide accordingly.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 9, 2025
 - Adds a CLI option and command to datafusion-cli to enable or disabled
   object store profiling
 - Integrates the command with the instrumented object stores to allow
   the user input to change the mode of the instrumented stores
 - Adds tests to exercise the expected behavior of the commands
 - Adds user docs for the commands/CLI options
 - Updates visibility of `InstrumentedObjectStore` now that it needs to
   be interacted with outside of its module
@BlakeOrth BlakeOrth force-pushed the feature/cli_profile_commands branch from 7afd282 to 4060b14 Compare October 9, 2025 23:55
@BlakeOrth
Copy link
Contributor Author

Oops, forgot to cc @alamb

Sorry this one is a bit longer. A decent chunk is added/new tests so hopefully the review for those lines ends up being quicker than the runtime code.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @BlakeOrth -- this looks great to me

I also played around with it locally:

> \object_store_profiling
ObjectStore Profile mode is Disabled
> \object_store_profiling
ObjectStore Profile mode is Disabled
> \object_store_profiling on
Execution error: Failed to parse input: on. Valid options are disabled, enabled
> \object_store_profiling enabled
ObjectStore Profile mode set to Enabled
> \object_store_profiling enabled
ObjectStore Profile mode set to Enabled
> \object_store_profiling disabled
ObjectStore Profile mode set to Disabled

pub async fn main() {
let my_ctx = MyUnionerContext::default();

let profile_mode = InstrumentedObjectStoreMode::default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor code structure thing (not needed) but I think it would be nicer to encapsulate more of this construction into the InstrumentedObjectStoreRegistry

Maybe something like putting the default registries in ::new()
``rust
let instrumented_registry = Arc::new(InstrumentedObjectStoreRegistry::new());


And then maybe you can add builder style APIs if you need, like
```rust
let instrumented_registry = InstrumentedObjectStoreRegistry::new()
  .with_inner(inner_registry)
  .with_profile_mode(mode)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicated code was also bothering me, but I was having trouble of thinking of a clean way to reduce the duplication at the time. I like this suggestion. Considering the InstrumentedObjectStoreRegistry is currently only ever used with a DefaultObjectStoreRegistry as it's inner I'm inclined to only implement with_profile_mode to avoid functionally dead code. What do you think?

Copy link
Contributor Author

@BlakeOrth BlakeOrth Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and modified new() and added the with_profile_mode() as a builder style method. I'm happy to also add with_inner() quickly if you'd prefer having that available in spite of it not (yet?) being used. This was an excellent change for the ergonomics of this type!


/// Returns the current [`InstrumentedObjectStoreMode`] for this
/// [`InstrumentedObjectStoreRegistry`]
pub fn mode(&self) -> InstrumentedObjectStoreMode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please name the methods consistently? mode and set_instrument_mode are not consistent

I think this would be consistent with the other method and the name of the field

Suggested change
pub fn mode(&self) -> InstrumentedObjectStoreMode {
pub fn instrument_mode(&self) -> InstrumentedObjectStoreMode {

for store in self.instrumented_registry.stores() {
writeln!(writer, "{store}")?;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if (as a follow on PR) we should also move the timing into this structure too

I think we mentioned before it is somewhat strange that something named PrintOptions has an object store registry, but I see why it is like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the PrintOptions change is unfortunate here. If you have a recommendation on a better way to handle this I'd be happy to explore a different solution. The fact that we need to support the commands which all leverage print options is what pushed me towards this solution versus something like changing the method signature(s) to pass the registry through. Perhaps PrintOptions has just outgrown its original purpose and could use a refactor/rethink at some point down the road.

I wonder if (as a follow on PR) we should also move the timing into this structure too

I'm not sure I'm understanding you here. Do you mean keeping the records from the store(s) that will be printed in PrintOptions?

@BlakeOrth BlakeOrth force-pushed the feature/cli_profile_commands branch from 0e1b685 to e1a95af Compare October 10, 2025 18:38
 - Adds better methods to build an InstrumentedObjectStoreRegistry to
   reduce code duplication in common usage
 - Enhances test success criteria
 - Normalizes method names
quiet: false,
maxrows: datafusion_cli::print_options::MaxRows::Unlimited,
color: true,
instrumented_registry: Arc::new(InstrumentedObjectStoreRegistry::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb alamb added this pull request to the merge queue Oct 10, 2025
Merged via the queue into apache:main with commit f210939 Oct 10, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants