Skip to content

Conversation

@cj-zhukov
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@cj-zhukov
Copy link
Contributor Author

High-level overview

This PR stores example datasets directly inside the datafusion-examples crate and replaces inlined data with real files.

The goal is to keep examples self-contained and stable, avoiding dependencies on internal test data or external repositories that may change over time. This makes examples easier to run, review, and maintain, especially when the crate is used standalone.

That said, I’m not certain this is the best long-term solution. I’m open to discussing alternative approaches (for example, a shared or external dataset) if the community prefers a different way to store and manage example data.

@cj-zhukov
Copy link
Contributor Author

@martin-g Thanks a lot for the detailed feedback and suggestions -- they were very helpful and I really appreciate the review.

One broader question I wanted to ask: what do you think about storing example data directly inside the datafusion-examples crate? My motivation here was to keep the examples self-contained and avoid dependencies on external datasets, but I’m not fully sure this is the best long-term approach.
Do you think this is reasonable, or would you prefer a different solution (for example, shared or external datasets) for managing example data?

@martin-g
Copy link
Member

If the data is used only by the -examples crate then it should be in the crate itself.
If the same data is used also in another crate then it should be put in a neutral folder, e.g. "../shared/data/". If you need to make it look like a local data then you can use symbolic links.

@cj-zhukov
Copy link
Contributor Author

@martin-g Thanks, that makes sense.
For this PR, I was treating these files as example-specific, but I see that they’re already used in other crates.
I’m happy to either:

  • keep them in datafusion-examples if we consider them example-only, or
  • move them to a shared location if that’s preferred going forward.

Since this affects project structure, I’d love to hear what others think would be the most appropriate location.

@alamb
Copy link
Contributor

alamb commented Dec 18, 2025

I don't have strong opinions one way or the other to be honest.

I do think it would be nice to avoid a copy if possible. I don't think there is anything really special about many of the files (e.g. aggregate_100), they just happened to be available when creating the example.

My suggestion is to consolidate the examples to a smaller number of example specific files (for example, can we rewrite the examples from using datafusion-examples/data/csv/aggregate_test_100.csv to use cars.csv?

@cj-zhukov
Copy link
Contributor Author

aggregate_test_100.csv

Thanks, that clarification really helps.

Based on your feedback, I’ll move forward by keeping the example data inside the datafusion-examples crate, but refactoring the examples to rely on a smaller and cleaner set of example-focused datasets.

As a first step, I’ll replace aggregate_test_100.csv with cars.csv and update the examples accordingly. Later we can consolidate further and remove other legacy test files once we rewrite those examples.

This keeps the examples self-contained while avoiding unnecessary duplication of the larger test data.

@cj-zhukov
Copy link
Contributor Author

I rewrote the examples from using aggregate_test_100.csv to use cars.csv


/// Use the DataFrame API to execute the following subquery:
/// select t1.c1, t1.c2 from t1 where t1.c2 in (select max(t2.c2) from t2 where t2.c1 > 0 ) limit 3;
/// select t1.car, t1.speed from t1 where t1.speed in (select max(t2.speed) from t2 where t2.car = 'red') limit 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i actually think this is much more readable now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid copying the alltypes-plain example file too? That likely is also just some arbitrary choice of file in the example that could be rewritten to something more useful

// Find the local path of "alltypes_plain.parquet"
let testdata = datafusion::test_util::parquet_test_data();
let filename = &format!("{testdata}/alltypes_plain.parquet");
let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can change this example to read in the cars.csv file and then write it back out as an encrypted parquet file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense.

I agree that using fewer, clearer datasets in the examples is a good direction. I can look into rewriting this parquet_encrypted example to read from cars.csv and then write the encrypted parquet output, instead of relying on alltypes_plain.parquet.

I’ll prototype that approach and report back once I confirm everything works cleanly with the encryption workflow.

@cj-zhukov
Copy link
Contributor Author

Replaced the static alltypes_plain.parquet with cars.csv, which is read at runtime and written to a temporary Parquet directory.
This keeps the example self-contained while still exercising the same listing-table and Parquet query logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants