Store example data directly inside the datafusion-examples (#19141) #19319

cj-zhukov · 2025-12-14T13:22:10Z

Which issue does this PR close?

Closes #19141.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

cj-zhukov · 2025-12-14T13:27:24Z

High-level overview

This PR stores example datasets directly inside the datafusion-examples crate and replaces inlined data with real files.

The goal is to keep examples self-contained and stable, avoiding dependencies on internal test data or external repositories that may change over time. This makes examples easier to run, review, and maintain, especially when the crate is used standalone.

That said, I’m not certain this is the best long-term solution. I’m open to discussing alternative approaches (for example, a shared or external dataset) if the community prefers a different way to store and manage example data.

datafusion-examples/examples/data_io/parquet_exec_visitor.rs

datafusion-examples/examples/builtin_functions/regexp.rs

datafusion-examples/examples/sql_ops/query.rs

cj-zhukov · 2025-12-16T05:58:57Z

@martin-g Thanks a lot for the detailed feedback and suggestions -- they were very helpful and I really appreciate the review.

One broader question I wanted to ask: what do you think about storing example data directly inside the datafusion-examples crate? My motivation here was to keep the examples self-contained and avoid dependencies on external datasets, but I’m not fully sure this is the best long-term approach.
Do you think this is reasonable, or would you prefer a different solution (for example, shared or external datasets) for managing example data?

martin-g · 2025-12-16T07:22:58Z

If the data is used only by the -examples crate then it should be in the crate itself.
If the same data is also used in another crate, then it should be put in a neutral folder, e.g. "../shared/data/". If you need to make it look like local data, you can use symbolic links.
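
The symbolic-link idea could be sketched roughly as follows. This is only an illustration, not code from the PR: the helper name `link_shared_data` and the directory layout are hypothetical, and the sketch is Unix-only because it relies on `std::os::unix::fs::symlink`.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: expose `shared_dir` inside `crate_dir` under `link_name`,
/// so that code in the crate can keep using what looks like a local data folder.
#[cfg(unix)]
pub fn link_shared_data(
    shared_dir: &Path,
    crate_dir: &Path,
    link_name: &str,
) -> io::Result<PathBuf> {
    let link = crate_dir.join(link_name);
    // `symlink_metadata` does not follow links, so an entry that already
    // exists at this path (even a dangling link) is detected and left alone.
    if link.symlink_metadata().is_ok() {
        return Ok(link);
    }
    // Create `link` pointing at `shared_dir`. A relative target would be
    // resolved against the directory that contains the link.
    std::os::unix::fs::symlink(shared_dir, &link)?;
    Ok(link)
}
```

In a repository the link would more likely be created once and checked in with a relative target such as the "../shared/data/" folder mentioned above, rather than created at runtime.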

cj-zhukov · 2025-12-17T09:34:38Z

@martin-g Thanks, that makes sense.
For this PR, I was treating these files as example-specific, but I see that they’re already used in other crates.
I’m happy to either:

keep them in datafusion-examples if we consider them example-only, or
move them to a shared location if that’s preferred going forward.

Since this affects project structure, I’d love to hear what others think would be the most appropriate location.

alamb · 2025-12-18T17:13:52Z

I don't have strong opinions one way or the other to be honest.

I do think it would be nice to avoid a copy if possible. I don't think there is anything really special about many of the files (e.g. aggregate_100), they just happened to be available when creating the example.

My suggestion is to consolidate the examples onto a smaller number of example-specific files. For example, can we rewrite the examples that use datafusion-examples/data/csv/aggregate_test_100.csv to use cars.csv instead?

cj-zhukov · 2025-12-19T08:32:52Z

Thanks, that clarification really helps.

Based on your feedback, I’ll move forward by keeping the example data inside the datafusion-examples crate, but refactoring the examples to rely on a smaller and cleaner set of example-focused datasets.

As a first step, I’ll replace aggregate_test_100.csv with cars.csv and update the examples accordingly. Later we can consolidate further and remove other legacy test files once we rewrite those examples.

This keeps the examples self-contained while avoiding unnecessary duplication of the larger test data.

cj-zhukov · 2025-12-19T14:53:34Z

I rewrote the examples to use cars.csv instead of aggregate_test_100.csv.

alamb · 2025-12-19T17:05:43Z

datafusion-examples/examples/dataframe/dataframe.rs


 /// Use the DataFrame API to execute the following subquery:
-/// select t1.c1, t1.c2 from t1 where t1.c2 in (select max(t2.c2) from t2 where t2.c1 > 0 ) limit 3;
+/// select t1.car, t1.speed from t1 where t1.speed in (select max(t2.speed) from t2 where t2.car = 'red') limit 3;


I actually think this is much more readable now.

cj-zhukov · 2025-12-20

I totally agree with you.

alamb · 2025-12-19T17:06:50Z

datafusion-examples/data/parquet/alltypes_plain.parquet

Can we avoid copying the alltypes_plain example file too? That is likely also just an arbitrary choice of file in the example, which could be rewritten to use something more useful.

alamb · 2025-12-19T17:07:52Z

datafusion-examples/examples/data_io/parquet_encrypted.rs

-    // Find the local path of "alltypes_plain.parquet"
-    let testdata = datafusion::test_util::parquet_test_data();
-    let filename = &format!("{testdata}/alltypes_plain.parquet");
+    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))


Maybe we can change this example to read in the cars.csv file and then write it back out as an encrypted parquet file?

cj-zhukov · 2025-12-20

Thanks, that makes sense.

I agree that using fewer, clearer datasets in the examples is a good direction. I can look into rewriting this parquet_encrypted example to read from cars.csv and then write the encrypted parquet output, instead of relying on alltypes_plain.parquet.

I’ll prototype that approach and report back once I confirm everything works cleanly with the encryption workflow.

cj-zhukov · 2025-12-23T12:08:42Z

Replaced the static alltypes_plain.parquet with cars.csv, which is read at runtime and written to a temporary Parquet directory.
This keeps the example self-contained while still exercising the same listing-table and Parquet query logic.
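
For readers following along, resolving a data file relative to the crate root (the pattern visible in the truncated `CARGO_MANIFEST_DIR` diff line earlier in the thread) might look roughly like the sketch below. It is not the PR's code: the helper names are hypothetical, and `option_env!` with a "." fallback is used here instead of `env!` only so the snippet also compiles outside Cargo.

```rust
use std::io;
use std::path::PathBuf;

/// Hypothetical helper: build a path to an example data file relative to the
/// crate root, e.g. `example_data_path("data/csv/cars.csv")`.
pub fn example_data_path(relative: &str) -> PathBuf {
    // Cargo sets CARGO_MANIFEST_DIR at compile time. The "." fallback is an
    // assumption of this sketch for builds that do not go through Cargo.
    let crate_root = option_env!("CARGO_MANIFEST_DIR").unwrap_or(".");
    PathBuf::from(crate_root).join(relative)
}

/// Same as above, but returns a readable error when the file is missing,
/// instead of letting a later reader fail with a less specific message.
pub fn existing_example_data_path(relative: &str) -> io::Result<PathBuf> {
    let path = example_data_path(relative);
    if path.is_file() {
        Ok(path)
    } else {
        Err(io::Error::new(
            io::ErrorKind::NotFound,
            format!("example data file not found: {}", path.display()),
        ))
    }
}
```

Anchoring the path at the crate root means the example finds its data regardless of the directory it is run from.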

Store example data directly inside the datafusion-examples (apache#19141)

9e35cfe

run prettier

6daedb6

martin-g reviewed Dec 15, 2025

datafusion-examples/examples/data_io/parquet_exec_visitor.rs (outdated)

datafusion-examples/examples/builtin_functions/regexp.rs

datafusion-examples/examples/sql_ops/query.rs (outdated)

martin-g reviewed Dec 15, 2025

datafusion-examples/examples/sql_ops/query.rs (outdated)

preserve file:// & fix comments

6603e3e

replace aggregate_test_100.csv with cars.csv

591f61c

alamb reviewed Dec 19, 2025

replace alltypes_plain.parquet with cars.csv

d77de01

cj-zhukov and others added 4 commits December 23, 2025 15:19

Merge branch 'main' into cj-zhukov/store-example-data-directly-inside…-examples

b02e636

fix fmt issues

7d35fcc

fix fmt issues

e6b30ac

Fix issues causing GitHub checks to fail

449f3aa
