-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Store example data directly inside the datafusion-examples (#19141) #19319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Store example data directly inside the datafusion-examples (#19141) #19319
Conversation
High-level overviewThis PR stores example datasets directly inside the The goal is to keep examples self-contained and stable, avoiding dependencies on internal test data or external repositories that may change over time. This makes examples easier to run, review, and maintain, especially when the crate is used standalone. That said, I’m not certain this is the best long-term solution. I’m open to discussing alternative approaches (for example, a shared or external dataset) if the community prefers a different way to store and manage example data. |
|
@martin-g Thanks a lot for the detailed feedback and suggestions -- they were very helpful and I really appreciate the review. One broader question I wanted to ask: what do you think about storing example data directly inside the |
|
If the data is used only by the -examples crate then it should be in the crate itself. |
|
@martin-g Thanks, that makes sense.
Since this affects project structure, I’d love to hear what others think would be the most appropriate location. |
|
I don't have strong opinions one way or the other to be honest. I do think it would be nice to avoid a copy if possible. I don't think there is anything really special about many of the files (e.g. aggregate_100), they just happened to be available when creating the example. My suggestion is to consolidate the examples to a smaller number of example specific files (for example, can we rewrite the examples from using datafusion-examples/data/csv/aggregate_test_100.csv to use cars.csv? |
Thanks, that clarification really helps. Based on your feedback, I’ll move forward by keeping the example data inside the datafusion-examples crate, but refactoring the examples to rely on a smaller and cleaner set of example-focused datasets. As a first step, I’ll replace This keeps the examples self-contained while avoiding unnecessary duplication of the larger test data. |
|
I rewrote the examples from using |
|
|
||
| /// Use the DataFrame API to execute the following subquery: | ||
| /// select t1.c1, t1.c2 from t1 where t1.c2 in (select max(t2.c2) from t2 where t2.c1 > 0 ) limit 3; | ||
| /// select t1.car, t1.speed from t1 where t1.speed in (select max(t2.speed) from t2 where t2.car = 'red') limit 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i actually think this is much more readable now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree with you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid copying the alltypes-plain example file too? That likely is also just some arbitrary choice of file in the example that could be rewritten to something more useful
| // Find the local path of "alltypes_plain.parquet" | ||
| let testdata = datafusion::test_util::parquet_test_data(); | ||
| let filename = &format!("{testdata}/alltypes_plain.parquet"); | ||
| let path = PathBuf::from(env!("CARGO_MANIFEST_DIR")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can change this example to read in the cars.csv file and then write it back out as an encrypted parquet file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes sense.
I agree that using fewer, clearer datasets in the examples is a good direction. I can look into rewriting this parquet_encrypted example to read from cars.csv and then write the encrypted parquet output, instead of relying on alltypes_plain.parquet.
I’ll prototype that approach and report back once I confirm everything works cleanly with the encryption workflow.
|
Replaced the static |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?