feat: Support directory paths in scans for Parquet, IPC and CSV #17017

Merged (3 commits) on Jun 18, 2024

Conversation

nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented Jun 17, 2024

This PR enables passing directory paths to scan_parquet, scan_ipc and scan_csv. Each directory is traversed recursively and every file found in it is loaded.

There is also some refactoring around hive partition logic to prepare for future behavior changes.
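
As a rough illustration of the recursive traversal described above, here is a minimal sketch that expands directory entries in a path list into the files beneath them. It uses only the Python standard library; `expand_paths` is a hypothetical helper written for this example and is not the Polars implementation, which may differ in ordering and in how it filters files.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def expand_paths(paths: list[Path]) -> list[Path]:
    """Replace each directory in `paths` with the files found beneath it.

    Hypothetical helper for illustration; not the Polars implementation.
    """
    expanded: list[Path] = []
    for path in paths:
        if path.is_dir():
            # Recurse through all subdirectories; sort for a stable order.
            expanded.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        else:
            # Plain file paths pass through unchanged.
            expanded.append(path)
    return expanded


# Build a small nested layout and expand it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.parquet").touch()
    (root / "nested" / "b.parquet").touch()
    found = [p.relative_to(root).as_posix() for p in expand_paths([root])]

print(found)
```

The sketch keeps explicit file paths untouched, so a caller can mix files and directories in one list.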

@github-actions github-actions bot added the internal (An internal refactor or improvement) label on Jun 17, 2024

codspeed-hq bot commented Jun 17, 2024

CodSpeed Performance Report

Merging #17017 will not alter performance

Comparing nameexhaustion:scan-dir (8786193) with main (915eb08)

Summary

✅ 37 untouched benchmarks

@nameexhaustion nameexhaustion force-pushed the scan-dir branch 2 times, most recently from 7487f00 to 0d4526b, on June 18, 2024 05:12

codecov bot commented Jun 18, 2024

Codecov Report

Attention: Patch coverage is 88.79668% with 27 lines in your changes missing coverage. Please review.

Project coverage is 80.92%. Comparing base (7e19e04) to head (666648e).
Report is 3 commits behind head on main.

Current head 666648e differs from pull request most recent head 8786193

Please upload reports for the commit 8786193 to get more accurate results.

Files                                                  Patch %  Lines
crates/polars-lazy/src/scan/file_list_reader.rs        92.30%   9 Missing ⚠️
crates/polars-plan/src/plans/hive.rs                   77.77%   6 Missing ⚠️
crates/polars-lazy/src/scan/csv.rs                     72.72%   3 Missing ⚠️
crates/polars-lazy/src/scan/ipc.rs                     77.77%   2 Missing ⚠️
crates/polars-lazy/src/scan/parquet.rs                 84.61%   2 Missing ⚠️
...es/polars-mem-engine/src/executors/scan/parquet.rs  92.30%   2 Missing ⚠️
crates/polars-plan/src/plans/ir/dot.rs                 0.00%    1 Missing ⚠️
py-polars/polars/io/parquet/functions.py               66.66%   1 Missing ⚠️
py-polars/src/lazyframe/visitor/nodes.rs               0.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17017      +/-   ##
==========================================
- Coverage   80.93%   80.92%   -0.01%     
==========================================
  Files        1448     1448              
  Lines      190704   190710       +6     
  Branches     2723     2723              
==========================================
+ Hits       154338   154340       +2     
- Misses      35862    35866       +4     
  Partials      504      504              


@nameexhaustion nameexhaustion changed the title from "test directory scans" to "feat: Support directory paths in scans for Parquet, IPC and CSV" on Jun 18, 2024
@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars) and rust (Related to Rust Polars) labels on Jun 18, 2024
@@ -104,19 +205,9 @@ pub trait LazyFileListReader: Clone {
        true
    }

    /// Path of the scanned file.
    /// It can be potentially a glob pattern.
    fn path(&self) -> &Path;
Collaborator Author

Removed the path function from the trait. A single path is now represented as a length-1 slice in paths.
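
The idea of treating a single path as a one-element collection can be sketched as follows. This is an illustrative Python analogue of the Rust change, not code from the PR; `to_paths` and `PathLike` are hypothetical names chosen for this example.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def to_paths(source: Union[PathLike, Iterable[PathLike]]) -> tuple[Path, ...]:
    """Normalize one path or many paths into a tuple of paths.

    Hypothetical helper: downstream code only ever sees a sequence,
    so it needs no separate single-path accessor.
    """
    if isinstance(source, (str, Path)):
        # A single path becomes a length-1 tuple.
        return (Path(source),)
    return tuple(Path(p) for p in source)


single = to_paths("data/file.parquet")
many = to_paths(["a.csv", "b.csv"])
print(len(single), len(many))
```

With this shape, code that consumes the paths has one case to handle instead of two.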

@@ -191,9 +192,6 @@ def data_file(
def test_scan(
    capfd: Any, monkeypatch: pytest.MonkeyPatch, data_file: _DataFile, force_async: bool
) -> None:
    if data_file.path.suffix == ".csv" and force_async:
Collaborator Author

@nameexhaustion nameexhaustion Jun 18, 2024

Drive-by: enable these parametric tests for CSV now that it supports async.

}

// Todo:
// This maintains existing behavior - will remove very soon.
Collaborator Author

will change this and add tests in a follow-up PR

@@ -80,6 +81,7 @@ pub enum DslPlan {
        paths: Arc<[PathBuf]>,
        // Option as this is mostly materialized on the IR phase.
        file_info: Option<FileInfo>,
        hive_parts: Option<Vec<Arc<HivePartitions>>>,
Collaborator Author

Moved this out of file_info. We now store the hive parts for every path in a Vec that can be resolved up-front, which replaces the current approach of using update_hive_partitions.
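
To make the "one entry per path, resolved up-front" idea concrete, here is a minimal sketch in Python. It assumes the common hive layout in which directory segments have the form key=value. The function names are hypothetical and the parsing is deliberately simplified (no type inference, no URL decoding), so it should not be read as what HivePartitions actually does.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def parse_hive_parts(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a single path.

    Hypothetical, simplified parser for illustration only.
    """
    parts: dict[str, str] = {}
    # Only directory segments are inspected; the file name is skipped.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            parts[key] = value
    return parts


def resolve_hive_parts(paths: list[str]) -> list[dict[str, str]]:
    """Resolve partitions for every path at once, in the same order as `paths`."""
    return [parse_hive_parts(p) for p in paths]


paths = [
    "data/year=2024/month=06/a.parquet",
    "data/year=2023/month=12/b.parquet",
]
hive_parts = resolve_hive_parts(paths)
print(hive_parts)
```

Because the result is index-aligned with the path list, a reader can look up the partition values for file i directly instead of updating shared state while iterating.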

@ritchie46
Member

Can you do a rebase? 🙈 The physical engine is moved to its own crate.

@nameexhaustion
Collaborator Author

rebased 👍

Member

@ritchie46 ritchie46 left a comment

Cool stuff. Left some questions.

crates/polars-lazy/src/scan/csv.rs (resolved)
crates/polars-lazy/src/scan/file_list_reader.rs (outdated, resolved)
@ritchie46 ritchie46 merged commit 306a918 into pola-rs:main Jun 18, 2024
26 checks passed
@nameexhaustion nameexhaustion deleted the scan-dir branch June 19, 2024 01:33
@nameexhaustion nameexhaustion self-assigned this Jun 24, 2024
@c-peters c-peters added the accepted (Ready for implementation) label on Jun 24, 2024
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), internal (An internal refactor or improvement), python (Related to Python Polars), rust (Related to Rust Polars)