-
Notifications
You must be signed in to change notification settings - Fork 69
feat: basic read/write operations for Delta Lake #578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
linhr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing!! Great work!!! 🚀
| let mut all_batches = Vec::new(); | ||
| let mut total_rows = 0u64; | ||
|
|
||
| // Execute all partitions and collect the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how the existing implementation collects all data in memory and writes all data using a single process. It would be much more scalable if the writer tasks are distributed and ingest data in a streaming fashion.
(This is just a note to explain the future work.)
| } | ||
|
|
||
| #[async_trait] | ||
| impl TableProvider for DeltaTableProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WriterBuilder is used here in insert_into(), but we have similar writer logic in DeltaDataSink while we use TableProvider only for reading. Is my understanding correct?
| # Test constants | ||
| YEAR_2025 = 2025 | ||
| YEAR_2026 = 2026 | ||
| EXPECTED_RESULT_COUNT = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's Ok to not define constants here. Using the literal values directly in the tests could make them more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hatch fmt complains about it, I'll fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. The Python linter can get annoying sometimes, especially in tests. You can simply bypass a certain rule for a particular line via # noqa: <RULE> comments (where <RULE> is the rule in violation).
part of #171
Delta table read/write operations in Spark SQL and DataFrame APIs.