
Conversation

@lonless9 (Contributor) commented on Jun 30, 2025

part of #171

This PR adds Delta table read/write operations to the Spark SQL and DataFrame APIs.

@lonless9 self-assigned this on Jun 30, 2025
@lonless9 changed the title from "feat: Delta Lake integration" to "feat: delta lake integration" on Jun 30, 2025
@lonless9 marked this pull request as ready for review on July 17, 2025 06:10
@lonless9 requested a review from linhr on July 17, 2025 06:33
@lonless9 changed the title from "feat: delta lake integration" to "feat: delta lake Read/Write operations" on Jul 17, 2025
@linhr (Contributor) left a comment


This is amazing!! Great work!!! 🚀

let mut all_batches = Vec::new();
let mut total_rows = 0u64;

// Execute all partitions and collect the data

This is how the existing implementation works: it collects all data in memory and writes it from a single process. It would be much more scalable if the writer tasks were distributed and ingested data in a streaming fashion.

(This is just a note to explain the future work.)
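To make the trade-off concrete, here is an illustrative sketch in Python (all names, such as ListWriter, write_collected, and write_streaming, are invented for this example; the actual implementation is Rust code operating on Arrow record batches):

```python
# Illustrative sketch only: these names are hypothetical and not part
# of the actual codebase.

class ListWriter:
    """Minimal stand-in for a table writer that buffers record batches."""

    def __init__(self):
        self.batches = []

    def write(self, batches):
        self.batches.extend(batches)


def write_collected(partitions, writer):
    """Current approach: gather every batch in memory, then write once."""
    all_batches = []
    total_rows = 0
    for partition in partitions:
        for batch in partition:
            all_batches.append(batch)
            total_rows += len(batch)
    writer.write(all_batches)  # a single process holds the full dataset
    return total_rows


def write_streaming(partitions, writers):
    """Scalable alternative: each partition streams into its own writer."""
    total_rows = 0
    for partition, writer in zip(partitions, writers):
        for batch in partition:
            writer.write([batch])  # ingest incrementally, bounded memory
            total_rows += len(batch)
    return total_rows
```

Both functions produce the same table contents; the difference is that the streaming variant never materializes more than one batch per partition at a time, so memory stays bounded and the writers can run on separate workers.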

}

#[async_trait]
impl TableProvider for DeltaTableProvider {

WriterBuilder is used here in insert_into(), but we have similar writer logic in DeltaDataSink, while TableProvider is used only for reading. Is my understanding correct?

Comment on lines 12 to 15
# Test constants
YEAR_2025 = 2025
YEAR_2026 = 2026
EXPECTED_RESULT_COUNT = 2

I think it's OK not to define constants here. Using the literal values directly in the tests could make them more readable.

The PR author replied:

Hatch fmt complains about it; I'll fix this.

The reviewer replied:

Oh I see. The Python linter can get annoying sometimes, especially in tests. You can simply bypass a certain rule for a particular line with a # noqa: <RULE> comment (where <RULE> is the rule being violated).
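For example (a hypothetical test line; PLR2004 is ruff's "magic value used in comparison" rule, so substitute whichever rule is actually reported):

```python
# Hypothetical example: silence one lint rule on a single line.
# PLR2004 is ruff's magic-value-comparison rule; replace it with
# whatever rule hatch fmt actually reports.
results = ["row1", "row2"]
assert len(results) == 2  # noqa: PLR2004
```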

@linhr changed the title from "feat: delta lake Read/Write operations" to "feat: basic read/write operations for Delta Lake" on Jul 17, 2025
@lonless9 merged commit c2c28fb into main on Jul 17, 2025
15 checks passed
@lonless9 deleted the delta-lake-integration branch on July 17, 2025 13:07
@lonless9 mentioned this pull request on Aug 29, 2025

Labels: run spark tests (Trigger Spark tests on a pull request)