Core: Store schema and spec in TaskContext to avoid unnecessary deserialization (#11235) #11280

Closed
wants to merge 1 commit into from


@gitzwz gitzwz commented Oct 8, 2024

No description provided.

@github-actions github-actions bot added the core label Oct 8, 2024
@gitzwz
Author

gitzwz commented Oct 11, 2024

This works well when a table's schema has more than 1k columns and the table specs and schema need to be read after a table scan. In our case (2k columns, 25.9 TB, 460,000 files), it reduces the time required to get all specs after tableScan from over 10 minutes to 30 seconds.

@aokolnychyi @RussellSpitzer please take a look when you have time~

@singhpk234
Contributor

This sounds interesting. Is it just the ser-de that causes this? What about the increased memory pressure from holding this in memory?

@gitzwz
Author

gitzwz commented Nov 4, 2024

This sounds interesting. Is it just the ser-de that causes this? What about the increased memory pressure from holding this in memory?

Yes to the first question: in my SDK service, I need to get the spec ID for every FileScanTask, which triggers ser-de each time, and this is one reason it is so slow. The increase in memory pressure is a good question. Is there another way to solve the ser-de problem? I collect all spec IDs to check whether a column was previously used as a partition field.
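
For illustration, the use case described above (checking whether a column was ever a partition field without deserializing a spec for every task) could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `Spec`, `Task`, `parse`, and `everPartitionedBy` are hypothetical stand-ins rather than Iceberg classes, and the `"id:col1,col2"` string format is invented for the example. The idea is to parse each distinct spec ID at most once, so the memory held grows with the number of distinct specs rather than the number of tasks.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: Spec, Task, and parse are hypothetical stand-ins,
// not Iceberg classes or methods.
final class SpecCacheSketch {

    // Stand-in for a parsed partition spec: its ID and the columns it partitions by.
    static final class Spec {
        final int specId;
        final List<String> sourceColumns;

        Spec(int specId, List<String> sourceColumns) {
            this.specId = specId;
            this.sourceColumns = sourceColumns;
        }
    }

    // Stand-in for a scan task that carries its spec in serialized form.
    static final class Task {
        final int specId;
        final String serializedSpec;

        Task(int specId, String serializedSpec) {
            this.specId = specId;
            this.serializedSpec = serializedSpec;
        }
    }

    // Counts parse calls so the effect of the cache is observable.
    static int parseCount = 0;

    // Toy parser standing in for the expensive deserialization step.
    // Assumed format: "1:ts,region" -> Spec(1, [ts, region]).
    static Spec parse(String serialized) {
        parseCount++;
        String[] parts = serialized.split(":", 2);
        List<String> columns = parts[1].isEmpty()
                ? List.of()
                : Arrays.asList(parts[1].split(","));
        return new Spec(Integer.parseInt(parts[0]), columns);
    }

    // Returns true if any spec referenced by the tasks partitions by the given column.
    // Each distinct spec ID is parsed at most once per call.
    static boolean everPartitionedBy(List<Task> tasks, String column) {
        Map<Integer, Spec> specsById = new HashMap<>();
        for (Task task : tasks) {
            specsById.computeIfAbsent(task.specId, id -> parse(task.serializedSpec));
        }
        for (Spec spec : specsById.values()) {
            if (spec.sourceColumns.contains(column)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Task> tasks = List.of(
                new Task(0, "0:ts"),
                new Task(0, "0:ts"),
                new Task(1, "1:ts,region"));
        System.out.println(everPartitionedBy(tasks, "region")); // true
        System.out.println(parseCount); // 2 parses for 3 tasks
    }
}
```

If the caller still holds the `Table` handle, `Table.specs()` already returns the parsed specs keyed by spec ID, which may avoid touching the per-task spec at all; whether that fits the SDK service described here is an assumption.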


github-actions bot commented Dec 5, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 5, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Dec 12, 2024