Add support for InMemoryRelation #137

Closed
sunchao opened this issue Feb 29, 2024 · 3 comments · Fixed by #206
Labels
enhancement New feature or request

Comments

@sunchao
Member

sunchao commented Feb 29, 2024

What is the problem the feature request solves?

Currently, Comet is not triggered when Spark users read data from a cached RDD. To support this use case, we need to add support for Spark's InMemoryRelation.

It looks like we may need an Arrow-based implementation of CachedBatchSerializer.

Describe the potential solution

Add Comet support for InMemoryRelation, so that Spark queries that start from a cached RDD can also use Comet native execution.

Additional context

This is not a priority right now, but it would be good to have in the future.

sunchao added the enhancement label Feb 29, 2024
@advancedxy
Contributor

Another way to read InMemoryRelation is to wrap it with a CometRowToColumnarExec, as I proposed in #119.
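
For readers unfamiliar with the idea, a row-to-columnar step conceptually pivots a sequence of rows into one array per column. The plain-Python sketch below only illustrates that pivot. It is not Comet's CometRowToColumnarExec, and the helper name `rows_to_columns` is hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple


def rows_to_columns(
    schema: Sequence[str], rows: List[Tuple[Any, ...]]
) -> Dict[str, List[Any]]:
    """Pivot row tuples into one list per column (hypothetical helper)."""
    # Start with an empty list for every column named in the schema.
    columns: Dict[str, List[Any]] = {name: [] for name in schema}
    for row in rows:
        if len(row) != len(schema):
            raise ValueError("row width does not match schema")
        # Every field of every row is visited once; this per-row work is
        # the conversion cost a columnar cache format would avoid.
        for name, value in zip(schema, row):
            columns[name].append(value)
    return columns
```

The loop touches each field of each row, so the conversion cost grows with the number of rows times the number of columns.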

@sunchao
Member Author

sunchao commented Mar 1, 2024

@advancedxy Yea, CometRowToColumnarExec could be a more general solution, not only for InMemoryRelation, but also for other types of data sources like CSV, JSON, etc. The advantage of implementing Arrow for CachedBatchSerializer here is that we can avoid the extra cost from row to columnar conversion, and potentially be more space efficient because of better compression.
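
To illustrate why a columnar layout can compress better, here is a small plain-Python sketch. Run-length encoding is used purely as a stand-in: the thread does not say which compression an Arrow-based serializer would use, and the sample data and the name `run_length_encode` are made up for illustration.

```python
from itertools import groupby
from typing import Any, Iterable, List, Tuple


def run_length_encode(values: Iterable[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sample data: (country, id) rows.
rows = [("US", 1), ("US", 2), ("US", 3), ("DE", 4)]

# Row-major layout interleaves fields, so equal values are rarely adjacent.
row_major = [field for row in rows for field in row]

# Column-major layout keeps each column's values together.
country_column = [row[0] for row in rows]

row_major_runs = run_length_encode(row_major)
country_runs = run_length_encode(country_column)
```

In this sample the interleaved row-major sequence yields one run per field, while the `country` column collapses into two runs.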

@advancedxy
Contributor

> The advantage of implementing Arrow for CachedBatchSerializer here is that we can avoid the extra cost from row to columnar conversion, and potentially be more space efficient because of better compression.

Yea, of course. I get the rationale. We can always add specialized operators to improve performance, as long as it's worth the effort and someone is interested in implementing them.


2 participants