
[Subtask] support fileset DDL operations for spark-connector #2461

Open · caican00 opened this issue Mar 8, 2024 · 8 comments
Labels: subtask (Subtasks of umbrella issue)

Comments

caican00 (Collaborator) commented Mar 8, 2024

Describe the subtask

Support fileset DDL operations, such as create, drop, etc.
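As a purely illustrative sketch, the DDL could look something like the following from Spark SQL. The `FILESET` keyword, option names, and paths below are hypothetical; no syntax has been agreed on yet, and this would require the connector support that this issue proposes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fileset-ddl-sketch").getOrCreate()

# Hypothetical syntax: assumes the spark-connector would add fileset DDL support.
spark.sql("""
    CREATE FILESET IF NOT EXISTS gravitino_catalog.my_schema.my_fileset
    COMMENT 'training data for model X'
    LOCATION 'hdfs://namenode:9000/data/my_fileset'
""")

spark.sql("DROP FILESET IF EXISTS gravitino_catalog.my_schema.my_fileset")
```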

Parent issue

#1227

caican00 added the subtask (Subtasks of umbrella issue) label Mar 8, 2024
caican00 (Collaborator, Author) commented Mar 8, 2024

Hi @FANNG1, what do you think of this?

FANNG1 (Contributor) commented Mar 8, 2024

From the user's perspective, Spark SQL normally operates on tables. How should Spark operate on filesets? cc @jerryshao

caican00 (Collaborator, Author) commented Mar 8, 2024

> From the user's perspective, Spark SQL normally operates on tables. How should Spark operate on filesets? cc @jerryshao

Refer to Databricks' volumes, which provide DDL operations for volumes. cc @FANNG1 @jerryshao
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-volume.html
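For reference, the documented Databricks syntax looks roughly like this (shown via PySpark's `spark.sql`; this only runs on Databricks with Unity Catalog, and all names below are placeholders):

```python
# Managed volume: storage location is handled by Unity Catalog.
spark.sql("CREATE VOLUME IF NOT EXISTS main.default.my_volume COMMENT 'raw files'")

# External volume: points at an existing cloud storage path.
spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS main.default.my_ext_volume
    LOCATION 's3://my-bucket/path/to/dir'
    COMMENT 'externally managed files'
""")
```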

jerryshao (Contributor) commented

I think Spark/Spark SQL can support operating on fileset data via SQL/RDD/DataFrame by using #1700; we don't have to do anything more.

The link above is about manipulating the volume (fileset) itself using SQL, which requires an SQL extension. Currently, we don't have a plan to do that.
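To make the first point concrete: operating on fileset data (as opposed to the fileset itself) could look like the sketch below. This assumes a Hadoop-compatible virtual filesystem is on the classpath and that fileset paths follow a `gvfs://fileset/...` layout; the scheme and path structure here are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fileset-read-sketch").getOrCreate()

# DataFrame access: read files stored under a fileset path (path layout assumed).
df = spark.read.parquet("gvfs://fileset/my_catalog/my_schema/my_fileset/dt=2024-03-08")
df.show()

# RDD access over the same virtual path.
lines = spark.sparkContext.textFile("gvfs://fileset/my_catalog/my_schema/my_fileset/logs")
print(lines.count())
```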

caican00 (Collaborator, Author) commented Mar 8, 2024

> I think Spark/Spark SQL can support operating on fileset data via SQL/RDD/DataFrame by using #1700; we don't have to do anything more.
>
> The link above is about manipulating the volume (fileset) itself using SQL, which requires an SQL extension. Currently, we don't have a plan to do that.

cc @coolderli

coolderli (Collaborator) commented

> I think Spark/Spark SQL can support operating on fileset data via SQL/RDD/DataFrame by using #1700; we don't have to do anything more.
>
> The link above is about manipulating the volume (fileset) itself using SQL, which requires an SQL extension. Currently, we don't have a plan to do that.

@jerryshao Do we have a plan to support fileset operations such as listing files, dropping files, and so on? If we want to implement TTL, we may need an interface to operate on the fileset. There may be some ambiguity about the positioning of the fileset: it is managed by Gravitino, and we already support creating tables through Gravitino, so why not support creating filesets? Some users may prefer SQL over the UI.

Actually, I think it is admittedly not consistent with Gravitino's positioning. But we could supply tools or actions to help users manage filesets. It may not be our current highest priority, but we could implement it later.

jerryshao (Contributor) commented

I didn't say we won't do it; what I said is that we don't have a plan to do it currently.

For ML users/data scientists, our Python client can be used to manage filesets; it is much more straightforward than SQL (which needs a separate query engine like Spark besides the ML engine).

For data engineers, the Java client can be used in their programs (such as Spark programs) to achieve this.

Providing an SQL interface is just an alternative to Java/Python; I don't see it as a must-have for now. So IMO, achieving this in SQL is not a super high priority. If you have a concrete scenario that requires SQL support, we can have an offline discussion about it.
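For illustration, managing a fileset from a client could look like the sketch below. The imports, class, and method names here are assumptions for the sake of the example, not the confirmed client API.

```python
# Illustrative sketch only: the imports and method names below are hypothetical,
# not the confirmed Gravitino Python client surface.
from gravitino import GravitinoClient, NameIdentifier

client = GravitinoClient(uri="http://localhost:8090", metalake_name="my_metalake")
catalog = client.load_catalog("fileset_catalog")

# Create a fileset that points at an existing storage location.
catalog.as_fileset_catalog().create_fileset(
    NameIdentifier.of("my_schema", "my_fileset"),
    comment="training data",
    storage_location="hdfs://namenode:9000/data/my_fileset",
    properties={},
)

# Drop it when no longer needed.
catalog.as_fileset_catalog().drop_fileset(NameIdentifier.of("my_schema", "my_fileset"))
```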

coolderli (Collaborator) commented

> I didn't say we won't do it; what I said is that we don't have a plan to do it currently.
>
> For ML users/data scientists, our Python client can be used to manage filesets; it is much more straightforward than SQL (which needs a separate query engine like Spark besides the ML engine).
>
> For data engineers, the Java client can be used in their programs (such as Spark programs) to achieve this.
>
> Providing an SQL interface is just an alternative to Java/Python; I don't see it as a must-have for now. So IMO, achieving this in SQL is not a super high priority. If you have a concrete scenario that requires SQL support, we can have an offline discussion about it.

I much appreciate your response; no offense intended. I completely agree with your point that this is not the highest priority right now.
