Partition as args in SparkHiveDataSet #725

Closed
jpoullet2000 opened this issue Mar 16, 2021 · 7 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@jpoullet2000

Description

I can partition my data with SparkDataSet but not with SparkHiveDataSet, whereas I want to save my data in a Hive table and use the _save_validate method to make sure the columns are OK.

Context

Saving my data in Hive without this extra validation on the schema might be risky.

Possible Alternatives

Why not use something similar to SparkDataSet: add a "save_args" parameter to the "__init__" method of SparkHiveDataSet and pass a "partitionBy" item through it. Then use it in the _insert_save method, injecting partitionBy into the SQL statement: something like "INSERT INTO [TABLE] [db_name.]table_name [PARTITION (partition_spec)] select_statement". A rough sketch of the idea follows.
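
A purely illustrative sketch, not actual kedro code: it assumes save_args carries a "partitionBy" list stored as self._save_args in __init__, and reuses the dataset's existing SparkSession accessor.

# Hypothetical sketch of the save_args approach described above.
# Assumes save_args like {"partitionBy": ["year", "month"]}.
def _insert_save(self, data) -> None:
    columns = ", ".join(data.columns)
    partition_cols = self._save_args.get("partitionBy", [])
    # Dynamic partitioning: the partition columns must come last in the
    # select list, and hive.exec.dynamic.partition must be enabled.
    partition_clause = (
        f"PARTITION ({', '.join(partition_cols)}) " if partition_cols else ""
    )
    data.createOrReplaceTempView("tmp")
    self._get_spark().sql(
        f"insert into {self._database}.{self._table} "
        f"{partition_clause}select {columns} from tmp"  # nosec
    )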

@jpoullet2000 jpoullet2000 added the Issue: Feature Request New feature or improvement to existing feature label Mar 16, 2021
@antonymilne
Contributor

antonymilne commented Mar 17, 2021

Hello @jpoullet2000, welcome to kedro, and thank you very much for the suggestion. I don't know too much about spark so can't make any particularly insightful comments, but it sounds like a very sensible idea. My main question would be about how to implement it.

You suggest adding save_args to SparkHiveDataSet.__init__. This would be consistent with SparkDataSet, but I'm not sure it's the right approach here. save_args is typically used in a kedro dataset to be passed into the _save method as **save_args. This would not be possible in SparkHiveDataSet, since there's no function into which those arguments would be passed; instead we'd need to extract save_args["partition_by"] and insert into the SQL statement. If anyone tried specifying other arguments in save_args then they wouldn't actually do anything, which is potentially quite confusing.

Possibly a better approach would be to just add a new partition_by argument. We already have a few arguments here:

def __init__(
    self, database: str, table: str, write_mode: str, table_pk: List[str] = None
) -> None:

... and these are exactly the arguments used in the SQL query in _insert_save:

            f"insert into {self._database}.{self._table} select {columns} from tmp"  # nosec

So if partitioning is a common requirement for this dataset it would seem best to add it as an argument to __init__ rather than using save_args. Does this make sense to you?
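
To make that concrete, a hedged sketch: partition_by is a hypothetical new argument; everything else mirrors the existing signature quoted above.

# Hypothetical: add partition_by alongside the existing __init__ arguments.
def __init__(
    self,
    database: str,
    table: str,
    write_mode: str,
    table_pk: List[str] = None,
    partition_by: List[str] = None,
) -> None:
    ...
    self._partition_by = partition_by or []

_insert_save could then splice it into the query, e.g.

partition_clause = (
    f"PARTITION ({', '.join(self._partition_by)}) " if self._partition_by else ""
)
query = (
    f"insert into {self._database}.{self._table} "
    f"{partition_clause}select {columns} from tmp"  # nosec
)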

@brendalf

It makes sense to me, @AntonyMilneQB.
Is anybody planning to work on this? I would like to implement this.

@antonymilne
Contributor

@brendalf Go for it! You might like to have a quick read of our guide for contributors.

@jpoullet2000
Author

jpoullet2000 commented Mar 18, 2021 via email

pull bot pushed a commit to vishalbelsare/kedro that referenced this issue Apr 4, 2021

Merge master into develop via merge-master-to-develop
@brendalf

brendalf commented Apr 7, 2021

Hi @AntonyMilneQB, can you take a look at #745?

@jiriklein
Contributor

Hi @jpoullet2000, hope you're well!
SparkHiveDataSet has now been fully rewritten and partitionBy support has been added, including access to other save_args.
You can find the changes in the latest develop code. If you wish to wait for a proper release, you can expect these changes to materialise in version 0.18.0.
Hope this helps!
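
For anyone landing here later, usage should look roughly like this. This is a sketch based on the comment above: the import path is the extras location used at the time, and the save_args keys are assumptions to verify against the 0.18.0 API docs.

from kedro.extras.datasets.spark import SparkHiveDataSet

# Assumed usage of the rewritten dataset; parameter names should be
# checked against the released documentation.
data_set = SparkHiveDataSet(
    database="my_db",
    table="my_table",
    write_mode="overwrite",
    save_args={"partitionBy": ["ingestion_date"]},
)
data_set.save(spark_df)  # spark_df: a pyspark.sql.DataFrame to write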

@jiriklein jiriklein self-assigned this May 5, 2021
@jpoullet2000
Author

jpoullet2000 commented May 7, 2021 via email

4 participants