Partition as args in SparkHiveDataSet #725

Closed
jpoullet2000 opened this issue Mar 16, 2021 · 7 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@jpoullet2000

Description

I can partition my data with SparkDataSet but not with SparkHiveDataSet, whereas I want to save my data in a Hive table and use the _save_validate method to make sure the columns are OK.

Context

Saving my data in Hive without this extra validation on the schema might be risky.

Possible Alternatives

Why not use something similar to SparkDataSet: add a "save_args" parameter to the "__init__" method of SparkHiveDataSet and pass a "partitionBy" item through it. Then use it in the _insert_save method, injecting partitionBy into the SQL statement: something like "INSERT INTO [TABLE] [db_name.]table_name [PARTITION (partition_spec)] select_statement". A rough sketch of the idea follows.
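
A purely illustrative sketch, not actual kedro code: it assumes save_args carries a "partitionBy" list stored as self._save_args in __init__, and reuses the dataset's existing SparkSession accessor.

# Hypothetical sketch of the save_args approach described above.
# Assumes save_args like {"partitionBy": ["year", "month"]}.
def _insert_save(self, data) -> None:
    columns = ", ".join(data.columns)
    partition_cols = self._save_args.get("partitionBy", [])
    # Dynamic partitioning: the partition columns must come last in the
    # select list, and hive.exec.dynamic.partition must be enabled.
    partition_clause = (
        f"PARTITION ({', '.join(partition_cols)}) " if partition_cols else ""
    )
    data.createOrReplaceTempView("tmp")
    self._get_spark().sql(
        f"insert into {self._database}.{self._table} "
        f"{partition_clause}select {columns} from tmp"  # nosec
    )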

@jpoullet2000 jpoullet2000 added the Issue: Feature Request New feature or improvement to existing feature label Mar 16, 2021
@antonymilne
Contributor

antonymilne commented Mar 17, 2021

Hello @jpoullet2000, welcome to kedro, and thank you very much for the suggestion. I don't know too much about spark so can't make any particularly insightful comments, but it sounds like a very sensible idea. My main question would be about how to implement it.

You suggest adding save_args to SparkHiveDataSet.__init__. This would be consistent with SparkDataSet, but I'm not sure it's the right approach here. save_args is typically used in a kedro dataset to be passed into the _save method as **save_args. This would not be possible in SparkHiveDataSet, since there's no function into which those arguments would be passed; instead we'd need to extract save_args["partition_by"] and insert into the SQL statement. If anyone tried specifying other arguments in save_args then they wouldn't actually do anything, which is potentially quite confusing.

Possibly a better approach would be to just add a new partition_by argument. We already have a few arguments here:

def __init__(
    self, database: str, table: str, write_mode: str, table_pk: List[str] = None
) -> None:

... and these are exactly the arguments used in the SQL query in _insert_save:

            f"insert into {self._database}.{self._table} select {columns} from tmp"  # nosec

So if partitioning is a common requirement for this dataset it would seem best to add it as an argument to __init__ rather than using save_args. Does this make sense to you?
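
To make that concrete, a hedged sketch: partition_by is a hypothetical new argument; everything else mirrors the existing signature quoted above.

# Hypothetical: add partition_by alongside the existing __init__ arguments.
def __init__(
    self,
    database: str,
    table: str,
    write_mode: str,
    table_pk: List[str] = None,
    partition_by: List[str] = None,
) -> None:
    ...
    self._partition_by = partition_by or []

_insert_save could then splice it into the query, e.g.

partition_clause = (
    f"PARTITION ({', '.join(self._partition_by)}) " if self._partition_by else ""
)
query = (
    f"insert into {self._database}.{self._table} "
    f"{partition_clause}select {columns} from tmp"  # nosec
)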

@brendalf

It makes sense to me, @AntonyMilneQB.
Is anybody planning to work on this? I would like to implement this.

@antonymilne
Contributor

@brendalf Go for it! You might like to have a quick read of our guide for contributors.

@jpoullet2000
Author

jpoullet2000 commented Mar 18, 2021 via email

pull bot pushed a commit to vishalbelsare/kedro that referenced this issue Apr 4, 2021

Merge master into develop via merge-master-to-develop
@brendalf

brendalf commented Apr 7, 2021

Hi @AntonyMilneQB, can you take a look at #745?

@jiriklein
Contributor

Hi @jpoullet2000, hope you're well!
SparkHiveDataSet has now been fully rewritten and partitionBy support has been added, including access to other save_args.
You can find the changes in the latest develop code. If you wish to wait for a proper release, you can expect these changes to materialise in version 0.18.0.
Hope this helps!
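
For anyone landing here later, usage should look roughly like this. This is a sketch based on the comment above: the import path is the extras location used at the time, and the save_args keys are assumptions to verify against the 0.18.0 API docs.

from kedro.extras.datasets.spark import SparkHiveDataSet

# Assumed usage of the rewritten dataset; parameter names should be
# checked against the released documentation.
data_set = SparkHiveDataSet(
    database="my_db",
    table="my_table",
    write_mode="overwrite",
    save_args={"partitionBy": ["ingestion_date"]},
)
data_set.save(spark_df)  # spark_df: a pyspark.sql.DataFrame to write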

@jiriklein jiriklein self-assigned this May 5, 2021
@jpoullet2000
Author

jpoullet2000 commented May 7, 2021 via email

4 participants