[KED-1242] Kedro 'core' library without included io DataSets or contrib.io #178

sarchila · 2019-12-03T20:00:11Z

Description

This was discussed in a comment on a separate issue, but I figured it merited its own feature request, so I'll repeat here:

I can't provide Kedro as a library to AWS Glue, because its dependency list includes libraries that rely on C extensions, and those break on Glue.

One thought this raises for me is the possibility of having a version of Kedro that is essentially a pure Python 'Kedro Core' library with no io or contrib.io datasets built in (besides the core AbstractDataSet), leaving each of those to be pip installed separately as io plugins based on one's needs.

That would make it so that I can provide this hypothetical Kedro core library to Glue and not worry that it's going to choke on trying to include pandas or numpy (as I can't use any of those io DataSets anyways in Glue).

Since then I ended up taking my thought a step further by forking Kedro and coarsely removing the non-core functionality (branch here) that causes Kedro to depend on pandas, numpy, and other libraries that I considered not part of the 'core' Kedro runtime context/catalog/pipeline/node machinery. By providing my forked "Kedro core" branch to AWS Glue, I have been able to deploy my Kedro project and run it in Glue successfully 🎉

Context

This opens up the opportunity for Kedro to handle a purely PySpark pipeline use case and allows for simple deployment to AWS Glue, a good choice for running Spark in the cloud without the need to manage one's own cluster.

Possible Implementation

I've also been using the AWS CDK library, and thought Kedro could use a similar approach to what CDK uses: providing a 'core' library and having every other use-case-specific 'io' plugin as a separate small library that could be installed as needed. For example, see https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#hello_world_tutorial_add_bucket
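To make the core/plugin split concrete, here is a minimal sketch of how a pure Python core could avoid importing heavy dependencies until a dataset is actually requested. Everything in it is an assumption for illustration: the class below is a simplified stand-in rather than Kedro's real `AbstractDataSet`, and `resolve_dataset` and the plugin path in the usage note are hypothetical names.

```python
import importlib
from abc import ABC, abstractmethod
from typing import Any


class AbstractDataSet(ABC):
    """Simplified stand-in for the core dataset interface (not Kedro's real class)."""

    @abstractmethod
    def load(self) -> Any:
        ...

    @abstractmethod
    def save(self, data: Any) -> None:
        ...


class MemoryDataSet(AbstractDataSet):
    """Pure Python dataset that a core package could ship without extra dependencies."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


def resolve_dataset(type_path: str) -> type:
    """Hypothetical helper: import a dataset class from a dotted path on demand.

    Because the import happens here rather than at package import time, the
    core never touches pandas, numpy, or similar unless a plugin is requested.
    """
    module_name, _, class_name = type_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got '{type_path}'.")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Dataset '{type_path}' needs an optional plugin that is not installed."
        ) from exc
    return getattr(module, class_name)
```

With this shape, a catalog entry naming something like `some_io_plugin.CSVDataSet` (a made-up path) would only fail when that entry is resolved, so an environment such as Glue could run pipelines that stick to pure Python or Spark datasets.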

lorenabalan · 2019-12-04T10:31:32Z

Good to hear back from you @sarchila! That is excellent news, thank you for sharing! This was added to our backlog a while ago, with a view to delivering it in 2020. We welcome any contributions in this space if you are interested. :)

yetudada · 2020-02-05T18:41:23Z

We're on our way to this issue! We're launching these datasets in the next release: https://github.com/quantumblacklabs/kedro/tree/develop/kedro/extras/datasets

And we will give users time to use these ones instead. The major release following this will have io and contrib dependencies removed from Kedro.

sarchila · 2020-02-05T21:45:13Z

Great news @yetudada - thanks so much for your team's responsiveness on this issue 🙌

yetudada · 2020-03-13T17:39:04Z

@sarchila this issue can finally be closed. Commit ecd7277 has addressed this change. Thank you so much for submitting this request!

sarchila added the "Issue: Feature Request" (New feature or improvement to existing feature) label Dec 3, 2019

lorenabalan changed the title ~~Kedro 'core' library without included io DataSets or contrib.io~~ [KED-1242] Kedro 'core' library without included io DataSets or contrib.io Dec 4, 2019

yetudada added the Type: Opportunity Roadmap label Dec 10, 2019

yetudada mentioned this issue Dec 11, 2019

[KED-1273] Using transformers to specify Python objects in DataSets #182

Closed

yetudada removed the "Issue: Feature Request" (New feature or improvement to existing feature) label Feb 14, 2020

This was referenced Feb 26, 2020

[KED-1403] Delete kedro.contrib #235

Closed

[KED-1408] Delete some of the AbstractDataSets in kedro.io #241

Closed

yetudada closed this as completed Mar 13, 2020
