Create new spaceflight starters #2838

Closed
merelcht opened this issue Jul 25, 2023 · 11 comments

Comments

@merelcht
Member

merelcht commented Jul 25, 2023

Description

Follow up on #2758

Context

Create a new suite of starters all based on spaceflights and add them to a repo.
We'll need at least:

  1. spaceflights based on pandas (the existing spaceflights starter)
  2. spaceflights based on pyspark (Create new spaceflights-pyspark starter #2984)
  3. spaceflights based on pandas with viz features enabled (Create new spaceflights-pandas-viz starter #2985)
  4. spaceflights based on pyspark with viz features enabled (Create new spaceflights-pyspark-viz starter #2986)

Possible Additional examples

(Not to be done now #2838 (comment))

  • spaceflights with Databricks setup
  • spaceflights with Airflow setup

Note
When the full list of examples has been decided on we should break this ticket down so there's one for each example and they can be tackled separately.

@merelcht
Member Author

As a follow-up on this ticket, we should look at how we deal with overlaps in spaceflight projects. Can we somehow combine them to lessen the maintenance burden?

@amandakys

amandakys commented Jul 27, 2023

As a follow-up to the conversation we had about #2844 and #2752, @DimedS @deepyaman would also like further discussion on the /conf setup of our starters, i.e. what and how many catalog and parameters files we supply.

@merelcht
Member Author

As a follow-up to the conversation we had about #2844 and #2752, @DmitriiDeriabinQB @deepyaman would also like further discussion on the /conf setup of our starters, i.e. what and how many catalog and parameters files we supply.

@amandakys Again I think you meant to tag @DimedS, the other Dmitrii left the team a while ago 😉

@deepyaman
Member

As a follow-up on this ticket, we should look at how we deal with overlaps in spaceflight projects. Can we somehow combine them to lessen the maintenance burden?

💯 I think most repos I've seen that have this level of duplication have automated processes to generate the different versions.

@yetudada
Contributor

I think we may need to consider an empty starter for PySpark. Alloy uses it to hold the code for the verticals. I'll leave @datajoely to add a comment here.

@datajoely
Contributor

Yes you are! We can manage this otherwise if you're desperate to delete it. But today we require a starter that has the hook + spark.yml in it. cc @marc-solomon @imdoroshenko

@stichbury
Contributor

stichbury commented Jul 31, 2023

Also worth noting is that we have a plan to revise the spaceflights data #2008 and have recently considered reducing its size (@noklam commented "Any reason that we can’t trim the dataset? As a starter that gets used in demos and testing, it takes a considerable amount of time to run the pipeline. For example, in the Kedro bootcamp I see demoing catalog.load("shuttles") takes like 15-20 seconds and is a bit awkward for demo purposes."). So we need one set of data in a single location.
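
A minimal sketch of what trimming the dataset could look like, assuming the data has been exported to CSV. The helper name `trim_csv` and any file paths passed to it are hypothetical and not part of Kedro or the starter:

```python
import csv
from itertools import islice


def trim_csv(src_path, dst_path, max_rows):
    """Copy the header and at most `max_rows` data rows from src to dst.

    Returns the number of data rows written (the header is not counted).
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader, None)
        if header is None:
            return 0  # empty input file, nothing to copy
        writer.writerow(header)

        written = 0
        # islice stops after max_rows rows without reading the rest of the file
        for row in islice(reader, max_rows):
            writer.writerow(row)
            written += 1
        return written
```

Taking the first rows is the simplest choice; a random sample might keep the demo data more representative, at the cost of reading the whole file once.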

@astrojuanlu
Member

But today we require a starter that has the hook + spark.yml in it.

Wondering if this could be pip install kedro-spark 🤔 I see lots of people copy-pasting the Spark hook, but I'm not sure how much flexibility is needed

@merelcht
Member Author

As a follow-up on this ticket, we should look at how we deal with overlaps in spaceflight projects. Can we somehow combine them to lessen the maintenance burden?

💯 I think most repos I've seen that have this level of duplication have automated processes to generate the different versions.

@deepyaman Do you have any examples of repos like this? I'd be interested to see how they manage the different versions.

@yetudada
Contributor

yetudada commented Aug 21, 2023

The revised plan of action is that, to ship this feature in 0.19.0, we will create starters for:

  • Spaceflights based on Pandas (the existing Spaceflights starter) - spaceflights-pandas
  • Spaceflights based on PySpark - spaceflights-pyspark
  • Spaceflights based on Pandas with Kedro-Viz features enabled - spaceflights-pandas-viz
  • Spaceflights based on PySpark with viz features enabled - spaceflights-pyspark-viz
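
The four starter names listed above fall out of a two-axis matrix (dataframe backend, Kedro-Viz features on or off). A minimal sketch of enumerating them, which is one way an automated process could drive the generation of variants; the function `starter_names` and the constants are hypothetical and not part of Kedro:

```python
from itertools import product

# Assumption: these two axes are inferred from the list above; they are not
# an official Kedro configuration.
BACKENDS = ("pandas", "pyspark")
VIZ_OPTIONS = (False, True)


def starter_names(backends=BACKENDS, viz_options=VIZ_OPTIONS):
    """Return one starter name per (backend, viz) combination."""
    names = []
    for backend, viz in product(backends, viz_options):
        suffix = "-viz" if viz else ""
        names.append(f"spaceflights-{backend}{suffix}")
    return names


if __name__ == "__main__":
    for name in starter_names():
        print(name)
```

Assuming the starters are published under these names, each one could then be passed to `kedro new --starter=<name>`.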

In the short term, our support for Airflow and Databricks will be through documentation. We will have to tell users to edit spaceflights-pyspark. Then, post 0.19.x, we'll look at improving the workflow when using Airflow and Databricks. At a later stage, we'll also look at how to reduce the duplication across these starters #2874.

To that end:


7 participants