Add spaceflights-pyspark starter #147

Merged 21 commits on Oct 2, 2023
1 change: 1 addition & 0 deletions features/environment.py
@@ -52,6 +52,7 @@ def before_scenario(context, scenario):
"pyspark",
"pyspark-iris",
"spaceflights",
"spaceflights-pyspark",
]
starters_paths = {
starter: str(starters_root / starter) for starter in starter_names
7 changes: 7 additions & 0 deletions features/lint.feature
@@ -34,3 +34,10 @@ Feature: Lint all starters
And I have installed the Kedro project's dependencies
When I lint the project
Then I should get a successful exit code

Scenario: Lint spaceflights-pyspark starter
Given I have prepared a config file
And I have run a non-interactive kedro new with the starter spaceflights-pyspark
And I have installed the Kedro project's dependencies
When I lint the project
Then I should get a successful exit code
7 changes: 7 additions & 0 deletions features/run.feature
@@ -36,3 +36,10 @@ Feature: Run all starters
And I have installed the Kedro project's dependencies
When I run the Kedro pipeline
Then I should get a successful exit code

Scenario: Run a Kedro project created from spaceflights-pyspark
Given I have prepared a config file
And I have run a non-interactive kedro new with the starter spaceflights-pyspark
And I have installed the Kedro project's dependencies
When I run the Kedro pipeline
Then I should get a successful exit code
44 changes: 44 additions & 0 deletions spaceflights-pyspark/README.md
@@ -0,0 +1,44 @@
# The `spaceflights-pyspark` Kedro starter

## Overview

This is a variation of the [spaceflights tutorial project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) described in the [online Kedro documentation](https://docs.kedro.org), with `PySpark` set up.

The code in this repository demonstrates best practices for working with Kedro and PySpark. It contains a Kedro starter template with some initial configuration and two example pipelines, and originates from the [Kedro documentation about how to work with PySpark](https://docs.kedro.org/en/stable/integrations/pyspark_integration.html).

To use this starter, create a new Kedro project and select the `pyspark` add-on.

```bash
pip install kedro
kedro new
cd <my-project-name> # change directory into newly created project directory
```

Install the required dependencies:

```bash
pip install -r src/requirements.txt
```

Now you can run the project:

```bash
kedro run
```

## Features

### Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different [configuration options](https://spark.apache.org/docs/latest/configuration.html), this starter uses `/conf/base/spark.yml` as a single configuration location.
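For illustration, a minimal `spark.yml` might look like the sketch below. The keys shown are standard Spark configuration options, not necessarily the exact set shipped with the starter:

```yaml
# conf/base/spark.yml -- illustrative example.
# Any valid Spark configuration key/value pair can go here.
spark.driver.maxResultSize: 3g
spark.sql.execution.arrow.pyspark.enabled: true
spark.scheduler.mode: FAIR
```

Keeping all Spark options in this one file means the rest of the project never hard-codes Spark settings.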

### `SparkSession` initialisation with `SparkHooks`

This Kedro starter contains the initialisation code for `SparkSession` in `hooks.py` and takes its configuration from `/conf/base/spark.yml`. Modify the `SparkHooks` code if you want to further customise your `SparkSession`, e.g. to use [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).

### Uses transcoding to handle the same data in different formats

In some cases it can be desirable to handle one dataset in different ways, for example to load a Parquet file into your pipeline using `pandas` and to save it using `spark`. In this starter, one of the input datasets, `shuttles`, is an Excel file.
It's not possible to load an Excel file directly into Spark, so we use transcoding to save the file as a `pandas.CSVDataset` first, which then allows us to load it as a `spark.SparkDataset` further on in the pipeline.
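The catalog entries for this transcoding pattern might look like the following sketch (the file paths here are illustrative, not necessarily the ones in the starter):

```yaml
# conf/base/catalog.yml (excerpt, illustrative paths)

# The same CSV file declared twice: once for pandas to write it...
shuttles@pandas:
  type: pandas.CSVDataset
  filepath: data/02_intermediate/shuttles.csv

# ...and once for Spark to read it later in the pipeline.
shuttles@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/shuttles.csv
  file_format: csv
  load_args:
    header: true
    inferSchema: true
```

Kedro treats `shuttles@pandas` and `shuttles@spark` as the same dataset for dependency-resolution purposes, so a node can output the former and a downstream node can consume the latter.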


6 changes: 6 additions & 0 deletions spaceflights-pyspark/cookiecutter.json
@@ -0,0 +1,6 @@
{
"project_name": "Spaceflights Pyspark",
"repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
"python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
"kedro_version": "{{ cookiecutter.kedro_version }}"
}
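The `repo_name` and `python_package` templates above are plain Python string operations, so their effect can be checked by hand for the default project name:

```python
# Reproduce the cookiecutter.json string transforms for the default name.
project_name = "Spaceflights Pyspark"

repo_name = project_name.strip().replace(" ", "-").replace("_", "-").lower()
python_package = project_name.strip().replace(" ", "_").replace("-", "_").lower()

print(repo_name)       # spaceflights-pyspark
print(python_package)  # spaceflights_pyspark
```

So the repository directory is kebab-case while the importable package is snake_case, matching Python packaging conventions.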
9 changes: 9 additions & 0 deletions spaceflights-pyspark/prompts.yml
@@ -0,0 +1,9 @@
project_name:
title: "Project Name"
text: |
Please enter a human-readable name for your new project.
Spaces, hyphens, and underscores are allowed.
regex_validator: "^[\\w -]{2,}$"
error_message: |
It must contain only alphanumeric symbols, spaces, underscores and hyphens and
be at least 2 characters long.
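The behaviour of the `regex_validator` pattern can be checked with Python's `re` module; note that `\w` already covers letters, digits, and underscores, with spaces and hyphens added explicitly by the character class:

```python
import re

# Same pattern as in prompts.yml: at least 2 chars from [letters, digits, _, space, -].
PATTERN = re.compile(r"^[\w -]{2,}$")

valid = ["Spaceflights Pyspark", "my-project", "my_project_2"]
invalid = ["x", "bad!name", ""]

for name in valid:
    assert PATTERN.match(name), name
for name in invalid:
    assert not PATTERN.match(name), name
print("all checks passed")
```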
151 changes: 151 additions & 0 deletions spaceflights-pyspark/{{ cookiecutter.repo_name }}/.gitignore
@@ -0,0 +1,151 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**

# except their sub-folders
!data/**/

# also keep all .gitkeep files
!.gitkeep

# keep also the example dataset
!data/01_raw/*


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
34 changes: 34 additions & 0 deletions spaceflights-pyspark/{{ cookiecutter.repo_name }}/README.md
@@ -0,0 +1,34 @@
# {{ cookiecutter.project_name }}

## Overview

This is your new Kedro project, which was generated using `Kedro {{ cookiecutter.kedro_version }}`.

Take a look at the [Kedro documentation](https://docs.kedro.org) to get started.

## Rules and guidelines

In order to get the best out of the template:

* Don't remove any lines from the `.gitignore` file we provide
* Make sure your results can be reproduced by following a [data engineering convention](https://docs.kedro.org/en/stable/faq/faq.html#what-is-data-engineering-convention)
* Don't commit data to your repository
* Don't commit any credentials or your local configuration to your repository. Keep all your credentials and local configuration in `conf/local/`

## How to install dependencies

Declare any dependencies in `src/requirements.txt` for `pip` installation.

To install them, run:

```bash
pip install -r src/requirements.txt
```

## How to run your Kedro pipeline

You can run your Kedro project with:

```bash
kedro run
```
22 changes: 22 additions & 0 deletions spaceflights-pyspark/{{ cookiecutter.repo_name }}/conf/README.md
@@ -0,0 +1,22 @@
# What is this for?

This folder should be used to store configuration files used by Kedro or by separate tools.

This file can be used to provide users with instructions for how to reproduce local configuration with their own credentials. You can edit the file however you like, but you may wish to retain the information below and add your own notes under the **Instructions** heading.

## Local configuration

The `local` folder should be used for configuration that is either user-specific (e.g. IDE configuration) or protected (e.g. security keys).

> *Note:* Please do not check in any local configuration to version control.

## Base configuration

The `base` folder is for shared configuration, such as non-sensitive and project-related configuration that may be shared across team members.

WARNING: Please do not put access credentials in the base configuration folder.

## Instructions

## Find out more
You can find out more about configuration from the [user guide documentation](https://docs.kedro.org/en/stable/configuration/configuration_basics.html).