Add spaceflights-pyspark starter #147

Merged 21 commits on Oct 2, 2023
1 change: 1 addition & 0 deletions features/environment.py
@@ -52,6 +52,7 @@ def before_scenario(context, scenario):
"pyspark",
"pyspark-iris",
"spaceflights",
"spaceflights-pyspark",
]
starters_paths = {
starter: str(starters_root / starter) for starter in starter_names
7 changes: 7 additions & 0 deletions features/lint.feature
@@ -34,3 +34,10 @@ Feature: Lint all starters
And I have installed the Kedro project's dependencies
When I lint the project
Then I should get a successful exit code

Scenario: Lint spaceflights-pyspark starter
Given I have prepared a config file
And I have run a non-interactive kedro new with the starter spaceflights-pyspark
And I have installed the Kedro project's dependencies
When I lint the project
Then I should get a successful exit code
7 changes: 7 additions & 0 deletions features/run.feature
@@ -36,3 +36,10 @@ Feature: Run all starters
And I have installed the Kedro project's dependencies
When I run the Kedro pipeline
Then I should get a successful exit code

Scenario: Run a Kedro project created from spaceflights-pyspark
Given I have prepared a config file
And I have run a non-interactive kedro new with the starter spaceflights-pyspark
And I have installed the Kedro project's dependencies
When I run the Kedro pipeline
Then I should get a successful exit code
44 changes: 44 additions & 0 deletions spaceflights-pyspark/README.md
@@ -0,0 +1,44 @@
# The `spaceflights-pyspark` Kedro starter

## Overview

This is a variation of the [spaceflights tutorial project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) described in the [online Kedro documentation](https://docs.kedro.org), with `PySpark` set up.

The code in this repository demonstrates best practices for working with Kedro and PySpark. It contains a Kedro starter template with some initial configuration and two example pipelines, and originates from the [Kedro documentation about how to work with PySpark](https://docs.kedro.org/en/stable/integrations/pyspark_integration.html).

To use this starter, create a new Kedro project and select the `pyspark` add-on.

```bash
pip install kedro
kedro new
cd <my-project-name> # change directory into newly created project directory
```

Install the required dependencies:

```bash
pip install -r src/requirements.txt
```

Now you can run the project:

```bash
kedro run
```

## Features

### Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different [configuration options](https://spark.apache.org/docs/latest/configuration.html), this starter uses `/conf/base/spark.yml` as a single configuration location.
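For illustration, a minimal `spark.yml` might look like the sketch below. The keys shown are standard Spark configuration options, not necessarily the exact set shipped with the starter:

```yaml
# conf/base/spark.yml -- illustrative example.
# Any valid Spark configuration key/value pair can go here.
spark.driver.maxResultSize: 3g
spark.sql.execution.arrow.pyspark.enabled: true
spark.scheduler.mode: FAIR
```

Keeping all Spark options in this one file means the rest of the project never hard-codes Spark settings.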

### `SparkSession` initialisation with `SparkHooks`

This Kedro starter contains the initialisation code for `SparkSession` in `hooks.py` and takes its configuration from `/conf/base/spark.yml`. Modify the `SparkHooks` code if you want to further customise your `SparkSession`, e.g. to use [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).

### Uses transcoding to handle the same data in different formats

In some cases it can be desirable to handle one dataset in different ways, for example to load a Parquet file into your pipeline using `pandas` and to save it using `spark`. In this starter, one of the input datasets, `shuttles`, is an Excel file.
It's not possible to load an Excel file directly into Spark, so we use transcoding to save the file as a `pandas.CSVDataset` first, which then allows us to load it as a `spark.SparkDataset` further on in the pipeline.
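The catalog entries for this transcoding pattern might look like the following sketch (the file paths here are illustrative, not necessarily the ones in the starter):

```yaml
# conf/base/catalog.yml (excerpt, illustrative paths)

# The same CSV file declared twice: once for pandas to write it...
shuttles@pandas:
  type: pandas.CSVDataset
  filepath: data/02_intermediate/shuttles.csv

# ...and once for Spark to read it later in the pipeline.
shuttles@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/shuttles.csv
  file_format: csv
  load_args:
    header: true
    inferSchema: true
```

Kedro treats `shuttles@pandas` and `shuttles@spark` as the same dataset for dependency-resolution purposes, so a node can output the former and a downstream node can consume the latter.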


6 changes: 6 additions & 0 deletions spaceflights-pyspark/cookiecutter.json
@@ -0,0 +1,6 @@
{
"project_name": "Spaceflights Pyspark",
"repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
"python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
"kedro_version": "{{ cookiecutter.kedro_version }}"
}
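The `repo_name` and `python_package` templates above are plain Python string operations, so their effect can be checked by hand for the default project name:

```python
# Reproduce the cookiecutter.json string transforms for the default name.
project_name = "Spaceflights Pyspark"

repo_name = project_name.strip().replace(" ", "-").replace("_", "-").lower()
python_package = project_name.strip().replace(" ", "_").replace("-", "_").lower()

print(repo_name)       # spaceflights-pyspark
print(python_package)  # spaceflights_pyspark
```

So the repository directory is kebab-case while the importable package is snake_case, matching Python packaging conventions.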
9 changes: 9 additions & 0 deletions spaceflights-pyspark/prompts.yml
@@ -0,0 +1,9 @@
project_name:
title: "Project Name"
text: |
Please enter a human-readable name for your new project.
Spaces, hyphens, and underscores are allowed.
regex_validator: "^[\\w -]{2,}$"
error_message: |
It must contain only alphanumeric symbols, spaces, underscores and hyphens and
be at least 2 characters long.
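The behaviour of the `regex_validator` pattern can be checked with Python's `re` module; note that `\w` already covers letters, digits, and underscores, with spaces and hyphens added explicitly by the character class:

```python
import re

# Same pattern as in prompts.yml: at least 2 chars from [letters, digits, _, space, -].
PATTERN = re.compile(r"^[\w -]{2,}$")

valid = ["Spaceflights Pyspark", "my-project", "my_project_2"]
invalid = ["x", "bad!name", ""]

for name in valid:
    assert PATTERN.match(name), name
for name in invalid:
    assert not PATTERN.match(name), name
print("all checks passed")
```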
151 changes: 151 additions & 0 deletions spaceflights-pyspark/{{ cookiecutter.repo_name }}/.gitignore
@@ -0,0 +1,151 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**

# except their sub-folders
!data/**/

# also keep all .gitkeep files
!.gitkeep

# keep also the example dataset
!data/01_raw/*


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
34 changes: 34 additions & 0 deletions spaceflights-pyspark/{{ cookiecutter.repo_name }}/README.md
@@ -0,0 +1,34 @@
# {{ cookiecutter.project_name }}

## Overview

This is your new Kedro project, which was generated using `Kedro {{ cookiecutter.kedro_version }}`.

Take a look at the [Kedro documentation](https://docs.kedro.org) to get started.

## Rules and guidelines

In order to get the best out of the template:

* Don't remove any lines from the `.gitignore` file we provide
* Make sure your results can be reproduced by following a [data engineering convention](https://docs.kedro.org/en/stable/faq/faq.html#what-is-data-engineering-convention)
* Don't commit data to your repository
* Don't commit any credentials or your local configuration to your repository. Keep all your credentials and local configuration in `conf/local/`

## How to install dependencies

Declare any dependencies in `src/requirements.txt` for `pip` installation.

To install them, run:

```bash
pip install -r src/requirements.txt
```

## How to run your Kedro pipeline

You can run your Kedro project with:

```bash
kedro run
```
22 changes: 22 additions & 0 deletions spaceflights-pyspark/{{ cookiecutter.repo_name }}/conf/README.md
@@ -0,0 +1,22 @@
# What is this for?

This folder should be used to store configuration files used by Kedro or by separate tools.

This file can be used to provide users with instructions for how to reproduce local configuration with their own credentials. You can edit the file however you like, but you may wish to retain the information below and add your own notes under the **Instructions** heading.

## Local configuration

The `local` folder should be used for configuration that is either user-specific (e.g. IDE configuration) or protected (e.g. security keys).

> *Note:* Please do not check in any local configuration to version control.

## Base configuration

The `base` folder is for shared configuration, such as non-sensitive and project-related configuration that may be shared across team members.

WARNING: Please do not put access credentials in the base configuration folder.

## Instructions

## Find out more
You can find out more about configuration from the [user guide documentation](https://docs.kedro.org/en/stable/configuration/configuration_basics.html).