- Fixed a bug for using
ParallelRunner
on Windows. - Enabled auto-discovery of hooks implementations coming from installed plugins.
- Added the
kedro pipeline pull
CLI command to extract a packaged modular pipeline, and place the contents in a Kedro project. - Added the
--version
option tokedro pipeline package
to allow specifying alternative versions to package under. - Added the
--starter
option tokedro new
to create a new project from a local, remote or aliased starter template. - Added the
kedro starter list
CLI command to list all starter templates that can be used to bootstrap a new Kedro project. - Added
json.JSONDataSet
- Removed
/src/nodes
directory from the project template and madekedro jupyter convert
create it on the fly if necessary. - Fixed a bug in
MatplotlibWriter
which prevented saving lists and dictionaries of plots locally on Windows. - Closed all pyplot windows after saving in
MatplotlibWriter
. - Documentation improvements:
- Added kedro-wings and kedro-great to the list of community plugins.
- Fixed broken versioning for Windows paths.
- Fixed
DataSet
string representation for falsy values. - Improved the error message when duplicate nodes are passed to the
Pipeline
initializer. - Fixed a bug where
kedro docs
would fail because the built docs were located in a different directory. - Fixed a bug where
ParallelRunner
would fail on Windows machines whose reported CPU count exceeded 61. - Fixed an issue with saving TensorFlow model to
h5
file on Windows. - Added a
json
parameter toAPIDataSet
for the convenience of generating requests with JSON bodies. - Fixed dependencies for
SparkDataSet
to include spark.
Deepyaman Datta, Tam-Sanh Nguyen, DataEngineerOne
- Added the following new datasets.
Type | Description | Location |
---|---|---|
pandas.AppendableExcelDataSet |
Works with Excel file opened in append mode |
kedro.extras.datasets.pandas |
tensorflow.TensorFlowModelDataset |
Works with TensorFlow models using TensorFlow 2.X |
kedro.extras.datasets.tensorflow |
holoviews.HoloviewsWriter |
Works with Holoviews objects (saves as image file) |
kedro.extras.datasets.holoviews |
kedro install
will now compile project dependencies (by runningkedro build-reqs
behind the scenes) before the installation if thesrc/requirements.in
file doesn't exist.- Added
only_nodes_with_namespace
inPipeline
class to filter only nodes with a specified namespace. - Added the
kedro pipeline delete
command to help delete unwanted or unused pipelines (it won't remove references to the pipeline in yourcreate_pipelines()
code). - Added the
kedro pipeline package
command to help package up a modular pipeline. It will bundle up the pipeline source code, tests, and parameters configuration into a .whl file.
- Improvement in
DataCatalog
:- Introduced regex filtering to the
DataCatalog.list()
method. - Non-alphanumeric characters (except underscore) in dataset name are replaced with
__
inDataCatalog.datasets
, for ease of access to transcoded datasets.
- Introduced regex filtering to the
- Improvement in Datasets:
- Improved initialization speed of
spark.SparkHiveDataSet
. - Improved S3 cache in
spark.SparkDataSet
. - Added support of options for building
pyarrow
table inpandas.ParquetDataSet
.
- Improved initialization speed of
- Improvement in
kedro build-reqs
CLI command:kedro build-reqs
is now called with-q
option and will no longer print out compiled requirements to the console for security reasons.- All unrecognized CLI options in
kedro build-reqs
command are now passed to pip-compile call (e.g.kedro build-reqs --generate-hashes
).
- Improvement in
kedro jupyter
CLI command:- Improved error message when running
kedro jupyter notebook
,kedro jupyter lab
orkedro ipython
with Jupyter/IPython dependencies not being installed. - Fixed
%run_viz
line magic for showing kedro viz inside a Jupyter notebook. For the fix to be applied on existing Kedro project, please see the migration guide. - Fixed the bug in IPython startup script (issue 298).
- Improved error message when running
- Documentation improvements:
- Updated community-generated content in FAQ.
- Added find-kedro and kedro-static-viz to the list of community plugins.
- Add missing
pillow.ImageDataSet
entry to the documentation.
Even though this release ships a fix for project generated with kedro==0.16.2
, after upgrading, you will still need to make a change in your existing project if it was generated with kedro>=0.16.0,<=0.16.1
for the fix to take effect. Specifically, please change the content of your project's IPython init script located at .ipython/profile_default/startup/00-kedro-init.py
with the content of this file. You will also need kedro-viz>=3.3.1
.
Miguel Rodriguez Gutierrez, Joel Schwarzmann, w0rdsm1th, Deepyaman Datta, Tam-Sanh Nguyen, Marcus Gawronsky
- Fixed deprecation warnings from
kedro.cli
andkedro.context
when runningkedro jupyter notebook
. - Fixed a bug where
catalog
andcontext
were not available in Jupyter Lab and Notebook. - Fixed a bug where
kedro build-reqs
would fail if you didn't have your project dependencies installed.
- Added new CLI commands (only available for the projects created using Kedro 0.16.0 or later):
kedro catalog list
to list datasets in your catalogkedro pipeline list
to list pipelineskedro pipeline describe
to describe a specific pipelinekedro pipeline create
to create a modular pipeline
- Improved the CLI speed by up to 50%.
- Improved error handling when making a typo on the CLI. We now suggest some of the possible commands you meant to type, in
git
-style.
- All modules in
kedro.cli
andkedro.context
have been moved intokedro.framework.cli
andkedro.framework.context
respectively.kedro.cli
andkedro.context
will be removed in future releases. - Added
Hooks
, which is a new mechanism for extending Kedro. - Fixed
load_context
changing user's current working directory. - Allowed the source directory to be configurable in
.kedro.yml
. - Added the ability to specify nested parameter values inside your node inputs, e.g.
node(func, "params:a.b", None)
- Added the following new datasets.
Type | Description | Location |
---|---|---|
pillow.ImageDataSet |
Work with image files using Pillow |
kedro.extras.datasets.pillow |
geopandas.GeoJSONDataSet |
Work with geospatial data using GeoPandas |
kedro.extras.datasets.geopandas.GeoJSONDataSet |
api.APIDataSet |
Work with data from HTTP(S) API requests | kedro.extras.datasets.api.APIDataSet |
- Added
joblib
backend support topickle.PickleDataSet
. - Added versioning support to
MatplotlibWriter
dataset. - Added the ability to install dependencies for a given dataset with more granularity, e.g.
pip install "kedro[pandas.ParquetDataSet]"
. - Added the ability to specify extra arguments, e.g.
encoding
orcompression
, forfsspec.spec.AbstractFileSystem.open()
calls when loading/saving a dataset. See Example 3 under docs.
- Added
namespace
property onNode
, related to the modular pipeline where the node belongs. - Added an option to enable asynchronous loading inputs and saving outputs in both
SequentialRunner(is_async=True)
andParallelRunner(is_async=True)
class. - Added
MemoryProfiler
transformer. - Removed the requirement to have all dependencies for a dataset module to use only a subset of the datasets within.
- Added support for
pandas>=1.0
. - Enabled Python 3.8 compatibility. Please note that a Spark workflow may be unreliable for this Python version as
pyspark
is not fully-compatible with 3.8 yet. - Renamed "features" layer to "feature" layer to be consistent with (most) other layers and the relevant FAQ.
- Fixed a bug where a new version created mid-run by an external system caused inconsistencies in the load versions used in the current run.
- Documentation improvements
- Added instruction in the documentation on how to create a custom runner).
- Updated contribution process in
CONTRIBUTING.md
- added Developer Workflow. - Documented installation of development version of Kedro in the FAQ section.
- Added missing
_exists
method toMyOwnDataSet
example in 04_user_guide/08_advanced_io.
- Fixed a bug where
PartitionedDataSet
andIncrementalDataSet
were not working withs3a
ors3n
protocol. - Added ability to read partitioned parquet file from a directory in
pandas.ParquetDataSet
. - Replaced
functools.lru_cache
withcachetools.cachedmethod
inPartitionedDataSet
andIncrementalDataSet
for per-instance cache invalidation. - Implemented custom glob function for
SparkDataSet
when running on Databricks. - Fixed a bug in
SparkDataSet
not allowing for loading data from DBFS in a Windows machine using Databricks-connect. - Improved the error message for
DataSetNotFoundError
to suggest possible dataset names user meant to type. - Added the option for contributors to run Kedro tests locally without Spark installation with
make test-no-spark
. - Added option to lint the project without applying the formatting changes (
kedro lint --check-only
).
- Deleted obsolete datasets from
kedro.io
. - Deleted
kedro.contrib
andextras
folders. - Deleted obsolete
CSVBlobDataSet
andJSONBlobDataSet
dataset types. - Made
invalidate_cache
method on datasets private. get_last_load_version
andget_last_save_version
methods are no longer available onAbstractDataSet
.get_last_load_version
andget_last_save_version
have been renamed toresolve_load_version
andresolve_save_version
onAbstractVersionedDataSet
, the results of which are cached.- The
release()
method on datasets extendingAbstractVersionedDataSet
clears the cached load and save version. All custom datasets must callsuper()._release()
inside_release()
. TextDataSet
no longer hasload_args
andsave_args
. These can instead be specified underopen_args_load
oropen_args_save
infs_args
.PartitionedDataSet
andIncrementalDataSet
methodinvalidate_cache
was made private:_invalidate_caches
.
- Removed
KEDRO_ENV_VAR
fromkedro.context
to speed up the CLI run time. Pipeline.name
has been removed in favour ofPipeline.tag()
.- Dropped
Pipeline.transform()
in favour ofkedro.pipeline.modular_pipeline.pipeline()
helper function. - Made constant
PARAMETER_KEYWORDS
private, and moved it fromkedro.pipeline.pipeline
tokedro.pipeline.modular_pipeline
. - Layers are no longer part of the dataset object, as they've moved to the
DataCatalog
. - Python 3.5 is no longer supported by the current and all future versions of Kedro.
reminder How do I upgrade Kedro covers a few key things to remember when updating any kedro version.
Since all the datasets (from kedro.io
and kedro.contrib.io
) were moved to kedro/extras/datasets
you must update the type of all datasets in <project>/conf/base/catalog.yml
file.
Here how it should be changed: type: <SomeDataSet>
-> type: <subfolder of kedro/extras/datasets>.<SomeDataSet>
(e.g. type: CSVDataSet
-> type: pandas.CSVDataSet
).
In addition, all the specific datasets like CSVLocalDataSet
, CSVS3DataSet
etc. were deprecated. Instead, you must use generalized datasets like CSVDataSet
.
E.g. type: CSVS3DataSet
-> type: pandas.CSVDataSet
.
Note: No changes required if you are using your custom dataset.
Pipeline.transform()
has been dropped in favour of the pipeline()
constructor. The following changes apply:
- Remember to import
from kedro.pipeline import pipeline
- The
prefix
argument has been renamed tonamespace
- And
datasets
has been broken down into more granular arguments:inputs
: Independent inputs to the pipelineoutputs
: Any output created in the pipeline, whether an intermediary dataset or a leaf outputparameters
:params:...
orparameters
As an example, code that used to look like this with the Pipeline.transform()
constructor:
result = my_pipeline.transform(
datasets={"input": "new_input", "output": "new_output", "params:x": "params:y"},
prefix="pre"
)
When used with the new pipeline()
constructor, becomes:
from kedro.pipeline import pipeline
result = pipeline(
my_pipeline,
inputs={"input": "new_input"},
outputs={"output": "new_output"},
parameters={"params:x": "params:y"},
namespace="pre"
)
Since some modules were moved to other locations you need to update import paths appropriately.
You can find the list of moved files in the 0.15.6
release notes under the section titled Files with a new location
.
Note: If you haven't made significant changes to your
kedro_cli.py
, it may be easier to simply copy the updatedkedro_cli.py
.ipython/profile_default/startup/00-kedro-init.py
and from GitHub or a newly generated project into your old project.
- We've removed
KEDRO_ENV_VAR
fromkedro.context
. To get your existing project template working, you'll need to remove all instances ofKEDRO_ENV_VAR
from your project template:- From the imports in
kedro_cli.py
and.ipython/profile_default/startup/00-kedro-init.py
:from kedro.context import KEDRO_ENV_VAR, load_context
->from kedro.framework.context import load_context
- Remove the
envvar=KEDRO_ENV_VAR
line from the click options inrun
,jupyter_notebook
andjupyter_lab
inkedro_cli.py
- Replace
KEDRO_ENV_VAR
with"KEDRO_ENV"
in_build_jupyter_env
- Replace
context = load_context(path, env=os.getenv(KEDRO_ENV_VAR))
withcontext = load_context(path)
in.ipython/profile_default/startup/00-kedro-init.py
- From the imports in
We have upgraded pip-tools
which is used by kedro build-reqs
to 5.x. This pip-tools
version requires pip>=20.0
. To upgrade pip
, please refer to their documentation.
@foolsgold, Mani Sarkar, Priyanka Shanbhag, Luis Blanche, Deepyaman Datta, Antony Milne, Panos Psimatikas, Tam-Sanh Nguyen, Tomasz Kaczmarczyk, Kody Fischer, Waylon Walker
- Pinned
fsspec>=0.5.1, <0.7.0
ands3fs>=0.3.0, <0.4.1
to fix incompatibility issues with their latest release.
- Added the additional libraries to our
requirements.txt
sopandas.CSVDataSet
class works out of box withpip install kedro
. - Added
pandas
to ourextra_requires
insetup.py
. - Improved the error message when dependencies of a
DataSet
class are missing.
- Added in documentation on how to contribute a custom
AbstractDataSet
implementation.
- Fixed the link to the Kedro banner image in the documentation.
TL;DR We're launching
kedro.extras
, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets inkedro.extras.datasets
usefsspec
to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop, read more about this here. The change will allow #178 to happen in the next major release of Kedro.
An example of this new system can be seen below, loading the CSV SparkDataSet
from S3:
weather:
type: spark.SparkDataSet # Observe the specified type, this affects all datasets
filepath: s3a://your_bucket/data/01_raw/weather* # filepath uses fsspec to indicate the file storage system
credentials: dev_s3
file_format: csv
You can also load data incrementally whenever it is dumped into a directory with the extension to PartionedDataSet
, a feature that allows you to load a directory of files. The IncrementalDataSet
stores the information about the last processed partition in a checkpoint
, read more about this feature here.
- Added
layer
attribute for datasets inkedro.extras.datasets
to specify the name of a layer according to data engineering convention, this feature will be passed tokedro-viz
in future releases. - Enabled loading a particular version of a dataset in Jupyter Notebooks and iPython, using
catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")
. - Added property
run_id
onProjectContext
, used for versioning using theJournal
. To customise your journalrun_id
you can override the private method_get_run_id()
. - Added the ability to install all optional kedro dependencies via
pip install "kedro[all]"
. - Modified the
DataCatalog
's load order for datasets, loading order is the following:kedro.io
kedro.extras.datasets
- Import path, specified in
type
- Added an optional
copy_mode
flag toCachedDataSet
andMemoryDataSet
to specify (deepcopy
,copy
orassign
) the copy mode to use when loading and saving.
Type | Description | Location |
---|---|---|
ParquetDataSet |
Handles parquet datasets using Dask | kedro.extras.datasets.dask |
PickleDataSet |
Work with Pickle files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.pickle |
CSVDataSet |
Work with CSV files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.pandas |
TextDataSet |
Work with text files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.pandas |
ExcelDataSet |
Work with Excel files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.pandas |
HDFDataSet |
Work with HDF using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.pandas |
YAMLDataSet |
Work with YAML files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.yaml |
MatplotlibWriter |
Save with Matplotlib images using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.matplotlib |
NetworkXDataSet |
Work with NetworkX files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.networkx |
BioSequenceDataSet |
Work with bio-sequence objects using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.biosequence |
GBQTableDataSet |
Work with Google BigQuery | kedro.extras.datasets.pandas |
FeatherDataSet |
Work with feather files using fsspec to communicate with the underlying filesystem |
kedro.extras.datasets.pandas |
IncrementalDataSet |
Inherit from PartitionedDataSet and remembers the last processed partition |
kedro.io |
Type | New Location |
---|---|
JSONDataSet |
kedro.extras.datasets.pandas |
CSVBlobDataSet |
kedro.extras.datasets.pandas |
JSONBlobDataSet |
kedro.extras.datasets.pandas |
SQLTableDataSet |
kedro.extras.datasets.pandas |
SQLQueryDataSet |
kedro.extras.datasets.pandas |
SparkDataSet |
kedro.extras.datasets.spark |
SparkHiveDataSet |
kedro.extras.datasets.spark |
SparkJDBCDataSet |
kedro.extras.datasets.spark |
kedro/contrib/decorators/retry.py |
kedro/extras/decorators/retry_node.py |
kedro/contrib/decorators/memory_profiler.py |
kedro/extras/decorators/memory_profiler.py |
kedro/contrib/io/transformers/transformers.py |
kedro/extras/transformers/time_profiler.py |
kedro/contrib/colors/logging/color_logger.py |
kedro/extras/logging/color_logger.py |
extras/ipython_loader.py |
tools/ipython/ipython_loader.py |
kedro/contrib/io/cached/cached_dataset.py |
kedro/io/cached_dataset.py |
kedro/contrib/io/catalog_with_default/data_catalog_with_default.py |
kedro/io/data_catalog_with_default.py |
kedro/contrib/config/templated_config.py |
kedro/config/templated_config.py |
Category | Type |
---|---|
Datasets | BioSequenceLocalDataSet |
CSVGCSDataSet |
|
CSVHTTPDataSet |
|
CSVLocalDataSet |
|
CSVS3DataSet |
|
ExcelLocalDataSet |
|
FeatherLocalDataSet |
|
JSONGCSDataSet |
|
JSONLocalDataSet |
|
HDFLocalDataSet |
|
HDFS3DataSet |
|
kedro.contrib.io.cached.CachedDataSet |
|
kedro.contrib.io.catalog_with_default.DataCatalogWithDefault |
|
MatplotlibLocalWriter |
|
MatplotlibS3Writer |
|
NetworkXLocalDataSet |
|
ParquetGCSDataSet |
|
ParquetLocalDataSet |
|
ParquetS3DataSet |
|
PickleLocalDataSet |
|
PickleS3DataSet |
|
TextLocalDataSet |
|
YAMLLocalDataSet |
|
Decorators | kedro.contrib.decorators.memory_profiler |
kedro.contrib.decorators.retry |
|
kedro.contrib.decorators.pyspark.spark_to_pandas |
|
kedro.contrib.decorators.pyspark.pandas_to_spark |
|
Transformers | kedro.contrib.io.transformers.transformers |
Configuration Loaders | kedro.contrib.config.TemplatedConfigLoader |
- Added the option to set/overwrite params in
config.yaml
using YAML dict style instead of string CLI formatting only. - Kedro CLI arguments
--node
and--tag
support comma-separated values, alternative methods will be deprecated in future releases. - Fixed a bug in the
invalidate_cache
method ofParquetGCSDataSet
andCSVGCSDataSet
. --load-version
now won't break if version value contains a colon.- Enabled running
node
s with duplicate inputs. - Improved error message when empty credentials are passed into
SparkJDBCDataSet
. - Fixed bug that caused an empty project to fail unexpectedly with ImportError in
template/.../pipeline.py
. - Fixed bug related to saving dataframe with categorical variables in table mode using
HDFS3DataSet
. - Fixed bug that caused unexpected behavior when using
from_nodes
andto_nodes
in pipelines using transcoding. - Credentials nested in the dataset config are now also resolved correctly.
- Bumped minimum required pandas version to 0.24.0 to make use of
pandas.DataFrame.to_numpy
(recommended alternative topandas.DataFrame.values
). - Docs improvements.
Pipeline.transform
skips modifying node inputs/outputs containingparams:
orparameters
keywords.- Support for
dataset_credentials
key in the credentials forPartitionedDataSet
is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. - Datasets can have a new
confirm
function which is called after a successful node function execution if the node containsconfirms
argument with such dataset name. - Make the resume prompt on pipeline run failure use
--from-nodes
instead of--from-inputs
to avoid unnecessarily re-running nodes that had already executed. - When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use
--idle-timeout
option to update it. - Added
kedro-viz
to the Kedro project templaterequirements.txt
file. - Removed the
results
andreferences
folder from the project template. - Updated contribution process in
CONTRIBUTING.md
.
- Existing
MatplotlibWriter
dataset incontrib
was renamed toMatplotlibLocalWriter
. kedro/contrib/io/matplotlib/matplotlib_writer.py
was renamed tokedro/contrib/io/matplotlib/matplotlib_local_writer.py
.kedro.contrib.io.bioinformatics.sequence_dataset.py
was renamed tokedro.contrib.io.bioinformatics.biosequence_local_dataset.py
.
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez
- New CLI commands and command flags:
- Load multiple
kedro run
CLI flags from a configuration file with the--config
flag (e.g.kedro run --config run_config.yml
) - Run parametrised pipeline runs with the
--params
flag (e.g.kedro run --params param1:value1,param2:value2
). - Lint your project code using the
kedro lint
command, your project is linted withblack
(Python 3.6+),flake8
andisort
.
- Load multiple
- Load specific environments with Jupyter notebooks using
KEDRO_ENV
which will globally setrun
,jupyter notebook
andjupyter lab
commands using environment variables. - Added the following datasets:
CSVGCSDataSet
dataset incontrib
for working with CSV files in Google Cloud Storage.ParquetGCSDataSet
dataset incontrib
for working with Parquet files in Google Cloud Storage.JSONGCSDataSet
dataset incontrib
for working with JSON files in Google Cloud Storage.MatplotlibS3Writer
dataset incontrib
for saving Matplotlib images to S3.PartitionedDataSet
for working with datasets split across multiple files.JSONDataSet
dataset for working with JSON files that usesfsspec
to communicate with the underlying filesystem. It doesn't supporthttp(s)
protocol for now.
- Added
s3fs_args
to all S3 datasets. - Pipelines can be deducted with
pipeline1 - pipeline2
.
ParallelRunner
now works withSparkDataSet
.- Allowed the use of nulls in
parameters.yml
. - Fixed an issue where
%reload_kedro
wasn't reloading all user modules. - Fixed
pandas_to_spark
andspark_to_pandas
decorators to work with functions with kwargs. - Fixed a bug where
kedro jupyter notebook
andkedro jupyter lab
would run a different Jupyter installation to the one in the local environment. - Implemented Databricks-compatible dataset versioning for
SparkDataSet
. - Fixed a bug where
kedro package
would fail in certain situations wherekedro build-reqs
was used to generaterequirements.txt
. - Made
bucket_name
argument optional for the following datasets:CSVS3DataSet
,HDFS3DataSet
,PickleS3DataSet
,contrib.io.parquet.ParquetS3DataSet
,contrib.io.gcs.JSONGCSDataSet
- bucket name can now be included into the filepath along with the filesystem protocol (e.g.s3://bucket-name/path/to/key.csv
). - Documentation improvements and fixes.
- Renamed entry point for running pip-installed projects to
run_package()
instead ofmain()
insrc/<package>/run.py
. bucket_name
key has been removed from the string representation of the following datasets:CSVS3DataSet
,HDFS3DataSet
,PickleS3DataSet
,contrib.io.parquet.ParquetS3DataSet
,contrib.io.gcs.JSONGCSDataSet
.- Moved the
mem_profiler
decorator tocontrib
and separated thecontrib
decorators so that dependencies are modular. You may need to update your import paths, for example the pyspark decorators should be imported asfrom kedro.contrib.decorators.pyspark import <pyspark_decorator>
instead offrom kedro.contrib.decorators import <pyspark_decorator>
.
Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel
kedro jupyter
now gives the default kernel a sensible name.Pipeline.name
has been deprecated in favour ofPipeline.tags
.- Reuse pipelines within a Kedro project using
Pipeline.transform
, it simplifies dataset and node renaming. - Added Jupyter Notebook line magic (
%run_viz
) to runkedro viz
in a Notebook cell (requireskedro-viz
version 3.0.0 or later). - Added the following datasets:
NetworkXLocalDataSet
inkedro.contrib.io.networkx
to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)SparkHiveDataSet
inkedro.contrib.io.pyspark.SparkHiveDataSet
allowing usage of Spark and insert/upsert on non-transactional Hive tables.
kedro.contrib.config.TemplatedConfigLoader
now supports name/dict key templating and default values.
get_last_load_version()
method for versioned datasets now returns exact last load version if the dataset has been loaded at least once andNone
otherwise.- Fixed a bug in
_exists
method for versionedSparkDataSet
. - Enabled the customisation of the ExcelWriter in
ExcelLocalDataSet
by specifying options underwriter
key insave_args
. - Fixed a bug in IPython startup script, attempting to load context from the incorrect location.
- Removed capping the length of a dataset's string representation.
- Fixed
kedro install
command failing on Windows ifsrc/requirements.txt
contains a different version of Kedro. - Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e.
tags="my_tag"
).
- Removed
_check_paths_consistency()
method fromAbstractVersionedDataSet
. Version consistency check is now done inAbstractVersionedDataSet.save()
. Custom versioned datasets should modifysave()
method implementation accordingly.
Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass
- Narrowed the requirements for
PyTables
so that we maintain support for Python 3.5.
- Added
--load-version
, akedro run
argument that allows you run the pipeline with a particular load version of a dataset. - Support for modular pipelines in
src/
, break the pipeline into isolated parts with reusability in mind. - Support for multiple pipelines, an ability to have multiple entry point pipelines and choose one with
kedro run --pipeline NAME
. - Added a
MatplotlibWriter
dataset incontrib
for saving Matplotlib images. - An ability to template/parameterize configuration files with
kedro.contrib.config.TemplatedConfigLoader
. - Parameters are exposed as a context property for ease of access in iPython / Jupyter Notebooks with
context.params
. - Added
max_workers
parameter forParallelRunner
.
- Users will override the
_get_pipeline
abstract method inProjectContext(KedroContext)
inrun.py
rather than thepipeline
abstract property. Thepipeline
property is not abstract anymore. - Improved an error message when versioned local dataset is saved and unversioned path already exists.
- Added
catalog
global variable to00-kedro-init.py
, allowing you to load datasets withcatalog.load()
. - Enabled tuples to be returned from a node.
- Disallowed the
ConfigLoader
loading the same file more than once, and deduplicated theconf_paths
passed in. - Added a
--open
flag tokedro build-docs
that opens the documentation on build. - Updated the
Pipeline
representation to include name of the pipeline, also making it readable as a context property. kedro.contrib.io.pyspark.SparkDataSet
andkedro.contrib.io.azure.CSVBlobDataSet
now support versioning.
KedroContext.run()
no longer acceptscatalog
andpipeline
arguments.node.inputs
now returns the node's inputs in the order required to bind them properly to the node's function.
Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee
- Extended
versioning
support to cover the tracking of environment setup, code and datasets. - Added the following datasets:
FeatherLocalDataSet
incontrib
for usage with pandas. (by @mdomarsaleem)
- Added
get_last_load_version
andget_last_save_version
toAbstractVersionedDataSet
. - Implemented
__call__
method onNode
to allow for users to executemy_node(input1=1, input2=2)
as an alternative tomy_node.run(dict(input1=1, input2=2))
. - Added new
--from-inputs
run argument.
- Fixed a bug in
load_context()
not loading context in non-Kedro Jupyter Notebooks. - Fixed a bug in
ConfigLoader.get()
not listing nested files for**
-ending glob patterns. - Fixed a logging config error in Jupyter Notebook.
- Updated documentation in
03_configuration
regarding how to modify the configuration path. - Documented the architecture of Kedro showing how we think about library, project and framework components.
extras/kedro_project_loader.py
renamed toextras/ipython_loader.py
and now runs any IPython startup scripts without relying on the Kedro project structure.- Fixed TypeError when validating partial function's signature.
- After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDataSets.
Omar Saleem, Mariana Silva, Anil Choudhary, Craig
- Added
KedroContext
base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner). - Added a new CLI command
kedro jupyter convert
to facilitate converting Jupyter Notebook cells into Kedro nodes. - Added support for
pip-compile
and new Kedro commandkedro build-reqs
that generatesrequirements.txt
based onrequirements.in
. - Running
kedro install
will install packages to conda environment ifsrc/environment.yml
exists in your project. - Added a new
--node
flag tokedro run
, allowing users to run only the nodes with the specified names. - Added new
--from-nodes
and--to-nodes
run arguments, allowing users to run a range of nodes from the pipeline. - Added prefix
params:
to the parameters specified inparameters.yml
which allows users to differentiate between their different parameter node inputs and outputs. - Jupyter Lab/Notebook now starts with only one kernel by default.
- Added the following datasets:
CSVHTTPDataSet
to load CSV using HTTP(s) links.JSONBlobDataSet
to load json (-delimited) files from Azure Blob Storage.ParquetS3DataSet
incontrib
for usage with pandas. (by @mmchougule)CachedDataSet
incontrib
which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)YAMLLocalDataSet
incontrib
to load and save local YAML files. (by @Minyus)
- Documentation improvements including instructions on how to initialise a Spark session using YAML configuration.
anyconfig
default log level changed fromINFO
toWARNING
.- Added information on installed plugins to
kedro info
. - Added style sheets for project documentation, so the output of
kedro build-docs
will resemble the style ofkedro docs
.
- Simplified the Kedro template in
run.py
with the introduction ofKedroContext
class. - Merged
FilepathVersionMixIn
andS3VersionMixIn
under one abstract classAbstractVersionedDataSet
which extendsAbstractDataSet
. name
changed to be a keyword-only argument forPipeline
.CSVLocalDataSet
no longer supports URLs.CSVHTTPDataSet
supports URLs.
This guide assumes that:
- The framework specific code has not been altered significantly
- Your project specific code is stored in the dedicated python package under
src/
.
The breaking changes were introduced in the following project template files:
<project-name>/.ipython/profile_default/startup/00-kedro-init.py
<project-name>/kedro_cli.py
<project-name>/src/tests/test_run.py
<project-name>/src/<package-name>/run.py
<project-name>/.kedro.yml
(new file)
The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using kedro new
) and move code and files bit by bit as suggested in the detailed guide below:
-
Create a new project with the same name by running
kedro new
-
Copy the following folders to the new project:
results/
references/
notebooks/
logs/
data/
conf/
- If you customised your
src/<package>/run.py
, make sure you apply the same customisations tosrc/<package>/run.py
- If you customised
get_config()
, you can overrideconfig_loader
property inProjectContext
derived class - If you customised
create_catalog()
, you can overridecatalog()
property inProjectContext
derived class - If you customised
run()
, you can overriderun()
method inProjectContext
derived class - If you customised default
env
, you can override it inProjectContext
derived class or pass it at construction. By default,env
islocal
. - If you customised default
root_conf
, you can overrideCONF_ROOT
attribute inProjectContext
derived class. By default,KedroContext
base class hasCONF_ROOT
attribute set toconf
.
- The following syntax changes are introduced in ipython or Jupyter notebook/labs:
proj_dir
->context.project_path
proj_name
->context.project_name
conf
->context.config_loader
.io
->context.catalog
(e.g.,io.load()
->context.catalog.load()
)
-
If you customised your
kedro_cli.py
, you need to apply the same customisations to yourkedro_cli.py
in the new project. -
Copy the contents of the old project's
src/requirements.txt
into the new project'ssrc/requirements.in
and, from the project root directory, run thekedro build-reqs
command in your terminal window.
If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:
- Make sure your dataset inherits from
AbstractVersionedDataSet
only. - Call
super().__init__()
with the appropriate arguments in the dataset's__init__
. If storing on local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in anexists_function
and aglob_function
that emulateexists
andglob
in a different filesystem (seeCSVS3DataSet
as an example). - Remove setting of the
_filepath
and_version
attributes in the dataset's__init__
, as this is taken care of in the base abstract class. - Any calls to
_get_load_path
and_get_save_path
methods should take no arguments. - Ensure you convert the output of
_get_load_path
and_get_save_path
appropriately, as these now returnPurePath
s instead of strings. - Make sure
_check_paths_consistency
is called withPurePath
s as input arguments, instead of strings.
These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.
Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami
- Tab completion for catalog datasets in
ipython
orjupyter
sessions. (Thank you @datajoely and @WaylonWalker) - Added support for transcoding, an ability to decouple loading/saving mechanisms of a dataset from its storage location, denoted by adding '@' to the dataset name.
- Datasets have a new
release
function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
- Add support for pipeline nodes made up from partial functions.
- Expand user home directory
~
for TextLocalDataSet (see issue #19). - Add a
short_name
property toNode
s for a display-friendly (but not necessarily unique) name. - Add Kedro project loader for IPython:
extras/kedro_project_loader.py
. - Fix source file encoding issues with Python 3.5 on Windows.
- Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised.
- Remove the max_loads argument from the
MemoryDataSet
constructor and from theAbstractRunner.create_default_data_set
method.
Joel Schwarzmann, Alex Kalmikov
- Added Data Set transformer support in the form of AbstractTransformer and DataCatalog.add_transformer.
- Merged the
ExistsMixin
intoAbstractDataSet
. Pipeline.node_dependencies
returns a dictionary keyed by node, with sets of parent nodes as values;Pipeline
andParallelRunner
were refactored to make use of this for topological sort for node dependency resolution and running pipelines respectively.Pipeline.grouped_nodes
returns a list of sets, rather than a list of lists.
- New I/O module
HDFS3DataSet
.
- Improved API docs.
- Template
run.py
will throw a warning instead of error ifcredentials.yml
is not present.
None
The initial release of Kedro.
Jo Stichbury, Aris Valtazanos, Fabian Peters, Guilherme Braccialli, Joel Schwarzmann, Miguel Beltre, Mohammed ElNabawy, Deepyaman Datta, Shubham Agrawal, Oleg Andreyev, Mayur Chougule, William Ashford, Ed Cannon, Nikhilesh Nukala, Sean Bailey, Vikram Tegginamath, Thomas Huijskens, Musa Bilal
We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.