# 0.15.6

## Major features and improvements
TL;DR We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP) and Hadoop; read more about this here. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:
```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
You can also load data incrementally whenever it is dumped into a directory with the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores the information about the last processed partition in a `checkpoint`; read more about this feature here.
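A minimal sketch of an `IncrementalDataSet` catalog entry (the dataset name, path and underlying dataset type here are illustrative, not taken from this release):

```yaml
weather_readings:
  type: IncrementalDataSet            # resolved from kedro.io
  path: data/01_raw/weather_readings/ # directory that receives new partitions over time
  dataset: pandas.CSVDataSet          # how each individual partition file is loaded
```

On each run, only the partitions added since the stored checkpoint are loaded; confirming the dataset (for example via a node's `confirms` argument, described under "Bug fixes and other changes") advances the checkpoint.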
## New features
- Added `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases (see the first sketch after this list).
- Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional Kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets. A dataset `type` is now resolved in the following order: `kedro.io`, then `kedro.extras.datasets`, then the import path specified in `type` (see the second sketch after this list).
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify the copy mode (`deepcopy`, `copy` or `assign`) to use when loading and saving (see the third sketch after this list).
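A minimal sketch of the new `layer` attribute in a catalog entry (dataset name, path and layer value are illustrative):

```yaml
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv
  layer: raw  # data engineering convention layer, to be surfaced by kedro-viz in future releases
```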
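Given that load order, the shorthand and fully qualified forms below should resolve to the same dataset class (entries are illustrative):

```yaml
shuttles:
  type: pandas.CSVDataSet  # found via kedro.extras.datasets
  filepath: data/01_raw/shuttles.csv

shuttles_explicit:
  type: kedro.extras.datasets.pandas.CSVDataSet  # explicit import path, checked last
  filepath: data/01_raw/shuttles.csv
```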
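And a sketch of the `copy_mode` flag on a `CachedDataSet`, assuming the flag can be set in the catalog entry the same way as other constructor arguments (names and paths are illustrative):

```yaml
cached_trips:
  type: CachedDataSet  # resolved from kedro.io
  copy_mode: assign    # hand the object over as-is; deepcopy and copy are the alternatives
  dataset:
    type: pandas.CSVDataSet
    filepath: data/01_raw/trips.csv
```

`assign` avoids copying altogether, which can matter for objects that are expensive or impossible to deep-copy.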
## New Datasets

| Type | Description | Location |
| ---- | ----------- | -------- |
| `ParquetDataSet` | Handles parquet datasets using Dask | `kedro.extras.datasets.dask` |
| `PickleDataSet` | Work with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle` |
| `CSVDataSet` | Work with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `TextDataSet` | Work with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `ExcelDataSet` | Work with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `HDFDataSet` | Work with HDF using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `YAMLDataSet` | Work with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml` |
| `MatplotlibWriter` | Save Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib` |
| `NetworkXDataSet` | Work with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx` |
| `BioSequenceDataSet` | Work with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` |
| `GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` |
| `FeatherDataSet` | Work with feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` |
## Files with a new location

| Type | New Location |
| ---- | ------------ |
| `JSONDataSet` | `kedro.extras.datasets.pandas` |
| `CSVBlobDataSet` | `kedro.extras.datasets.pandas` |
| `JSONBlobDataSet` | `kedro.extras.datasets.pandas` |
| `SQLTableDataSet` | `kedro.extras.datasets.pandas` |
| `SQLQueryDataSet` | `kedro.extras.datasets.pandas` |
| `SparkDataSet` | `kedro.extras.datasets.spark` |
| `SparkHiveDataSet` | `kedro.extras.datasets.spark` |
| `SparkJDBCDataSet` | `kedro.extras.datasets.spark` |
| `kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py` |
| `kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py` |
| `kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py` |
| `kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py` |
| `extras/ipython_loader.py` | `tools/ipython/ipython_loader.py` |
| `kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py` |
| `kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py` |
| `kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py` |
## Upcoming deprecations

| Category | Type |
| -------- | ---- |
| Datasets | `BioSequenceLocalDataSet` |
| | `CSVGCSDataSet` |
| | `CSVHTTPDataSet` |
| | `CSVLocalDataSet` |
| | `CSVS3DataSet` |
| | `ExcelLocalDataSet` |
| | `FeatherLocalDataSet` |
| | `JSONGCSDataSet` |
| | `JSONLocalDataSet` |
| | `HDFLocalDataSet` |
| | `HDFS3DataSet` |
| | `kedro.contrib.io.cached.CachedDataSet` |
| | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` |
| | `MatplotlibLocalWriter` |
| | `MatplotlibS3Writer` |
| | `NetworkXLocalDataSet` |
| | `ParquetGCSDataSet` |
| | `ParquetLocalDataSet` |
| | `ParquetS3DataSet` |
| | `PickleLocalDataSet` |
| | `PickleS3DataSet` |
| | `TextLocalDataSet` |
| | `YAMLLocalDataSet` |
| Decorators | `kedro.contrib.decorators.memory_profiler` |
| | `kedro.contrib.decorators.retry` |
| | `kedro.contrib.decorators.pyspark.spark_to_pandas` |
| | `kedro.contrib.decorators.pyspark.pandas_to_spark` |
| Transformers | `kedro.contrib.io.transformers.transformers` |
| Configuration Loaders | `kedro.contrib.config.TemplatedConfigLoader` |
## Bug fixes and other changes
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only (see the sketch after this list).
- Kedro CLI arguments `--node` and `--tag` support comma-separated values; alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if the version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved the error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed a bug that caused an empty project to fail unexpectedly with `ImportError` in `template/.../pipeline.py`.
- Fixed a bug related to saving a dataframe with categorical variables in table mode using `HDFS3DataSet`.
- Fixed a bug that caused unexpected behaviour when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped the minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for the `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function, which is called after a successful node function execution if the node contains a `confirms` argument with that dataset name.
- Made the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs`, to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter Notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template `requirements.txt` file.
- Removed the `results` and `references` folders from the project template.
- Updated the contribution process in `CONTRIBUTING.md`.
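As referenced in the first item above, a sketch of the YAML dict style for params (assuming the standard `run` section of a file passed to `kedro run --config`; keys and values are illustrative):

```yaml
run:
  params:
    learning_rate: 0.01  # previously only the string form "learning_rate:0.01,..." was accepted
    train:
      epochs: 10         # nested parameters can now be expressed naturally
```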
## Breaking changes to the API

- The existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
## Thanks for supporting contributions

Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez