Rename all mentions in docs of DataSet to Dataset (#3148)
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
3 people authored Oct 10, 2023
1 parent 2bc1cbc commit bb61b17
Showing 30 changed files with 293 additions and 294 deletions.
25 changes: 12 additions & 13 deletions docs/source/conf.py
@@ -112,16 +112,15 @@
"typing.Type",
"typing.Set",
"kedro.config.config.ConfigLoader",
"kedro.io.core.AbstractDataSet",
"kedro.io.core.AbstractVersionedDataSet",
"kedro.io.core.DataSetError",
"kedro.io.core.AbstractDataset",
"kedro.io.core.AbstractVersionedDataset",
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.memory_dataset.MemoryDataSet",
"kedro.io.partitioned_dataset.PartitionedDataSet",
"kedro.io.memory_dataset.MemoryDataset",
"kedro.io.partitioned_dataset.PartitionedDataset",
"kedro.pipeline.pipeline.Pipeline",
"kedro.runner.runner.AbstractRunner",
"kedro.runner.parallel_runner._SharedMemoryDataSet",
"kedro.runner.parallel_runner._SharedMemoryDataset",
"kedro.framework.context.context.KedroContext",
"kedro.framework.startup.ProjectMetadata",
@@ -136,7 +135,7 @@
"CONF_SOURCE",
"integer -- return number of occurrences of value",
"integer -- return first index of value.",
"kedro_datasets.pandas.json_dataset.JSONDataSet",
"kedro_datasets.pandas.json_dataset.JSONDataset",
"pluggy._manager.PluginManager",
"PluginManager",
"_DI",
@@ -165,7 +164,7 @@
"ValueError",
"BadConfigException",
"MissingConfigException",
"DataSetError",
"DatasetError",
"ImportError",
"KedroCliError",
"Exception",
@@ -347,16 +346,16 @@ def autolink_replacements(what: str) -> list[tuple[str, str, str]]:
is a reStructuredText link to their documentation.
For example, if the docstring reads:
This LambdaDataSet loads and saves ...
This LambdaDataset loads and saves ...
Then the word ``LambdaDataSet``, will be replaced by
:class:`~kedro.io.LambdaDataSet`
Then the word ``LambdaDataset``, will be replaced by
:class:`~kedro.io.LambdaDataset`
Works for plural as well, e.g:
These ``LambdaDataSet``s load and save
These ``LambdaDataset``s load and save
Will convert to:
These :class:`kedro.io.LambdaDataSet` load and save
These :class:`kedro.io.LambdaDataset` load and save
Args:
what: The objects to create replacement tuples for. Possible values
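
As a rough illustration of the substitution this docstring describes, here is a minimal sketch (not the actual `conf.py` implementation) of turning a plain mention into a reST cross-reference:

```python
# Hypothetical sketch: a plain mention of ``LambdaDataset`` in a docstring is
# rewritten into a reST :class: cross-reference, matching the example above.
docstring = "These ``LambdaDataset``s load and save"
print(docstring.replace("``LambdaDataset``s", ":class:`kedro.io.LambdaDataset`"))
# -> These :class:`kedro.io.LambdaDataset` load and save
```
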
22 changes: 11 additions & 11 deletions docs/source/configuration/advanced_configuration.md
@@ -60,8 +60,8 @@ bucket_name: "my_s3_bucket"
key_prefix: "my/key/prefix/"

datasets:
csv: "pandas.CSVDataSet"
spark: "spark.SparkDataSet"
csv: "pandas.CSVDataset"
spark: "spark.SparkDataset"

folders:
raw: "01_raw"
@@ -99,7 +99,7 @@ Alternatively, you can declare which values to fill in the template through a di
"bucket_name": "another_bucket_name",
"non_string_key": 10,
"key_prefix": "my/key/prefix",
"datasets": {"csv": "pandas.CSVDataSet", "spark": "spark.SparkDataSet"},
"datasets": {"csv": "pandas.CSVDataset", "spark": "spark.SparkDataset"},
"folders": {
"raw": "01_raw",
"int": "02_intermediate",
@@ -117,7 +117,7 @@ CONFIG_LOADER_ARGS = {
"bucket_name": "another_bucket_name",
"non_string_key": 10,
"key_prefix": "my/key/prefix",
"datasets": {"csv": "pandas.CSVDataSet", "spark": "spark.SparkDataSet"},
"datasets": {"csv": "pandas.CSVDataset", "spark": "spark.SparkDataset"},
"folders": {
"raw": "01_raw",
"int": "02_intermediate",
@@ -185,7 +185,7 @@ From version 0.17.0, `TemplatedConfigLoader` also supports the [Jinja2](https://
type: MemoryDataset
{{ speed }}-cars:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: s3://${bucket_name}/{{ speed }}-cars.csv
save_args:
index: true
@@ -205,13 +205,13 @@ The output Python dictionary will look as follows:
{
"fast-trains": {"type": "MemoryDataset"},
"fast-cars": {
"type": "pandas.CSVDataSet",
"type": "pandas.CSVDataset",
"filepath": "s3://my_s3_bucket/fast-cars.csv",
"save_args": {"index": True},
},
"slow-trains": {"type": "MemoryDataset"},
"slow-cars": {
"type": "pandas.CSVDataSet",
"type": "pandas.CSVDataset",
"filepath": "s3://my_s3_bucket/slow-cars.csv",
"save_args": {"index": True},
},
@@ -260,7 +260,7 @@ companies:
and a file containing the template values called `catalog_globals.yml`:
```yaml
_pandas:
type: pandas.CSVDataSet
type: pandas.CSVDataset
```

Since both of the file names (`catalog.yml` and `catalog_globals.yml`) match the config pattern for catalogs, the `OmegaConfigLoader` will load the files and resolve the placeholders correctly.
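
As a quick programmatic sketch, assuming the folded `companies` entry above references the template value as `${_pandas.type}` (that part of the diff is collapsed, so this is an assumption):

```python
from kedro.config import OmegaConfigLoader

# Sketch: load conf/base containing the catalog.yml and catalog_globals.yml shown
# above; the ${_pandas.type} placeholder resolves to "pandas.CSVDataset".
config_loader = OmegaConfigLoader(conf_source="conf")
catalog_conf = config_loader["catalog"]
print(catalog_conf["companies"]["type"])  # expected: pandas.CSVDataset
```
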
@@ -279,7 +279,7 @@ Suppose you have global variables located in the file `conf/base/globals.yml`:
```yaml
my_global_value: 45
dataset_type:
csv: pandas.CSVDataSet
csv: pandas.CSVDataset
```
You can access these global variables in your catalog or parameters config files with a `globals` resolver like this:
`conf/base/parameters.yml`:
@@ -318,7 +318,7 @@ kedro run --params random=3
You can also specify a default value to be used in case the runtime parameter is not specified with the `kedro run` command. Consider this catalog entry:
```yaml
companies:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: "${runtime_params:folder, 'data/01_raw'}/companies.csv"
```
If the `folder` parameter is not passed through the CLI `--params` option with `kedro run`, the default value `'data/01_raw/'` is used for the `filepath`.
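
A small sketch of how the default behaves when the config is loaded programmatically; the override value `data/00_external` is purely illustrative:

```python
from kedro.config import OmegaConfigLoader

# With a runtime override, ${runtime_params:folder, 'data/01_raw'} resolves to it...
loader = OmegaConfigLoader(conf_source="conf", runtime_params={"folder": "data/00_external"})
print(loader["catalog"]["companies"]["filepath"])  # data/00_external/companies.csv

# ...and without one, the declared default is used instead.
loader = OmegaConfigLoader(conf_source="conf")
print(loader["catalog"]["companies"]["filepath"])  # data/01_raw/companies.csv
```
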
@@ -366,7 +366,7 @@ types to the catalog entry.

```yaml
my_polars_dataset:
type: polars.CSVDataSet
type: polars.CSVDataset
filepath: data/01_raw/my_dataset.csv
load_args:
dtypes:
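
For context, such a resolver is typically registered in the project's `settings.py`; a minimal sketch, assuming `OmegaConfigLoader`'s `custom_resolvers` argument and that the folded `dtypes` entry references something like `${polars:Utf8}`:

```python
# settings.py (sketch): expose a `polars` resolver so catalog entries can pass
# non-primitive Polars dtypes, e.g. ${polars:Utf8}, through load_args.
import polars as pl

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {"polars": lambda attr: getattr(pl, attr)},
}
```
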
12 changes: 6 additions & 6 deletions docs/source/configuration/config_loader_migration.md
@@ -132,8 +132,8 @@ Suppose you are migrating a templated **catalog** file from using `TemplatedConf

- datasets:
+ _datasets:
csv: "pandas.CSVDataSet"
spark: "spark.SparkDataSet"
csv: "pandas.CSVDataset"
spark: "spark.SparkDataset"

```

@@ -175,8 +175,8 @@ bucket_name: "my_s3_bucket"
key_prefix: "my/key/prefix/"

datasets:
csv: "pandas.CSVDataSet"
spark: "spark.SparkDataSet"
csv: "pandas.CSVDataset"
spark: "spark.SparkDataset"

folders:
raw: "01_raw"
@@ -218,11 +218,11 @@ If you take the example from [the `TemplatedConfigLoader` with Jinja2 documentat
- {% for speed in ['fast', 'slow'] %}
- {{ speed }}-trains:
+ "{speed}-trains":
type: MemoryDataSet
type: MemoryDataset
- {{ speed }}-cars:
+ "{speed}-cars":
type: pandas.CSVDataSet
type: pandas.CSVDataset
- filepath: s3://${bucket_name}/{{ speed }}-cars.csv
+ filepath: s3://${bucket_name}/{speed}-cars.csv
save_args:
30 changes: 15 additions & 15 deletions docs/source/data/advanced_data_catalog_usage.md
@@ -11,29 +11,29 @@ In the following code, we use several pre-built data loaders documented in the [
```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import (
CSVDataSet,
SQLTableDataSet,
SQLQueryDataSet,
ParquetDataSet,
CSVDataset,
SQLTableDataset,
SQLQueryDataset,
ParquetDataset,
)

io = DataCatalog(
{
"bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
"cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
"cars_table": SQLTableDataSet(
"bikes": CSVDataset(filepath="../data/01_raw/bikes.csv"),
"cars": CSVDataset(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
"cars_table": SQLTableDataset(
table_name="cars", credentials=dict(con="sqlite:///kedro.db")
),
"scooters_query": SQLQueryDataSet(
"scooters_query": SQLQueryDataset(
sql="select * from cars where gear=4",
credentials=dict(con="sqlite:///kedro.db"),
),
"ranked": ParquetDataSet(filepath="ranked.parquet"),
"ranked": ParquetDataset(filepath="ranked.parquet"),
}
)
```
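
Once registered, entries are loaded and saved by name through the catalog; a brief usage sketch for the `io` object built above:

```python
# Usage sketch for the catalog defined above.
cars = io.load("cars")   # pandas DataFrame read from data/01_raw/cars.csv
io.save("ranked", cars)  # written to ranked.parquet via the ParquetDataset entry
```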

When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing a [SQLAlchemy-compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of the `credentials` argument. An alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only).
When using `SQLTableDataset` or `SQLQueryDataset` you must provide a `con` key containing a [SQLAlchemy-compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of the `credentials` argument. An alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataset` only).
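
A short sketch of that alternative, passing `con` through `load_args`/`save_args` rather than `credentials`:

```python
from kedro_datasets.pandas import SQLTableDataset

# Sketch: supply the SQLAlchemy connection string via load_args/save_args
# (SQLTableDataset only), as described above, instead of credentials.
cars_table = SQLTableDataset(
    table_name="cars",
    load_args={"con": "sqlite:///kedro.db"},
    save_args={"con": "sqlite:///kedro.db"},
)
```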

## How to view the available data sources

@@ -130,7 +130,7 @@ my_gcp_credentials:
Your code will look as follows:
```python
CSVDataSet(
CSVDataset(
filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
credentials=dict(key="token", secret="key"),
@@ -145,7 +145,7 @@ If you require programmatic control over load and save versions of a specific da

```python
from kedro.io import DataCatalog, Version
from kedro_datasets.pandas import CSVDataSet
from kedro_datasets.pandas import CSVDataset
import pandas as pd

data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
@@ -155,7 +155,7 @@ version = Version(
save=None, # generate save version automatically on each save operation
)

test_dataset = CSVDataSet(
test_dataset = CSVDataset(
filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_dataset": test_dataset})
@@ -179,7 +179,7 @@ version = Version(
save="my_exact_version", # save to exact version
)

test_dataset = CSVDataSet(
test_dataset = CSVDataset(
filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_dataset": test_dataset})
@@ -212,7 +212,7 @@ version = Version(
save="my_data_20230818.csv", # save to exact version
)

test_dataset = CSVDataSet(
test_dataset = CSVDataset(
filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_dataset": test_dataset})
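
A brief usage sketch for the versioned dataset registered above, reusing the `data1` frame from the earlier snippet:

```python
# Save and reload through the catalog; Kedro resolves the versioned file paths
# (here the exact versions pinned above) internally.
io.save("test_dataset", data1)
reloaded = io.load("test_dataset")
```
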
18 changes: 9 additions & 9 deletions docs/source/data/data_catalog.md
@@ -12,15 +12,15 @@ The example below registers two `csv` datasets, and an `xlsx` dataset. The minim

```yaml
companies:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/companies.csv

reviews:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/reviews.csv

shuttles:
type: pandas.ExcelDataSet
type: pandas.ExcelDataset
filepath: data/01_raw/shuttles.xlsx
load_args:
engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
@@ -63,7 +63,7 @@ For example, to load or save a CSV on a local file system, using specified load/

```yaml
cars:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/company/cars.csv
load_args:
sep: ','
@@ -116,7 +116,7 @@ and the Data Catalog is specified in `catalog.yml` as follows:

```yaml
motorbikes:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
credentials: dev_s3
load_args:
@@ -132,7 +132,7 @@ Kedro enables dataset and ML model versioning through the `versioned` definition

```yaml
cars:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/company/cars.csv
versioned: True
```
@@ -148,7 +148,7 @@ where `--load-version` is dataset name and version timestamp separated by `:`.

A dataset offers versioning support if it extends the [`AbstractVersionedDataset`](/kedro.io.AbstractVersionedDataset) class, accepts a `version` keyword argument in its constructor, and adapts its `_save` and `_load` methods to use the versioned data paths obtained from `_get_save_path` and `_get_load_path` respectively.

To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataset`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning.
To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataset`. For instance, if you encounter a class like `CSVDataset(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning.
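
A tiny sketch of that inheritance check, assuming `AbstractVersionedDataset` is importable from `kedro.io`:

```python
from kedro.io import AbstractVersionedDataset
from kedro_datasets.pandas import CSVDataset

# True means the dataset participates in Kedro's versioning machinery.
print(issubclass(CSVDataset, AbstractVersionedDataset))
```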

```{note}
Note that HTTP(S) is a supported file system in the dataset implementations, but if you use it, you can't also use versioning.
@@ -166,12 +166,12 @@ To illustrate this, consider the following catalog entry for a dataset named `ca
```yaml
cars:
filepath: s3://my_bucket/cars.csv
type: pandas.CSVDataSet
type: pandas.CSVDataset
```
You can overwrite this catalog entry in `conf/local/catalog.yml` to point to a locally stored file instead:
```yaml
cars:
filepath: data/01_raw/cars.csv
type: pandas.CSVDataSet
type: pandas.CSVDataset
```
In your pipeline code, when the `cars` dataset is used, it will use the overwritten catalog entry from `conf/local/catalog.yml`, relying on Kedro to detect which definition of the `cars` dataset to use in your pipeline.
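
A minimal sketch of how the two environments are layered when the config is loaded, assuming the default `base`/`local` setup:

```python
from kedro.config import OmegaConfigLoader

# conf/local/catalog.yml overrides conf/base/catalog.yml for the `cars` entry.
loader = OmegaConfigLoader(conf_source="conf", base_env="base", default_run_env="local")
print(loader["catalog"]["cars"]["filepath"])  # data/01_raw/cars.csv (local wins)
```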