Add an example of `catalog.list(<regex>)` and replace `io` with `catalog` in docs #3924

Merged · 8 commits · Jun 12, 2024
Changes from 4 commits
40 changes: 20 additions & 20 deletions docs/source/data/advanced_data_catalog_usage.md
@@ -22,7 +22,7 @@ from kedro_datasets.pandas import (
ParquetDataset,
)

io = DataCatalog(
catalog = DataCatalog(
{
"bikes": CSVDataset(filepath="../data/01_raw/bikes.csv"),
"cars": CSVDataset(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
@@ -45,15 +45,15 @@ When using `SQLTableDataset` or `SQLQueryDataset` you must provide a `con` key containing a SQLAlchemy-compatible database connection string.
To review the `DataCatalog`:

```python
io.list()
catalog.list()
```

## How to load datasets programmatically

To access each dataset by its name:

```python
cars = io.load("cars") # data is now loaded as a DataFrame in 'cars'
cars = catalog.load("cars") # data is now loaded as a DataFrame in 'cars'
gear = cars["gear"].values
```

@@ -78,9 +78,9 @@ To save data using an API similar to that used to load data:
from kedro.io import MemoryDataset

memory = MemoryDataset(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")
io.load("cars_cache")
catalog.add("cars_cache", memory)
catalog.save("cars_cache", "Memory can store anything.")
catalog.load("cars_cache")
```
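The `add`/`save`/`load` round trip above can be illustrated with a stdlib-only sketch. `SimpleMemoryDataset` and `MiniCatalog` below are hypothetical stand-ins for Kedro's `MemoryDataset` and `DataCatalog`, not their actual implementations:

```python
# Hypothetical sketch of the add/save/load flow shown above.
# SimpleMemoryDataset and MiniCatalog are illustrative, not Kedro classes.

class SimpleMemoryDataset:
    """Holds a single Python object in memory."""

    def __init__(self, data=None):
        self._data = data

    def save(self, data):
        self._data = data

    def load(self):
        return self._data


class MiniCatalog:
    """Maps dataset names to dataset objects and delegates save/load."""

    def __init__(self):
        self._datasets = {}

    def add(self, name, dataset):
        self._datasets[name] = dataset

    def save(self, name, data):
        self._datasets[name].save(data)

    def load(self, name):
        return self._datasets[name].load()


catalog = MiniCatalog()
catalog.add("cars_cache", SimpleMemoryDataset())
catalog.save("cars_cache", "Memory can store anything.")
print(catalog.load("cars_cache"))  # Memory can store anything.
```

The point of the sketch is only that a memory dataset stores whatever object it is given and returns it unchanged on load.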

### How to save data to a SQL database for querying
@@ -96,18 +96,18 @@ try:
except FileNotFoundError:
pass

io.save("cars_table", cars)
catalog.save("cars_table", cars)

# rank scooters by their mpg
ranked = io.load("scooters_query")[["brand", "mpg"]]
ranked = catalog.load("scooters_query")[["brand", "mpg"]]
```

### How to save data in Parquet

To save the processed data in Parquet format:

```python
io.save("ranked", ranked)
catalog.save("ranked", ranked)
```

```{warning}
@@ -163,15 +163,15 @@ version = Version(
test_dataset = CSVDataset(
filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_dataset": test_dataset})
catalog = DataCatalog({"test_dataset": test_dataset})

# save the dataset to data/01_raw/test.csv/<version>/test.csv
io.save("test_dataset", data1)
catalog.save("test_dataset", data1)
# save the dataset into a new file data/01_raw/test.csv/<version>/test.csv
io.save("test_dataset", data2)
catalog.save("test_dataset", data2)

# load the latest version from data/test.csv/*/test.csv
reloaded = io.load("test_dataset")
reloaded = catalog.load("test_dataset")
assert data2.equals(reloaded)
```
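Because these version strings are ISO-8601-style timestamps, the "latest" version can be found by plain lexicographic comparison of the version directory names. A minimal sketch of that idea (not Kedro's internal resolution logic):

```python
# Hypothetical sketch: resolve the "latest" version from timestamped
# directory names such as data/01_raw/test.csv/<version>/test.csv.
# ISO-8601-style timestamps sort correctly as plain strings.

versions = [
    "2024-06-05T12.00.00.000Z",
    "2024-06-05T15.08.09.255Z",
    "2024-06-04T09.30.00.000Z",
]

latest = max(versions)  # lexicographic max == most recent timestamp
print(latest)  # 2024-06-05T15.08.09.255Z
```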

@@ -187,17 +187,17 @@ version = Version(
test_dataset = CSVDataset(
filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_dataset": test_dataset})
catalog = DataCatalog({"test_dataset": test_dataset})

# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv
io.save("test_dataset", data1)
catalog.save("test_dataset", data1)
# load from data/01_raw/test.csv/my_exact_version/test.csv
reloaded = io.load("test_dataset")
reloaded = catalog.load("test_dataset")
assert data1.equals(reloaded)

# raises DatasetError since the path
# data/01_raw/test.csv/my_exact_version/test.csv already exists
io.save("test_dataset", data2)
catalog.save("test_dataset", data2)
```
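The overwrite protection can be sketched with `pathlib` alone. The `save_exact_version` helper below is hypothetical, with `FileExistsError` standing in for Kedro's `DatasetError`:

```python
import tempfile
from pathlib import Path

# Hypothetical sketch of exact-version save semantics: refuse to
# overwrite <root>/test.csv/<version>/test.csv once it exists.

def save_exact_version(root: Path, version: str, data: str) -> Path:
    target = root / "test.csv" / version / "test.csv"
    if target.exists():
        # Kedro raises DatasetError here; FileExistsError stands in for it.
        raise FileExistsError(f"Save path '{target}' already exists.")
    target.parent.mkdir(parents=True)
    target.write_text(data)
    return target


root = Path(tempfile.mkdtemp())
save_exact_version(root, "my_exact_version", "col\n1\n")
try:
    save_exact_version(root, "my_exact_version", "col\n2\n")
except FileExistsError as err:
    print(err)  # second save to the same exact version is rejected
```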

We do not recommend passing exact load or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning`.
@@ -220,11 +220,11 @@ version = Version(
test_dataset = CSVDataset(
filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_dataset": test_dataset})
catalog = DataCatalog({"test_dataset": test_dataset})

io.save("test_dataset", data1) # emits a UserWarning due to version inconsistency
catalog.save("test_dataset", data1) # emits a UserWarning due to version inconsistency

# raises DatasetError since the data/01_raw/test.csv/exact_load_version/test.csv
# file does not exist
reloaded = io.load("test_dataset")
reloaded = catalog.load("test_dataset")
```
122 changes: 88 additions & 34 deletions docs/source/notebooks_and_ipython/kedro_and_notebooks.md
@@ -6,13 +6,20 @@

## Example project

The example adds a notebook to experiment with the retired [`pandas-iris` starter](https://github.com/kedro-org/kedro-starters/tree/main/pandas-iris). As an alternative, you can follow the example using a different starter, such as [`spaceflights-pandas`](https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pandas) or just add a notebook to your own project.
The example adds a notebook to experiment with the [`spaceflights-pandas-viz` starter](https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pandas-viz). As an alternative, you can follow the example using a different starter or add a notebook to your own project.


We will assume the example project is called `iris`, but you can call it whatever you choose.
We will assume the example project is called `spaceflights`, but you can call it whatever you choose.

To create a project, you can run this command:
```bash
kedro new -n spaceflights --tools=viz --example=yes
```

You can find more options for `kedro new` in [Create a new Kedro project](../get_started/new_project.md).

## Loading the project with `kedro jupyter notebook`

Navigate to the project directory (`cd iris`) and issue the following command in the terminal to launch Jupyter:
Navigate to the project directory (`cd spaceflights`) and issue the following command in the terminal to launch Jupyter:

```bash
kedro jupyter notebook
@@ -85,36 +85,56 @@
When you run the cell:

```ipython
['example_iris_data',
'parameters',
'params:example_test_data_ratio',
'params:example_num_train_iter',
'params:example_learning_rate'
[
'companies',
'reviews',
'shuttles',
'preprocessed_companies',
'preprocessed_shuttles',
'model_input_table',
'regressor',
'metrics',
'companies_columns',
'shuttle_passenger_capacity_plot_exp',
'shuttle_passenger_capacity_plot_go',
'dummy_confusion_matrix',
'parameters',
'params:model_options',
'params:model_options.test_size',
'params:model_options.random_state',
'params:model_options.features'
]
```

#### Search datasets with regex
If you do not remember the exact name of a dataset, you can provide a regular expression to search for matching datasets.
```ipython
catalog.list("pre*")
```

When you run the cell:

```ipython
['preprocessed_companies', 'preprocessed_shuttles']
```
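The filter above is ordinary regular-expression matching against dataset names — note that `"pre*"` is a regex (`pr` followed by zero or more `e`s), not a glob. A stdlib sketch of the same search (illustrative only, not a guarantee about Kedro's internals):

```python
import re

# Hypothetical sketch of regex-based filtering of dataset names.
names = [
    "companies",
    "reviews",
    "shuttles",
    "preprocessed_companies",
    "preprocessed_shuttles",
]

pattern = re.compile("pre*")
matches = [name for name in names if pattern.search(name)]
print(matches)  # ['preprocessed_companies', 'preprocessed_shuttles']
```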
Next, try the following for `catalog.load`:

```ipython
catalog.load("example_iris_data")
catalog.load("reviews")
```

The output:

```ipython
INFO Loading data from 'example_iris_data' (CSVDataset)...

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[06/05/24 12:50:17] INFO Loading data from 'reviews' (CSVDataset)...
Out[1]:

shuttle_id review_scores_rating review_scores_comfort ... review_scores_price number_of_reviews reviews_per_month
0 45163 91.0 10.0 ... 9.0 26 0.77
1 49438 96.0 10.0 ... 9.0 61 0.62
2 10750 97.0 10.0 ... 10.0 467 4.66
3 4146 95.0 10.0 ... 9.0 318 3.22

```

Now try the following:
@@ -127,13 +127,26 @@
```ipython
INFO Loading data from 'parameters' (MemoryDataset)...

{'example_test_data_ratio': 0.2,
'example_num_train_iter': 10000,
'example_learning_rate': 0.01}
{
'model_options': {
'test_size': 0.2,
'random_state': 3,
'features': [
'engines',
'passenger_capacity',
'crew',
'd_check_complete',
'moon_clearance_complete',
'iata_approved',
'company_rating',
'review_scores_rating'
]
}
}
```
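The `params:model_options.test_size`-style entries seen earlier in `catalog.list()` correspond to dotted flattenings of this nested dictionary. A sketch of that flattening with a hypothetical `flatten_params` helper (not Kedro's implementation):

```python
# Hypothetical sketch: flatten nested parameters into the
# "params:<dotted.path>" names that appear in catalog.list().

def flatten_params(params, prefix="params:"):
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        flat[name] = value
        if isinstance(value, dict):
            # Recurse so nested keys get dotted names.
            flat.update(flatten_params(value, prefix=f"{name}."))
    return flat


parameters = {"model_options": {"test_size": 0.2, "random_state": 3}}
flat = flatten_params(parameters)
print(sorted(flat))
# ['params:model_options', 'params:model_options.random_state', 'params:model_options.test_size']
```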

```{note}
If you enable [versioning](../data/data_catalog.md#dataset-versioning) you can load a particular version of a dataset, e.g. `catalog.load("example_train_x", version="2021-12-13T15.08.09.255Z")`.
If you enable [versioning](../data/data_catalog.md#dataset-versioning) you can load a particular version of a dataset, e.g. `catalog.load("preprocessed_shuttles", version="2024-06-05T15.08.09.255Z")`.
```

### `context`
@@ -146,7 +186,7 @@
You should see output like this, depending on your username and path:

```ipython
PosixPath('/Users/username/kedro_projects/iris')
PosixPath('/Users/username/kedro_projects/spaceflights')
```

You can find out more in the API documentation of {py:class}`~kedro.framework.context.KedroContext`.
@@ -163,10 +203,10 @@

```ipython
{'__default__': Pipeline([
Node(split_data, ['example_iris_data', 'parameters'], ['X_train', 'X_test', 'y_train', 'y_test'], 'split'),
Node(make_predictions, ['X_train', 'X_test', 'y_train'], 'y_pred', 'make_predictions'),
Node(report_accuracy, ['y_pred', 'y_test'], None, 'report_accuracy')
])}
Node(create_confusion_matrix, 'companies', 'dummy_confusion_matrix', None),
Node(preprocess_companies, 'companies', ['preprocessed_companies', 'companies_columns'], 'preprocess_companies_node'),
Node(preprocess_shuttles, 'shuttles', 'preprocessed_shuttles', 'preprocess_shuttles_node'),
...
```

You can use this to explore your pipelines and the nodes they contain:
@@ -177,7 +217,21 @@
Should give the output:

```ipython
{'y_pred', 'X_test', 'y_train', 'X_train', 'y_test'}
{
'X_train',
'regressor',
'shuttle_passenger_capacity_plot_exp',
'y_test',
'model_input_table',
'y_train',
'X_test',
'metrics',
'companies_columns',
'preprocessed_shuttles',
'preprocessed_companies',
'shuttle_passenger_capacity_plot_go',
'dummy_confusion_matrix'
}
```

### `session`
@@ -368,7 +422,7 @@
kedro jupyter lab
```

You can use any other Jupyter client to connect to a Kedro project kernel such as the [Qt Console](https://qtconsole.readthedocs.io/), which can be launched using the `kedro_iris` kernel as follows:
You can use any other Jupyter client to connect to a Kedro project kernel, such as the [Qt Console](https://qtconsole.readthedocs.io/), which can be launched using the `kedro_spaceflights` kernel as follows:

```bash
jupyter qtconsole --kernel=kedro_spaceflights