Docs/re structure config docs #2421

Merged
merged 31 commits into from
Mar 28, 2023
Commits
98b780d
Add docs to explain registering a custom resolver
merelcht Mar 13, 2023
fedc118
Fix incorrect docs + add example for custom envs|
merelcht Mar 13, 2023
4e87359
Small clarifications
merelcht Mar 13, 2023
83eb348
Move kedro run config docs to running a pipeline
merelcht Mar 13, 2023
3962093
Merge branch 'main' into docs/improve-config-docs
merelcht Mar 13, 2023
d9d5601
Fix lint
merelcht Mar 13, 2023
2b26109
Re-order config paragraphs
merelcht Mar 13, 2023
bbd1009
Small simplifications
merelcht Mar 14, 2023
a8e8245
Add extra header
merelcht Mar 14, 2023
19cd27b
rename temp
merelcht Mar 14, 2023
b9fa2b9
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 14, 2023
af97bcc
Improve headers
merelcht Mar 14, 2023
557a919
clean up
merelcht Mar 14, 2023
7afca13
Divide config into basic and advanced + restructure basic content bet…
merelcht Mar 15, 2023
a18889d
Make config basics sections more complete + placeholder pages for cre…
merelcht Mar 16, 2023
b16d1d1
Convert explanations to how-tos advanced config
merelcht Mar 16, 2023
b686ec2
Re-arrange advanced topics
merelcht Mar 17, 2023
b26ffe2
Minor tweaks
merelcht Mar 17, 2023
917f54e
Add parameters and credentials pages
stichbury Mar 21, 2023
80809cc
Update basic config page
stichbury Mar 21, 2023
0a59f39
Another chunk of config docs improvements
stichbury Mar 21, 2023
775f7af
Merge branch 'main' into docs/re-structure-config-docs
stichbury Mar 21, 2023
ce4f5f1
Link to advanced how to's on basics and credentials pages
merelcht Mar 22, 2023
ed00281
Fix lint
merelcht Mar 22, 2023
1914132
Address review comments
merelcht Mar 23, 2023
617f96b
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 23, 2023
ee5952a
Fix typo in TemplatedConfigLoader sections
merelcht Mar 23, 2023
7155ac8
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 27, 2023
f994e28
Add links and update links to config pages
merelcht Mar 27, 2023
ce60f55
Update release notes
merelcht Mar 28, 2023
ed197d8
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 28, 2023
Make config basics sections more complete + placeholder pages for creds and params

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
merelcht committed Mar 16, 2023
commit a18889d80464b7ed377749fea679673459b4b880
120 changes: 34 additions & 86 deletions docs/source/configuration/advanced_configuration.md
@@ -1,3 +1,6 @@
# Advanced configuration
...

### Configuration patterns

This logic is specified by `config_patterns` in the configuration loader classes. By default those patterns are set as follows for the configuration of catalog, parameters, logging and credentials:
@@ -11,92 +14,6 @@ config_patterns = {
}
```

The configuration patterns can be changed by setting the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](settings.md). You can change the default patterns as well as add additional ones, for example, for Spark configuration files.
This example shows how to load `parameters` if your files are using a `params` naming convention instead of `parameters` and how to add patterns to load Spark configuration:

```python
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*/"],
        "parameters": ["params*", "params*/**", "**/params*"],
    }
}
```

You can also bypass the configuration patterns and set configuration directly on the instance of a config loader class. You can bypass the default configuration (catalog, parameters, credentials, and logging) as well as additional configuration.

```python
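from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.project import settings

# Assumption: this runs from the project root; point `project_path` elsewhere if not.
project_path = Path.cwd()
conf_path = str(project_path / settings.CONF_SOURCE)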
conf_loader = ConfigLoader(conf_source=conf_path)

# Bypass configuration patterns by setting the key and values directly on the config loader instance.
conf_loader["catalog"] = {"catalog_config": "something_new"}
```

Configuration information from files stored in `base` or `local` that match these rules is merged at runtime and returned as a config dictionary:

* If any two configuration files located inside the same environment path (`conf/base/` or `conf/local/` in this example) contain the same top-level key, `load_config` will raise a `ValueError` indicating that duplicates are not allowed.

* If two configuration files have duplicate top-level keys but are in different environment paths (one in `conf/base/`, another in `conf/local/`, for example), then the last loaded path (`conf/local/` in this case) takes precedence and overrides that key value. `ConfigLoader.get` will not raise any errors; however, a `DEBUG` level log message will be emitted with information on the overridden keys.

When using the default `ConfigLoader` or the `TemplatedConfigLoader`, any top-level keys that start with `_` are considered hidden (or reserved) and are ignored after the config is loaded. Those keys will neither trigger a key duplication error nor appear in the resulting configuration dictionary. However, you can still use such keys, for example, as [YAML anchors and aliases](https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet#anchors).

### Additional configuration environments

In addition to the two built-in local and base configuration environments, you can create your own. Your project loads `conf/base/` as the bottom-level configuration environment but allows you to overwrite it with any other environments that you create, such as `conf/server/` or `conf/test/`. To use additional configuration environments, run the following command:

```bash
kedro run --env=<your-environment>
```

If no `env` option is specified, this will default to using the `local` environment to overwrite `conf/base`.

If you set the `KEDRO_ENV` environment variable to the name of your environment, Kedro will load that environment for your `kedro run`, `kedro ipython`, `kedro jupyter notebook` and `kedro jupyter lab` sessions:

```bash
export KEDRO_ENV=<your-environment>
```

```{note}
If you both specify the `KEDRO_ENV` environment variable and provide the `--env` argument to a CLI command, the CLI argument takes precedence.
```


#### Using only one configuration environment

If, for some reason, your project does not have any other environments apart from `base`, i.e. no `local` environment to default to, you must customise the configuration loader you're using to take `env="base"` in the constructor and then specify your custom config loader subclass in `src/<package_name>/settings.py` under the `CONFIG_LOADER_CLASS` key.
Below is an example of such a custom class. If you're using the `TemplatedConfigLoader` or the `OmegaConfigLoader` you need to use either of those as the class you are subclassing.

```python
# src/<package_name>/custom_config.py

from kedro.config import ConfigLoader
from typing import Any, Dict


class CustomConfigLoader(ConfigLoader):
    def __init__(
        self,
        conf_source: str,
        env: str = None,
        runtime_params: Dict[str, Any] = None,
    ):
        # Always load `base` as the run environment, ignoring the `env` argument.
        super().__init__(conf_source=conf_source, env="base", runtime_params=runtime_params)
```

Then import your `CustomConfigLoader` in `settings.py`:

```python
# settings.py
from package_name.custom_config import CustomConfigLoader

CONFIG_LOADER_CLASS = CustomConfigLoader
```


## Specify the configuration loader class

By default, Kedro is set up to use the [ConfigLoader](/kedro.config.ConfigLoader) class. Kedro also provides two additional configuration loaders with more advanced functionality: the [TemplatedConfigLoader](/kedro.config.TemplatedConfigLoader) and the [OmegaConfigLoader](/kedro.config.OmegaConfigLoader).
@@ -342,3 +259,34 @@ dev_s3:
```{note}
Note that you can only use the resolver in `credentials.yml` and not in catalog or parameter files. This is because we do not encourage the usage of environment variables for anything other than credentials.
```

## Advanced configuration how-tos

### How to change what configuration files are loaded?

### How to make sure non-default configuration files get loaded?
The configuration patterns can be changed by setting the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](settings.md). You can change the default patterns as well as add additional ones, for example, for Spark configuration files.
This example shows how to load `parameters` if your files are using a `params` naming convention instead of `parameters` and how to add patterns to load Spark configuration:

```python
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*/"],
        "parameters": ["params*", "params*/**", "**/params*"],
    }
}
```
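
As a rough illustration of what this setting does, the sketch below assumes that the custom patterns are laid over the defaults key by key. The two default entries are an assumption that mirrors the `<name>*`, `<name>*/**`, `**/<name>*` shape of the custom `parameters` entry, and `effective_patterns` is a hypothetical name rather than a Kedro attribute.

```python
# Assumed defaults for two of the built-in configuration keys (illustrative only).
default_patterns = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
}

# The custom patterns from the settings.py example.
custom_patterns = {
    "spark": ["spark*/"],
    "parameters": ["params*", "params*/**", "**/params*"],
}

# Assumption: custom entries replace the default entry with the same key
# and add any keys that do not exist yet.
effective_patterns = {**default_patterns, **custom_patterns}

for key, patterns in effective_patterns.items():
    print(key, patterns)
```

Under that assumption, files named `parameters*.yml` would no longer match the `parameters` key once it is overridden with `params*` patterns, so it is worth checking which naming convention your project uses before changing it.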

### How to bypass the configuration loading rules?
You can also bypass the configuration patterns and set configuration directly on the instance of a config loader class. You can bypass the default configuration (catalog, parameters, credentials, and logging) as well as additional configuration.

```python
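from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.project import settings

# Assumption: this runs from the project root; point `project_path` elsewhere if not.
project_path = Path.cwd()
conf_path = str(project_path / settings.CONF_SOURCE)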
conf_loader = ConfigLoader(conf_source=conf_path)

# Bypass configuration patterns by setting the key and values directly on the config loader instance.
conf_loader["catalog"] = {"catalog_config": "something_new"}
```
74 changes: 57 additions & 17 deletions docs/source/configuration/configuration_basics.md
@@ -1,35 +1,40 @@
# Configuration
# Configuration basics

This section contains detailed information about Kedro project configuration. Project configuration is the configuration inside the [`/conf`](../get_started/kedro_concepts.md#conf) directory of your Kedro project.
By default, the files stored in this directory allow you to configure [parameters](configuration.md#parameters), [credentials](configuration.md#credentials), the [data catalog](../data/data_catalog.md), and [logging](../logging/logging.md).

Kedro makes use of a configuration loader to load any project configuration files. The available configuration loader classes are: [`ConfigLoader`](/kedro.config.ConfigLoader), [`TemplatedConfigLoader`](/kedro.config.TemplatedConfigLoader), and [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader).
By default, Kedro uses the `ConfigLoader`, for which the relevant API documentation can be found in [kedro.config.ConfigLoader](/kedro.config.ConfigLoader). In the following sections and examples, you can assume the default `ConfigLoader` is used, unless otherwise specified.

## Configuration source
The configuration source is the source folder where the Kedro project configuration is stored. We recommend that you keep all configuration files in the default `conf` directory of a Kedro project.

## Configuration basics
## Configuration environments
A configuration environment is a way of organising your configuration settings for different stages of your data pipeline. For example, you might have different settings for development, testing, and production environments.
By default, Kedro has a `local` and `base` environment.

### Configuration source
The configuration source is the source folder where the Kedro project configuration is stored. We recommend that you keep all configuration files in the default `conf` directory of a Kedro project.

### Configuration environments
...

#### Local
The `local` folder should be used for configuration that is either user-specific (e.g. IDE configuration) or protected (e.g. security keys).
### Local
The `local` configuration environment folder should be used for configuration that is either user-specific (e.g. IDE configuration) or protected (e.g. security keys).

```{note}
Please do not check in any local configuration to version control.
```

#### Base
The `base` folder is for shared configuration, such as non-sensitive and project-related configuration that may be shared across team members.
### Base
In Kedro, the base configuration environment refers to the default configuration settings that are used as the foundation for all other configuration environments in your pipeline.
This directory contains the default settings that are used across all environments in your pipeline, unless they are overridden by a specific environment.

```{warning}
```{note}
Do not put access credentials in the base configuration folder or any other configuration environment directory that is set up with version control.
```

### Configuration loading
Configuration information from files stored in `base` or `local` that match these rules is merged at runtime and returned as a config dictionary:
* If any two configuration files located inside the same environment path (`conf/base/` or `conf/local/` in the default setup) contain the same top-level key, the configuration loader will raise a `ValueError` indicating that duplicates are not allowed.
* If two configuration files have duplicate top-level keys but are in different environment paths (one in `conf/base/`, another in `conf/local/`, for example), then the last loaded path (`conf/local/` in this case) takes precedence and overrides that key value. `ConfigLoader.get` will not raise any errors; however, a `DEBUG` level log message will be emitted with information on the overridden keys.

When using the default `ConfigLoader` or the `TemplatedConfigLoader`, any top-level keys that start with `_` are considered hidden (or reserved) and are ignored after the config is loaded. Those keys will neither trigger a key duplication error nor appear in the resulting configuration dictionary. However, you can still use such keys, for example, as [YAML anchors and aliases](https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet#anchors).
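
The merge rules above can be sketched in plain Python. This is not Kedro's implementation: `merge_environment` and `merge_environments` are hypothetical helpers, and each file is represented as an already-parsed dictionary.

```python
from typing import Any, Dict, List


def merge_environment(files: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge the files of ONE environment, rejecting duplicate top-level keys."""
    merged: Dict[str, Any] = {}
    for content in files:
        # Keys starting with "_" are hidden: they never clash and are dropped.
        visible = {k: v for k, v in content.items() if not k.startswith("_")}
        duplicates = merged.keys() & visible.keys()
        if duplicates:
            raise ValueError(f"Duplicate keys found: {sorted(duplicates)}")
        merged.update(visible)
    return merged


def merge_environments(
    base_files: List[Dict[str, Any]], run_env_files: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Overlay the run environment (e.g. local) on top of base."""
    config = merge_environment(base_files)
    # The environment loaded last overrides overlapping top-level keys.
    config.update(merge_environment(run_env_files))
    return config


base = [
    {"_defaults": {"sep": ","}, "cars": {"type": "csv"}},
    {"boats": {"type": "csv"}},
]
local = [{"cars": {"type": "parquet"}}]

config = merge_environments(base, local)
print(config)
```

Here `cars` takes its value from `local`, `boats` comes from `base`, and the hidden `_defaults` key does not appear in the result. Two files in the same environment that both define `cars` would raise a `ValueError` instead.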

## Configuration loading
Kedro-specific configuration (e.g., `DataCatalog` configuration for IO) is loaded using a configuration loader class, by default, the `ConfigLoader` class.
When you interact with Kedro through the command line, e.g. by running `kedro run`, Kedro will load all project configuration in the configuration source through this configuration loader.

@@ -42,7 +47,7 @@ Files will be matched according to file name and type rules. Suppose the config
* *And* file extension is one of the following: `yaml`, `yml`, `json`, `ini`, `pickle`, `xml` or `properties` for the `ConfigLoader` and `TemplatedConfigLoader` or `yaml`, `yml`, or `json` for the `OmegaConfigLoader`.
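
To make the combination of a name rule and an extension rule concrete, here is a minimal sketch. It uses `fnmatch` from the standard library as a stand-in for the loader's own matching, so it is an approximation: `fnmatch` lets `*` cross directory separators, which real glob matching does not. The helper `is_candidate` is hypothetical, and the extension set is the one listed above for the `OmegaConfigLoader`.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import List

# Extensions listed above for the OmegaConfigLoader.
OMEGACONF_EXTENSIONS = {".yaml", ".yml", ".json"}


def is_candidate(relative_path: str, patterns: List[str]) -> bool:
    """Return True if the path has an allowed extension AND matches a pattern."""
    if PurePosixPath(relative_path).suffix not in OMEGACONF_EXTENSIONS:
        return False
    return any(fnmatch(relative_path, pattern) for pattern in patterns)


# Patterns in the style used for the `params` naming convention.
patterns = ["params*", "params*/**", "**/params*"]

print(is_candidate("params_model.yml", patterns))    # True: name starts with params
print(is_candidate("params/model.json", patterns))   # True: inside a params folder
print(is_candidate("nested/params.yaml", patterns))  # True: params file in a subfolder
print(is_candidate("params.ini", patterns))          # False: extension not allowed here
print(is_candidate("catalog.yml", patterns))         # False: name does not match
```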


## Basic configuration how-tos
## Basic configuration how-tos

### How to change the configuration source folder?
If you prefer to store the Kedro project configuration in a directory other than the `conf` directory inside your project, you can change the configuration source by setting the `CONF_SOURCE` variable in [`src/<package_name>/settings.py`](settings.md) as follows:
@@ -92,16 +97,51 @@ Note that for both the `tar.gz` and `zip` file the following structure is expected
└── README.md <-- optional but included with the default Kedro conf structure.
```

#### How to directly access configuration for e.g. debugging?
### How to directly access configuration for e.g. debugging?
If you want to directly access configuration in code, for example to debug, you can do so as follows:

```python
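from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.project import settings

# Assumption: this runs from the project root; point `project_path` elsewhere if not.
project_path = Path.cwd()

# Instantiate a ConfigLoader with the location of your project configuration.
conf_path = str(project_path / settings.CONF_SOURCE)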
conf_loader = ConfigLoader(conf_source=conf_path)

# This example shows how to access the catalog configuration.
# This line shows how to access the catalog configuration. You can access other configuration in the same way.
conf_catalog = conf_loader["catalog"]
```

### How to use configuration environments in addition to base and local?
In addition to the two built-in local and base configuration environments, you can create your own. Your project loads `conf/base/` as the bottom-level configuration environment but allows you to overwrite it with any other environments that you create, such as `conf/server/` or `conf/test/`. To use additional configuration environments, run the following command:

```bash
kedro run --env=<your-environment>
```

If no `env` option is specified, this will default to using the `local` environment to overwrite `conf/base`.

If you set the `KEDRO_ENV` environment variable to the name of your environment, Kedro will load that environment for your `kedro run`, `kedro ipython`, `kedro jupyter notebook` and `kedro jupyter lab` sessions:

```bash
export KEDRO_ENV=<your-environment>
```

```{note}
If you both specify the `KEDRO_ENV` environment variable and provide the `--env` argument to a CLI command, the CLI argument takes precedence.
```
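
The precedence described above (CLI argument first, then `KEDRO_ENV`, then the `local` default) can be summarised in a short sketch. `resolve_run_env` is a hypothetical helper written for illustration, not a Kedro function.

```python
import os
from typing import Mapping, Optional


def resolve_run_env(
    cli_env: Optional[str],
    environ: Mapping[str, str],
    default_env: str = "local",
) -> str:
    """Pick the overriding environment: --env, then KEDRO_ENV, then the default."""
    if cli_env:
        # An explicit `--env` argument always wins.
        return cli_env
    # Fall back to the environment variable, then to the default.
    return environ.get("KEDRO_ENV") or default_env


print(resolve_run_env("test", {"KEDRO_ENV": "server"}))  # test
print(resolve_run_env(None, {"KEDRO_ENV": "server"}))    # server
print(resolve_run_env(None, {}))                         # local

# With the real process environment:
print(resolve_run_env(None, os.environ))
```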

### How to change the default overriding environment?
By default, `local` is the overriding environment for `base`. To change this, set `default_run_env` under the `CONFIG_LOADER_ARGS` key in `src/<package_name>/settings.py`.
For example, if you want to always override `base` configuration with configuration in a custom environment called `prod`, you change the configuration loader arguments like this:

```python
CONFIG_LOADER_ARGS = {"default_run_env": "prod"}
```

#### How to use only one configuration environment?
If your project has no environment other than `base`, that is, no `local` environment to default to, set `"default_run_env": "base"` under the `CONFIG_LOADER_ARGS` key in `src/<package_name>/settings.py`.

```python
CONFIG_LOADER_ARGS = {"default_run_env": "base"}
```
47 changes: 47 additions & 0 deletions docs/source/configuration/credentials.md
@@ -0,0 +1,47 @@
## Credentials

For security reasons, we strongly recommend that you do *not* commit any credentials or other secrets to version control. By default, any file inside the `conf/` folder (and its subfolders) that contains `credentials` in its name is ignored via `.gitignore` and is not committed to your git repository.

### Load credentials
Credentials configuration can be loaded the same way as any other project configuration using any of the configuration loader classes: `ConfigLoader`, `TemplatedConfigLoader`, and `OmegaConfigLoader`.
The following examples will all make use of the default `ConfigLoader` class.

```python
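from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.project import settings

# Assumption: this runs from the project root; point `project_path` elsewhere if not.
project_path = Path.cwd()
conf_path = str(project_path / settings.CONF_SOURCE)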
conf_loader = ConfigLoader(conf_source=conf_path, env="local")
credentials = conf_loader["credentials"]
```

This will load configuration files from `conf/base` and `conf/local` whose filenames start with `credentials`, or that are located inside a folder with a name that starts with `credentials`.

```{note}
Since `local` is set as the environment, the configuration path `conf/local` takes precedence in the example above. Hence, any overlapping top-level keys from `conf/base` will be overwritten by the ones from `conf/local`.
```

Calling `conf_loader[key]` in the example above raises a `MissingConfigException` if no configuration files match the given key. If this is a valid workflow for your application, you can handle it as follows:

```python
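from pathlib import Path

from kedro.config import ConfigLoader, MissingConfigException
from kedro.framework.project import settings

# Assumption: this runs from the project root; point `project_path` elsewhere if not.
project_path = Path.cwd()
conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = ConfigLoader(conf_source=conf_path, env="local")

try:
    credentials = conf_loader["credentials"]
except MissingConfigException:
    # No credentials files were found; fall back to an empty configuration.
    credentials = {}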
```

```{note}
The `kedro.framework.context.KedroContext` class uses the approach above to load project credentials.
```

Credentials configuration can then be used on its own or [fed into the `DataCatalog`](../data/data_catalog.md#feeding-in-credentials).

### AWS credentials

When you work with AWS credentials on datasets, you are not required to store AWS credentials in the project configuration files. Instead, you can specify them using environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and, optionally, `AWS_SESSION_TOKEN`. Please refer to the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html) for more details.
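
If you rely on environment variables, a quick pre-flight check can catch a missing variable before a dataset fails to load. The sketch below is an illustration only: `missing_aws_variables` is a hypothetical helper, not part of Kedro or of any AWS library.

```python
import os
from typing import List, Mapping

# The two variables named above as required; AWS_SESSION_TOKEN is optional.
REQUIRED_AWS_VARIABLES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_variables(environ: Mapping[str, str]) -> List[str]:
    """Return the required AWS variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARIABLES if not environ.get(name)]


missing = missing_aws_variables(os.environ)
if missing:
    print(f"Missing AWS environment variables: {', '.join(missing)}")
else:
    print("AWS credentials are available from the environment.")
```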