Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend path-like keys in the catalog #1934

Open
PetitLepton opened this issue Oct 15, 2022 · 4 comments
Open

Extend path-like keys in the catalog #1934

PetitLepton opened this issue Oct 15, 2022 · 4 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature

Comments

@PetitLepton
Copy link
Contributor

PetitLepton commented Oct 15, 2022

Description

In the current implementation of the parsing of the context, the function _convert_paths_to_absolute_posix — which converts relative to absolute paths — restrict the conversion to three types of keys, namely

    # only check a few conf keys that are known to specify a path string as value
    conf_keys_with_filepath = ("filename", "filepath", "path")

This is very restrictive and can lead to unexpected issues while implementing custom data set as explained in the example below.

Context

Let's imagine that I want to extend the SQLQueryDataSet by using a Jinja template and feed it with parameters. I would have done something like the following

class TemplatedSQLQueryDataSet(SQLQueryDataSet):
    def __init__(
        self,
        templated_query_filepath: str,
        parameters_path: str,
        credentials: Optional[Dict[str, Any]] = None,
        load_args: Optional[Dict[str, Any]] = None,
        fs_args: Optional[Dict[str, Any]] = None,
    ) -> None:

        if parameters_path is not None:
            self._query_parameters = JSONDataSet(filepath=parameters_path).load()

        super().__init__(
            "",
            credentials or {},
            load_args or {},
            fs_args or {},
            templated_query_filepath,
        )

    def _load(self) -> pd.DataFrame:
        load_args = copy.deepcopy(self._load_args)
        engine = self.engines[self._connection_str]  # type: ignore

        if self._filepath:
            load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
            with self._fs.open(load_path, mode="r") as fs_file:
                template = fs_file.read()

        load_args["sql"] = Template(source=template).render(**self._query_parameters)
        return pd.read_sql_query(con=engine, **load_args)

While this works when running a pipeline or using kedro ipython, it potentially breaks when used in a notebook under notebooks with a catalog entry like

whatever:
  type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
  templated_query_filepath: data/01_raw/template.sql
  parameters_path: data/03_primary/parameters.json

because the relative paths are not properly parsed when loading the context.

One can solve this particular example by replacing the two paths by filepath and path for example which belong to the list of parsed keys mentioned at the top of this issue. In terms of user experience, it is nevertheless pretty unintuitive. Another possibility is to have nested elements like "template": {"filepath": ...}.

Possible Implementation

Could we use something like a regex containing path instead of a rigid list?

Thanks in advance!

@PetitLepton PetitLepton added the Issue: Feature Request New feature or improvement to existing feature label Oct 15, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Oct 17, 2022
@merelcht
Copy link
Member

Hi @PetitLepton, we haven't heard this request before, but we'd be happy to accept a contribution for it 🙂

@astrojuanlu
Copy link
Member

Coming back to this. The offending code is

# only check a few conf keys that are known to specify a path string as value
conf_keys_with_filepath = ("filename", "filepath", "path")

We spoke about this at length when debating #2965 and related "Kedro as a library" issues.

The current hardcoded list is less than ideal, but this "magic" behavior is deeply ingrained in Kedro at the moment, and the only realistic way to move past it is to introduce variables in the filepaths so that the user is in control:

whatever:
  type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
  filepath: ${kedro:project_root}/data/01_raw/template.sql

I think the proposed solution, namely

Could we use something like a regex containing path instead of a rigid list?

Goes in the opposite direction of expanding this behavior.

If you want to avoid mangling there's a workaround which is not using filepath. If you do want some special mangling, you can potentially achieve it using a custom resolver (haven't actually verified it, so if it turns out it's impossible, would be useful to know).

This probably deserves some more discussion.

@AhdraMeraliQB
Copy link
Contributor

Hi @PetitLepton, @astrojuanlu - I believe #3561 resolves this issue and gets rid of the hardcoding. We now use relative paths across the board, does the example above still have issues?

@astrojuanlu
Copy link
Member

I think this issue is still there:

# only check a few conf keys that are known to specify a path string as value
conf_keys_with_filepath = ("filename", "filepath", "path")

Recently I saw a user doing lots of

dataset:
  type: ...
  filepath: ${base_path}/dir/file.parquet

so it's a pattern that our users are putting in the wild now.

Instead of "let's extend or make configurable the keys that receive a special treatment", I think we should find ways to support something like this:

whatever:
  type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
  filepath: ${kedro:project_root}/data/01_raw/template.sql

which would also solve @PetitLepton original request in #1934 (comment).

This essentially goes back to option 3 in our original debate #2924 (comment), imitating what Intake does https://intake.readthedocs.io/en/latest/catalog.html#yaml-format, and @yetudada's question here #2819 (comment) which we addressed by adding more docs #3247.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature
Projects
Status: No status
Development

No branches or pull requests

4 participants