Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] Decide new name for kedro-datasets optional dependencies #313

Closed
astrojuanlu opened this issue Aug 18, 2023 · 8 comments
Closed
Assignees

Comments

@astrojuanlu
Copy link
Member

astrojuanlu commented Aug 18, 2023

The root cause of the problems we had last week (see #306) with the optional dependency group names in pyproject.toml is that kedro-datasets extras names just don't align with Python packaging standards: https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use

Provides-Extra (multiple use)
Changed in version 2.3: PEP 685 restricted valid values to be unambiguous (i.e. no normalization required).
...
A string containing the name of an optional feature. A valid name consists only of lowercase ASCII letters, ASCII numbers, and hyphen.

Therefore, pip install kedro-datasets[pandas.CSVDataSet] is out of line with current packaging standards.

From https://peps.python.org/pep-0685/:

When comparing extra names, tools MUST normalize the names being compared using the semantics outlined in PEP 503 for names:
re.sub(r"[-_.]+", "-", name).lower()
The core metadata specification will be updated such that the allowed names for Provides-Extra matches what PEP 508 specifies for names.

So, why was this working before? See also in the PEP:

Tools generating metadata MUST raise an error if a user specified two or more extra names which would normalize to the same name.

The reason is: setuptools hasn't fully implemented PEP 685 yet: pypa/setuptools#3586

So, our mangling of extras in setup.py before #263 worked, but after transitioning to self-referential extras, we fell victims of name normalization.

I'm voting against switching to Poetry, Hatch, PDM, or any other system that allows this behavior, because it's going to bite us in the future. We need a short-term solution (either revert #263 or go for @DimedS #307) and a long term solution (possibly deprecating extras names with dots and offer an alternative syntax for our users).

Originally posted by @astrojuanlu in #306 (comment)

@astrojuanlu astrojuanlu changed the title Decide and implement new name for optional dependencies Decide and implement new name for kedro-datasets optional dependencies Aug 18, 2023
@astrojuanlu
Copy link
Member Author

Notice that this is starting to creep up as we do dynamic inclusion of requirements.txt in projects pyproject.toml. This happened to me yesterday after installing a wheel coming from a kedro package:

❯ pip install ./spaceflights-namespaces/dist/spaceflights_namespaces-0.1-py3-none-any.whl                                                                                                                                                                           (advkedro310) 
Looking in indexes: https://juan_luis_cano%40mckinsey.com:****@mckinsey.jfrog.io/artifactory/api/pypi/python/simple
Processing ./spaceflights-namespaces/dist/spaceflights_namespaces-0.1-py3-none-any.whl
...
WARNING: kedro-datasets 1.3.0 does not provide the extra 'pandas-csvdataset'
WARNING: kedro-datasets 1.3.0 does not provide the extra 'pandas-exceldataset'
WARNING: kedro-datasets 1.3.0 does not provide the extra 'pandas-parquetdataset'

Contents of requirements.txt:

kedro-datasets[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]~=1.0

(coming from https://github.com/kedro-org/kedro-starters/blob/38e81fba49702d0c8851ad4dcd488bafd08f1a36/spaceflights/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/requirements.txt#L8)

@astrojuanlu
Copy link
Member Author

@astrojuanlu
Copy link
Member Author

Extras with illegal names are not being pulled anymore in pip 23.3 onwards #553

This was already marked as high, but now it's affecting users for real cc @noklam @ankatiyar

@astrojuanlu
Copy link
Member Author

The workaround for now is to tell users to install, say, kedro-datasets[pandas] instead of kedro-datasets[pandas.ParquetDataset], since the former is legal and the latter isn't. But still, this would pull lots of unneeded dependencies.

@noklam
Copy link
Contributor

noklam commented Feb 13, 2024

Sorry for the delayed response, and thank you for looking into this. Since you've been able to recreate the problem, it doesn't sound like you still need me to post the stacktrace. I've been using pipenv to manage my environments.
I'm actually a bit confused right now ... I just created a new kedro project. Piopenv sees the requirements.txt file and uses it. Previously this didn't seem to install pyarrow or fastparquet, and pandas displayed a warning message that one of those two packages was going to be a non-optional requirement. But my current project did in fact install pyarrow.

Another user have a problem. At first he has a bug where fastparquet is installed but ParquetDataset wouldn't work with a "file is closed" error. This is a legit bug itself.

But pyarrow should be the industry standard by now, he arrives this error because pip install kedor-datasets[optional] is missing the dependencies, and the error suggest user to install fastparquet or pyarrow.

@astrojuanlu
Copy link
Member Author

The key decision here is: do we still want to provide an optional group per-dataset?

If yes, we have to decide a naming scheme.
If no, that's a breaking change (kedro-datasets 3.0?) but also the current non-standard names don't work anyway. And besides, the dataset groups already have legal names and we don't need to change those (kedro-datasets[pandas] etc).

@merelcht merelcht moved this to To Do in Kedro Framework Feb 19, 2024
@merelcht merelcht changed the title Decide and implement new name for kedro-datasets optional dependencies [Spike] Decide new name for kedro-datasets optional dependencies Feb 19, 2024
@ankatiyar ankatiyar moved this from To Do to In Progress in Kedro Framework Feb 20, 2024
@astrojuanlu
Copy link
Member Author

astrojuanlu commented Feb 26, 2024

From the Airflow 2.8.2 release notes: https://github.com/apache/airflow/releases/tag/2.8.2

Airflow extras now get extras normalized to - (following PEP-685) instead of _ and .
(as it was before in some extras). When you install airflow with such extras (for example dbt.core or
all_dbs) you should use - instead of _ and ..

apache/airflow#36537

This is exactly our problem.

@astrojuanlu
Copy link
Member Author

Fixed by #570, will see the light in the next release 👍🏽

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

5 participants