Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document clearly thatkedro micropkg pull relies on sdists #2542

Closed
astrojuanlu opened this issue Apr 26, 2023 · 3 comments · Fixed by #2614
Closed

Document clearly thatkedro micropkg pull relies on sdists #2542

astrojuanlu opened this issue Apr 26, 2023 · 3 comments · Fixed by #2614
Assignees
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@astrojuanlu
Copy link
Member

Description

The documentation for kedro micropkg pull lists three possible use cases:

  • kedro micropkg pull dist/<pipeline_name>-0.1-py3-none-any.tar.gz (local directory)
  • kedro micropkg pull s3://my_bucket/<pipeline_name>-0.1-py3-none-any.tar.gz (cloud storage)
  • kedro micropkg pull <pypi_package_name> (PyPI-like endpoint)

The last one relies on pip download --no-deps to download the corresponding source distributions:

python_call(
"pip", ["download", "--no-deps", "--dest", str(destination), location]
)

However, if the micropackage in question was published as a wheel, then pip download grabs that wheel and this code fails:

sdist_file = list(destination.glob("*.tar.gz"))
# `--no-deps` should fetch only one source distribution file, and CLI should fail if that's
# not the case.
if len(sdist_file) != 1:
file_names = [sf.name for sf in sdist_file]
raise KedroCliError(
f"More than 1 or no sdist files found: {file_names}. "
f"There has to be exactly one source distribution file."
)

Steps to Reproduce

kedro new --starter=spaceflights
cd spaceflights
pip install -r src/requirements.txt

kedro micropkg package pipelines.data_processing

pushd dist && pip wheel *.tar.gz && popd

twine upload --repository-url http://0.0.0.0:8080/ dist/*

kedro micropkg pull data_processing

Expected Result

The wheel is properly used.

Actual Result

> PIP_TRUSTED_HOST=0.0.0.0 PIP_INDEX_URL=http://0.0.0.0:8080 kedro micropkg pull data_processing
/Users/juan_cano/.micromamba/envs/_test38/bin/python3.8 -m pip download --no-deps --dest /private/var/folders/r7/ywj0_kvj0mxfkdkx0jrgh73r0000gn/T/tmpq275beek data_processing
Looking in indexes: http://0.0.0.0:8080
Collecting data_processing
  Downloading http://0.0.0.0:8080/packages/data_processing-0.1-py3-none-any.whl (2.7 kB)
Saved /private/var/folders/r7/ywj0_kvj0mxfkdkx0jrgh73r0000gn/T/tmpq275beek/data_processing-0.1-py3-none-any.whl
Successfully downloaded data_processing
kedro.framework.cli.utils.KedroCliError: More than 1 or no sdist files found: []. There has to be exactly one source distribution file.
Run with --verbose to see the full exception
Error: More than 1 or no sdist files found: []. There has to be exactly one source distribution file.

Workaround

Using the --no-binary=:all: option in pip download works:

> PIP_NO_BINARY=:all: PIP_TRUSTED_HOST=0.0.0.0 PIP_INDEX_URL=http://0.0.0.0:8080 kedro micropkg pull data_processing
/Users/juan_cano/.micromamba/envs/_test38/bin/python3.8 -m pip download --no-deps --dest /private/var/folders/r7/ywj0_kvj0mxfkdkx0jrgh73r0000gn/T/tmpfc66u936 data_processing
Looking in indexes: http://0.0.0.0:8080
Collecting data_processing
  Downloading http://0.0.0.0:8080/packages/data_processing-0.1.tar.gz (2.0 kB)
  Preparing metadata (setup.py) ... done
Saved /private/var/folders/r7/ywj0_kvj0mxfkdkx0jrgh73r0000gn/T/tmpfc66u936/data_processing-0.1.tar.gz
Successfully downloaded data_processing
Creating '/private/tmp/spaceflights/conf/base/parameters': 
  Creating '/private/tmp/spaceflights/conf/base/parameters/data_processing.yml': SKIPPED (already exists)
[04/26/23 18:45:39] WARNING  /Users/juan_cano/.micromamba/envs/_test38/lib/python3.8/site-packages/rope/base/project.py:225: DeprecationWarning:    warnings.py:109
                             Delete once deprecated functions are gone                                                                                             
                               self._init_source_folders()                                                                                                         
                                                                                                                                                                   
Creating '/private/tmp/spaceflights/src/spaceflights/data_processing/__init__.py': OK
Creating '/private/tmp/spaceflights/src/spaceflights/data_processing/pipeline.py': OK
Creating '/private/tmp/spaceflights/src/spaceflights/data_processing/nodes.py': OK
Micro-package data_processing pulled and unpacked!

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): kedro 0.18.7
  • Python version used (python -V): 3.8.16
  • Operating system and version: macOS Ventura
  • pip version: 23.1.2
@astrojuanlu astrojuanlu added the Issue: Bug Report 🐞 Bug that needs to be fixed label Apr 26, 2023
@astrojuanlu
Copy link
Member Author

While working on gh-2273 I realized that kedro package produces exclusively a wheel, whereas kedro micropkg produces exclusively a sdist. There is some old discussion at https://github.com/quantumblacklabs/private-kedro/issues/1150#issuecomment-879781083 (private link) as to why sdists were chosen for micropackages, which was echoed in #1304, but I'd need to dive a bit deeper to fully understand the decision.

@merelcht merelcht added the Component: Documentation 📄 Issue/PR for markdown and API documentation label May 9, 2023
@merelcht merelcht changed the title kedro micropkg pull relies on sdists that might not be there Document clearly thatkedro micropkg pull relies on sdists May 9, 2023
@merelcht merelcht removed the Issue: Bug Report 🐞 Bug that needs to be fixed label May 9, 2023
@astrojuanlu
Copy link
Member Author

From another user report https://www.linen.dev/s/kedro/t/11968557/hi-team-question-on-kedro-pull-micropkg-and-sdist-here-would#553527b9-37fd-47c0-b289-b667dfc9bfa4, I think there are three ways we could improve the code, apart from the documentation, without changing its behavior:

  • Add --no-binary :all: to pip download, so only sdists are downloaded.
  • If pip download fails, raise an appropriate error (like "package download failed", or "package not found").
  • In the error message, be more explicit on whether none or > 1 sdists were found, to help debugging.

@astrojuanlu
Copy link
Member Author

More broadly, there's a strong coupling between what kedro micropkg package produces and what kedro micropkg pull consumes, and we should warn users about that. Specifically, kedro micropkg pull assumes

  • sdist,
  • with a flat (non src) layout,
  • using setup.py (hence setuptools)

astrojuanlu added a commit that referenced this issue May 29, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
astrojuanlu added a commit that referenced this issue May 29, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
astrojuanlu added a commit that referenced this issue May 29, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
astrojuanlu added a commit that referenced this issue May 29, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@astrojuanlu astrojuanlu self-assigned this May 29, 2023
@astrojuanlu astrojuanlu moved this to In Progress in Kedro Framework May 29, 2023
astrojuanlu added a commit that referenced this issue May 30, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
astrojuanlu added a commit that referenced this issue Jun 5, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
astrojuanlu added a commit that referenced this issue Jun 5, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
astrojuanlu added a commit that referenced this issue Jun 5, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@merelcht merelcht removed the status in Kedro Framework Jun 12, 2023
astrojuanlu added a commit that referenced this issue Jun 13, 2023
Fix gh-2542.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@astrojuanlu astrojuanlu moved this to In Review in Kedro Framework Jun 13, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Jun 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants