
[SPARK-52224][CONNECT][PYTHON] Introduce pyyaml as a dependency for the Python client #50944


Open
wants to merge 1 commit into
base: master
from

Conversation

sryza
Contributor

@sryza sryza commented May 19, 2025

What changes were proposed in this pull request?

Introduces pyyaml as a dependency for the Python client. When users pip install the pyspark client, pyyaml will be installed along with it.

Why are the changes needed?

The pipeline spec file described in the Declarative Pipelines SPIP expects data in YAML format. YAML is a better fit than the alternatives for a few reasons:

  • Unlike the flat files that are used for spark-submit confs, it supports the hierarchical data required by the pipeline spec.
  • It's much more user-friendly to author than JSON.
  • It's consistent with the config files used for similar tools, like dbt.

The Declarative Pipelines CLI will be a Spark Connect Python client, and will thus require a Python library for loading YAML. The pyyaml library is an extremely stable dependency: the safe_load function that we'll use to load YAML files was introduced more than a decade ago.
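For illustration, loading a hierarchical spec with `yaml.safe_load` might look roughly like the sketch below. The field names (`name`, `catalog`, `libraries`, `configuration`) and the `load_spec` helper are invented for this example and are not taken from the SPIP or from this PR.

```python
import yaml

# Hypothetical spec contents; the real pipeline spec schema may differ.
SPEC_TEXT = """
name: example_pipeline
catalog: main
libraries:
  - path: "transformations/orders.py"
  - path: "transformations/customers.py"
configuration:
  retries: 3
"""


def load_spec(text: str) -> dict:
    """Parse YAML text into plain Python objects using the safe loader."""
    data = yaml.safe_load(text)
    # safe_load returns None for empty input and may return a list or scalar,
    # so check that the top level is a mapping before using it as one.
    if not isinstance(data, dict):
        raise ValueError("expected a YAML mapping at the top level of the spec")
    return data


spec = load_spec(SPEC_TEXT)
for library in spec["libraries"]:
    print(library["path"])
```

Nested lists and mappings come back as ordinary Python lists and dicts, which is the hierarchical structure a flat key-value conf file cannot express directly.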

Does this PR introduce any user-facing change?

Yes – users who pip install the PySpark client library will see the pyyaml library installed.

How was this patch tested?

  • Made a clean virtualenv
  • Ran pip install python/packaging/client
  • Confirmed that I could import yaml in a Python shell
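The last check above could also be scripted instead of run by hand in a shell. A minimal sketch using only the standard library; the helper name `is_importable` is made up for illustration and is not part of this PR:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec locates the module without importing it and returns None
    # when no such top-level module is installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Run inside the clean virtualenv after the pip install step.
    print("yaml importable:", is_importable("yaml"))
```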

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon HyukjinKwon changed the title [SPARK-52224] Introduce pyyaml as a dependency for the Python client [SPARK-52224][CONNECT][PYTHON Introduce pyyaml as a dependency for the Python client May 20, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-52224][CONNECT][PYTHON Introduce pyyaml as a dependency for the Python client [SPARK-52224][CONNECT][PYTHON] Introduce pyyaml as a dependency for the Python client May 20, 2025
Contributor

@grundprinzip grundprinzip left a comment


You need to edit setup.py in 2 more places:

  1. python/packaging/connect/setup.py
  2. python/packaging/classic/setup.py -> here only for the connect additional requirements
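As a rough illustration of the difference between the two files (this is not the actual content of either setup.py): a hard dependency is listed under setuptools' `install_requires`, while a dependency needed only for Connect is listed under an `extras_require` key. The variable names below are placeholders, and any version constraint is deliberately left out.

```python
# Hypothetical fragments; the real setup.py files contain many more entries.

# python/packaging/connect/setup.py: pyyaml is always installed with the client.
connect_install_requires = [
    "pyyaml",  # version constraint, if any, omitted in this sketch
]

# python/packaging/classic/setup.py: pyyaml is only pulled in through the
# extra, e.g. `pip install pyspark[connect]`.
classic_extras_require = {
    "connect": [
        "pyyaml",
    ],
}
```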

@sryza
Contributor Author

sryza commented May 20, 2025

Thanks @grundprinzip – I added those too

@sryza sryza requested a review from grundprinzip May 20, 2025 14:45
4 participants