spark_binary restricted in airflow spark 4.0.0? #30064

@ottomata

Apache Airflow Provider(s)

apache-spark

Versions of Apache Airflow Providers

4.0.0

Apache Airflow version

2.5.1

Operating System

Debian GNU/Linux 10 (buster)

Deployment

Other

Deployment details

No response

What happened

In apache-airflow-providers-apache-spark 4.0.0, the value of spark_binary was hardcoded to be restricted to either 'spark-submit' or 'spark2-submit'.

What was the reason for this? At the Wikimedia Foundation, we install the
Spark 3 binary as 'spark3-submit'. This change in airflow spark 4.0.0 has broken
some of our DAGs, making us resort to things like this.

What you think should happen instead

We'd submit a patch to expand the restriction list to include 'spark3-submit', but we aren't sure why this was done in the first place. I understand the reasoning for removing spark_home, but it seems strange to have a spark_binary parameter and restrict it to these two values.

Can we undo this? If not, should we submit a patch to add spark3-submit to the list?
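
To make the request concrete, here is a minimal sketch of an allow-list check of the kind described above, together with the extension we are proposing. This is not the provider's actual code: the names `ALLOWED_SPARK_BINARIES`, `EXTENDED_SPARK_BINARIES`, and `validate_spark_binary` are hypothetical and used only for illustration, and the real check may live elsewhere and raise a different exception.

```python
# Hypothetical allow-list mirroring the restriction described in this issue.
ALLOWED_SPARK_BINARIES = ["spark-submit", "spark2-submit"]


def validate_spark_binary(spark_binary, allowed=ALLOWED_SPARK_BINARIES):
    """Return spark_binary if it is in the allow-list, else raise ValueError."""
    if spark_binary not in allowed:
        raise ValueError(
            f"spark_binary must be one of {allowed}, got {spark_binary!r}"
        )
    return spark_binary


# Proposed change (assumption): extend the allow-list with the Spark 3 binary name.
EXTENDED_SPARK_BINARIES = [*ALLOWED_SPARK_BINARIES, "spark3-submit"]
```

With the original list, `validate_spark_binary("spark3-submit")` raises; passing `EXTENDED_SPARK_BINARIES` as the second argument accepts it.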

How to reproduce

Set spark_binary to 'spark3-submit'

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

area:providers, kind:bug, needs-triage