Make pandas an optional dependency for amazon provider #28468

@manugarri

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

No response

Apache Airflow version

latest

Operating System

any

Deployment

Other

Deployment details

No response

What happened

First of all, apologies if this is not the right place to post a GH issue. I looked for a section for provider-specific feature requests but couldn't find one.

We use the AWS provider at my company to interact with AWS services from Airflow. We use poetry to build the environment in which we test our DAGs.

However, the build times are quite long, and the reason is building pandas, which is a dependency of the amazon provider.

From checking the provider's code, pandas appears to be used in only a few functions inside the provider:

./aws/transfers/hive_to_dynamodb.py:93:        data = hive.get_pandas_df(self.sql, schema=self.schema)

and

./aws/transfers/sql_to_s3.py:159:        data_df = sql_hook.get_pandas_df(sql=self.query, parameters=self.parameters)

Forcing every user of the amazon provider to install pandas, even those who do not use Hive and do not want to export SQL query results to an S3 file, is cumbersome.

What you think should happen instead

Given how heavy the package is and how little it is used in the amazon provider, pandas should be an optional dependency.
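One common way to make a heavy package optional is to import it lazily, only in the code paths that need it, and to fail with an actionable message when it is missing. The following is a minimal sketch of that pattern, not the provider's actual code: the helper names `require_optional` and `rows_to_frame` and the `[pandas]` extra are hypothetical and used only for illustration.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on demand, or raise a helpful error.

    `extra` is the name of a hypothetical packaging extra that would
    pull in the missing module.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with an extra such as [{extra}] (hypothetical name)."
        ) from exc


def rows_to_frame(rows, columns):
    """Hypothetical helper: build a DataFrame only when this path runs."""
    # pandas is imported here, not at module import time, so users who
    # never call this function do not need pandas installed.
    pd = require_optional("pandas", extra="pandas")
    return pd.DataFrame(rows, columns=columns)
```

With this approach, importing the module that defines these helpers does not require pandas. Only a call to `rows_to_frame` does, and a missing installation surfaces as an `ImportError` that names the missing package.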

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
