AWS_PROFILE should be supported in cloud storage I/O config #18757

Open

hutch3232 opened this issue Sep 15, 2024 · 5 comments
Labels: enhancement (New feature or an improvement of an existing feature)

@hutch3232

Description

I have a variety of different AWS/S3 profiles in my ~/.aws/credentials and ~/.aws/config files. I'd like to be able to either pass profile explicitly via storage_options, or set it implicitly through the AWS_PROFILE environment variable, so that I can be sure the appropriate bucket keys, endpoint, and other settings are used.
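
For concreteness, a minimal sketch of the implicit route I'd like to use (placeholder profile and bucket names; today polars does not select the named profile this way and instead falls back to whichever profile appears first in those files, as described below):

import os
import polars as pl

# Desired behaviour: select a named profile via the standard AWS environment
# variable instead of relying on the order of entries in ~/.aws/credentials.
os.environ["AWS_PROFILE"] = "my-profile"  # placeholder profile name

pl.read_parquet("s3://my-bucket/my-parquet/*.parquet")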

I saw here that profile is not listed as a supported option: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

polars seems to use the first profile listed in those ~/.aws files, even if that profile is not named 'default'. By ensuring the relevant profile is listed first, pl.read_parquet("s3://my-bucket/my-parquet/*.parquet") does work, but relying on file order is confusing and does not scale.

import polars as pl

pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
                storage_options={"profile": "my-profile"})

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
      2                 storage_options={"profile": "my-profile"})

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:184, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    181     source = [io.BytesIO(s) for s in source]  # type: ignore[arg-type, assignment]
    183 # For other inputs, defer to `scan_parquet`
--> 184 lf = scan_parquet(
    185     source,  # type: ignore[arg-type]
    186     n_rows=n_rows,
    187     row_index_name=row_index_name,
    188     row_index_offset=row_index_offset,
    189     parallel=parallel,
    190     use_statistics=use_statistics,
    191     hive_partitioning=hive_partitioning,
    192     hive_schema=hive_schema,
    193     try_parse_hive_dates=try_parse_hive_dates,
    194     rechunk=rechunk,
    195     low_memory=low_memory,
    196     cache=False,
    197     storage_options=storage_options,
    198     retries=retries,
    199     glob=glob,
    200     include_file_paths=None,
    201 )
    203 if columns is not None:
    204     if is_int_sequence(columns):

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:425, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, cache, storage_options, retries, include_file_paths)
    420 elif is_path_or_str_sequence(source):
    421     source = [
    422         normalize_filepath(source, check_not_directory=False) for source in source
    423     ]
--> 425 return _scan_parquet_impl(
    426     source,  # type: ignore[arg-type]
    427     n_rows=n_rows,
    428     cache=cache,
    429     parallel=parallel,
    430     rechunk=rechunk,
    431     row_index_name=row_index_name,
    432     row_index_offset=row_index_offset,
    433     storage_options=storage_options,
    434     low_memory=low_memory,
    435     use_statistics=use_statistics,
    436     hive_partitioning=hive_partitioning,
    437     hive_schema=hive_schema,
    438     try_parse_hive_dates=try_parse_hive_dates,
    439     retries=retries,
    440     glob=glob,
    441     include_file_paths=include_file_paths,
    442 )

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:476, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, retries, include_file_paths)
    472 else:
    473     # Handle empty dict input
    474     storage_options = None
--> 476 pylf = PyLazyFrame.new_from_parquet(
    477     source,
    478     sources,
    479     n_rows,
    480     cache,
    481     parallel,
    482     rechunk,
    483     parse_row_index_args(row_index_name, row_index_offset),
    484     low_memory,
    485     cloud_options=storage_options,
    486     use_statistics=use_statistics,
    487     hive_partitioning=hive_partitioning,
    488     hive_schema=hive_schema,
    489     try_parse_hive_dates=try_parse_hive_dates,
    490     retries=retries,
    491     glob=glob,
    492     include_file_paths=include_file_paths,
    493 )
    494 return wrap_ldf(pylf)

ComputeError: unknown configuration key: profile

FWIW this functionality exists in pandas and I'm hoping to migrate code to polars, but this is kind of essential.

@avimallu
Contributor

I doubt Polars has control over object_store feature additions. I suggest you raise this request in their repo.

@hutch3232
Author

Oh, I somehow didn't realize they were separate libraries. Looks like it used to be experimentally supported but that support was dropped. Bummer.

apache/arrow-rs#4238
apache/arrow-rs#4556

@stevenmanton

Yikes. It looks like there's no easy way to get AWS profile support in polars, then. That's a significant gap in the object_store package. My only workaround, then, is pl.read_parquet(..., use_pyarrow=True).
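
For reference, a minimal sketch of that workaround, assuming pyarrow is installed and that its S3 filesystem resolves credentials through the AWS SDK default chain (which honours AWS_PROFILE); the path and profile name below are placeholders:

import os
import polars as pl

# Route the read through pyarrow instead of the native object_store-based reader.
os.environ["AWS_PROFILE"] = "my-profile"  # placeholder profile name

pl.read_parquet(
    "s3://my-bucket/my-parquet/data.parquet",  # placeholder path
    use_pyarrow=True,
)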

@tustvold

tustvold commented Sep 27, 2024

👋 object_store maintainer here. The major challenge with supporting AWS_PROFILE is the sheer scope of such an initiative; even the official Rust AWS SDK continues to have issues in this space (awslabs/aws-sdk-rust#1193). Whilst we did at one point support AWS_PROFILE in object_store, it was tacked on and led to surprising inconsistencies for users, as only some of the configuration would be respected. We do not use the vendor SDKs, as this allows for a more consistent experience across stores (AWS is the only one with an official Rust SDK) and a significantly smaller dependency footprint. There is more information in apache/arrow-rs#2176.

This support for AWS_PROFILE was therefore removed and replaced with a more flexible API allowing users and system integrators to configure how to source credentials from their environment. I have filed #18979 to suggest exposing this in polars.

Edit: As an aside, I would strongly encourage using aws-vault to generate session credentials: not only does it avoid this class of issue, it also avoids storing credentials in plain text on the filesystem and relying on individual apps/tools to use the correct profile.

@hutch3232
Author

hutch3232 commented Sep 30, 2024

One interesting thing I just realized: pl.read_csv actually accepts "profile" in storage_options. That's surprising, considering pl.read_parquet does not.

Edit: tested polars 1.8.2
Edit2: in fact, pl.read_csv can pick up AWS_PROFILE and even AWS_ENDPOINT_URL (see: #18758)
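
For illustration, a minimal sketch of the asymmetry described above, as observed on polars 1.8.2 (bucket, key, and profile name are placeholders):

import polars as pl

# Accepted by the eager CSV reader, as reported above.
pl.read_csv(
    "s3://my-bucket/my-data.csv",  # placeholder path
    storage_options={"profile": "my-profile"},  # placeholder profile name
)

# The same key is rejected by read_parquet / scan_parquet with
# "ComputeError: unknown configuration key: profile", as shown in the traceback above.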
