AWS_PROFILE should be supported in cloud storage I/O config #18757

Open

hutch3232 opened this issue Sep 15, 2024 · 5 comments
Labels: enhancement (New feature or an improvement of an existing feature)

@hutch3232

Description

I have a variety of different AWS/S3 profiles in my ~/.aws/credentials and ~/.aws/config files. I'd like to be able to either pass profile explicitly via storage_options, or set it implicitly through the AWS_PROFILE environment variable, so that I can be sure the appropriate bucket keys, endpoint, and other settings are used.
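
For concreteness, a minimal sketch of the implicit route I'd like to use (placeholder profile and bucket names; today polars does not select the named profile this way and instead falls back to whichever profile appears first in those files, as described below):

import os
import polars as pl

# Desired behaviour: select a named profile via the standard AWS environment
# variable instead of relying on the order of entries in ~/.aws/credentials.
os.environ["AWS_PROFILE"] = "my-profile"  # placeholder profile name

pl.read_parquet("s3://my-bucket/my-parquet/*.parquet")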

I saw here that profile is not listed as a supported option: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

polars seems to use the first profile listed in those ~/.aws files, even if that profile is not named 'default'. By ensuring the relevant profile is listed first, pl.read_parquet("s3://my-bucket/my-parquet/*.parquet") does work, but relying on file order is confusing and does not scale.

import polars as pl

pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
                storage_options={"profile": "my-profile"})

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
      2                 storage_options={"profile": "my-profile"})

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:184, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    181     source = [io.BytesIO(s) for s in source]  # type: ignore[arg-type, assignment]
    183 # For other inputs, defer to `scan_parquet`
--> 184 lf = scan_parquet(
    185     source,  # type: ignore[arg-type]
    186     n_rows=n_rows,
    187     row_index_name=row_index_name,
    188     row_index_offset=row_index_offset,
    189     parallel=parallel,
    190     use_statistics=use_statistics,
    191     hive_partitioning=hive_partitioning,
    192     hive_schema=hive_schema,
    193     try_parse_hive_dates=try_parse_hive_dates,
    194     rechunk=rechunk,
    195     low_memory=low_memory,
    196     cache=False,
    197     storage_options=storage_options,
    198     retries=retries,
    199     glob=glob,
    200     include_file_paths=None,
    201 )
    203 if columns is not None:
    204     if is_int_sequence(columns):

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:425, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, cache, storage_options, retries, include_file_paths)
    420 elif is_path_or_str_sequence(source):
    421     source = [
    422         normalize_filepath(source, check_not_directory=False) for source in source
    423     ]
--> 425 return _scan_parquet_impl(
    426     source,  # type: ignore[arg-type]
    427     n_rows=n_rows,
    428     cache=cache,
    429     parallel=parallel,
    430     rechunk=rechunk,
    431     row_index_name=row_index_name,
    432     row_index_offset=row_index_offset,
    433     storage_options=storage_options,
    434     low_memory=low_memory,
    435     use_statistics=use_statistics,
    436     hive_partitioning=hive_partitioning,
    437     hive_schema=hive_schema,
    438     try_parse_hive_dates=try_parse_hive_dates,
    439     retries=retries,
    440     glob=glob,
    441     include_file_paths=include_file_paths,
    442 )

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:476, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, retries, include_file_paths)
    472 else:
    473     # Handle empty dict input
    474     storage_options = None
--> 476 pylf = PyLazyFrame.new_from_parquet(
    477     source,
    478     sources,
    479     n_rows,
    480     cache,
    481     parallel,
    482     rechunk,
    483     parse_row_index_args(row_index_name, row_index_offset),
    484     low_memory,
    485     cloud_options=storage_options,
    486     use_statistics=use_statistics,
    487     hive_partitioning=hive_partitioning,
    488     hive_schema=hive_schema,
    489     try_parse_hive_dates=try_parse_hive_dates,
    490     retries=retries,
    491     glob=glob,
    492     include_file_paths=include_file_paths,
    493 )
    494 return wrap_ldf(pylf)

ComputeError: unknown configuration key: profile

FWIW this functionality exists in pandas and I'm hoping to migrate code to polars, but this is kind of essential.

@avimallu
Contributor

I doubt Polars has control over object_store feature additions. I suggest you raise this request in their repo.

@hutch3232
Author

Oh, I somehow didn't realize they were separate libraries. Looks like it used to be experimentally supported but that support was dropped. Bummer.

apache/arrow-rs#4238
apache/arrow-rs#4556

@stevenmanton

Yikes. It looks like there's no easy way to get AWS profile support in polars, then. That's a significant gap in the object_store package. My only workaround, then, is pl.read_parquet(..., use_pyarrow=True).
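
For reference, a minimal sketch of that workaround, assuming pyarrow is installed and that its S3 filesystem resolves credentials through the AWS SDK default chain (which honours AWS_PROFILE); the path and profile name below are placeholders:

import os
import polars as pl

# Route the read through pyarrow instead of the native object_store-based reader.
os.environ["AWS_PROFILE"] = "my-profile"  # placeholder profile name

pl.read_parquet(
    "s3://my-bucket/my-parquet/data.parquet",  # placeholder path
    use_pyarrow=True,
)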

@tustvold

tustvold commented Sep 27, 2024

👋 object_store maintainer here. The major challenge with supporting AWS_PROFILE is the sheer scope of such an initiative; even the official Rust AWS SDK continues to have issues in this space (awslabs/aws-sdk-rust#1193). Whilst we did at one point support AWS_PROFILE in object_store, it was tacked on and led to surprising inconsistencies for users, as only some of the configuration would be respected. We do not use the vendor SDKs, as this allows for a more consistent experience across stores (AWS is the only one with an official Rust SDK) and a significantly smaller dependency footprint. There is more information in apache/arrow-rs#2176.

This support for AWS_PROFILE was therefore removed and replaced with a more flexible API allowing users and system integrators to configure how to source credentials from their environment. I have filed #18979 to suggest exposing this in polars.

Edit: As an aside, I would strongly encourage using aws-vault to generate session credentials: not only does it avoid this class of issue, it also avoids storing credentials in plain text on the filesystem and relying on individual apps/tools to use the correct profile.

@hutch3232
Author

hutch3232 commented Sep 30, 2024

One interesting thing I just realized: pl.read_csv actually accepts "profile" in storage_options. That's surprising, considering pl.read_parquet does not.

Edit: tested polars 1.8.2
Edit2: in fact, pl.read_csv can pick up AWS_PROFILE and even AWS_ENDPOINT_URL (see: #18758)
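
For illustration, a minimal sketch of the asymmetry described above, as observed on polars 1.8.2 (bucket, key, and profile name are placeholders):

import polars as pl

# Accepted by the eager CSV reader, as reported above.
pl.read_csv(
    "s3://my-bucket/my-data.csv",  # placeholder path
    storage_options={"profile": "my-profile"},  # placeholder profile name
)

# The same key is rejected by read_parquet / scan_parquet with
# "ComputeError: unknown configuration key: profile", as shown in the traceback above.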
