Reduce memory usage in S3Hook #35449

@Taragolis

Original stack trace from Slack:

Error:
  File "/usr/local/airflow/plugins/plugins/others/data_source_monitor.py", line 53, in retrieve_data
    get_time_query = s3_hook.read_key(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 514, in read_key
    obj = self.get_key(key, bucket_name)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 493, in get_key
    s3_resource = self.get_session().resource(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/boto3/session.py", line 446, in resource
    client = self.client(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/session.py", line 976, in create_client
    client = client_creator.create_client(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 116, in create_client
    endpoints_ruleset_data = self._load_service_endpoints_ruleset(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 220, in _load_service_endpoints_ruleset
    return self._loader.load_service_model(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 142, in _wrapper
    data = func(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 406, in load_service_model
    known_services = self.list_available_services(type_name)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 142, in _wrapper
    data = func(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 311, in list_available_services
    api_versions = os.listdir(full_dirname)
OSError: [Errno 12] Cannot allocate memory: '/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/data/efs'

The reason for this error is simple: for some operations, S3Hook creates a boto3 resource (the high-level client) in addition to S3.Client, and this resource is created anew every time such a method of S3Hook is called. As a result, additional memory is required; for example, calling S3Hook.download_file in a loop might trigger this error.
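
The pattern can be illustrated with a small simulation. This is only a sketch: `PerCallHook` and `CountingResource` are hypothetical stand-ins invented for illustration, not Airflow or boto3 classes, and the counter merely shows how many objects get built, not how much memory each one takes.

```python
class CountingResource:
    """Hypothetical stand-in for an expensive resource object (not boto3)."""

    created = 0  # class-level counter of how many instances were built

    def __init__(self) -> None:
        type(self).created += 1

    def fetch(self, key: str) -> str:
        return f"contents of {key}"


class PerCallHook:
    """Hypothetical stand-in for the per-call pattern described above."""

    def __init__(self, resource_factory) -> None:
        self._resource_factory = resource_factory

    def download_file(self, key: str) -> str:
        # A brand-new resource is built on every call and dropped afterwards.
        resource = self._resource_factory()
        return resource.fetch(key)


hook = PerCallHook(CountingResource)
for i in range(100):
    hook.download_file(f"data/{i}.csv")

print(CountingResource.created)  # 100: one resource per call
```

If each real resource is costly to construct, a loop like this pays that cost on every iteration, which matches the failure point in the stack trace (inside `self.get_session().resource(`).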

As usual, there are at least two solutions:
Option 1: Cache the resource within the internal methods of S3Hook.
Option 2: Get rid of resource usage in S3Hook and replace it with S3.Client methods. This might be the better solution:

  • Resources do not seem to be actively maintained in boto3
  • Creating a new resource object requires about 30-40 MB of memory, whereas everything (and even more) can be done with S3.Client
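
A minimal sketch of how the two options might look, assuming a simplified reader class rather than the real S3Hook. `ClientOnlyS3Reader` and `client_factory` are hypothetical names introduced for illustration; `get_object` and `download_file` are existing boto3 S3 client methods. The client is injected through a factory so the sketch does not depend on how the hook builds its session.

```python
from functools import cached_property
from typing import Any, Callable


class ClientOnlyS3Reader:
    """Hypothetical stand-in for the relevant part of S3Hook (not Airflow code)."""

    def __init__(self, client_factory: Callable[[], Any]) -> None:
        # Assumption: the caller supplies something that returns an S3 client,
        # for example `lambda: boto3.client("s3")`.
        self._client_factory = client_factory

    @cached_property
    def client(self) -> Any:
        # Option 1: build the object once per instance and reuse it afterwards.
        return self._client_factory()

    def read_key(self, key: str, bucket_name: str) -> str:
        # Option 2: call S3.Client.get_object directly instead of going
        # through a resource object.
        response = self.client.get_object(Bucket=bucket_name, Key=key)
        return response["Body"].read().decode("utf-8")

    def download(self, key: str, bucket_name: str, local_path: str) -> None:
        # Option 2 again: S3.Client.download_file needs no resource either.
        self.client.download_file(Bucket=bucket_name, Key=key, Filename=local_path)
```

With boto3 installed, this could be used as `ClientOnlyS3Reader(lambda: boto3.client("s3")).read_key("some/key", "some-bucket")`; repeated calls on the same instance reuse the single cached client.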

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.

Labels: area:providers, kind:meta, provider:amazon
