feat(aws): add s3 support to input, storage, output, cache, etc. #1830

Open: knguyen1 wants to merge 25 commits into main

Conversation

knguyen1 commented Mar 20, 2025

Description

This PR adds S3 integration to GraphRAG. It supports both AWS S3 and S3-compatible services such as MinIO (via endpoint_url).

Related Issues

#1306

Proposed Changes

  • Add S3 pipeline storage implementation with full PipelineStorage interface support (graphrag/storage/s3_pipeline_storage.py)
  • Add S3 workflow callbacks for logging workflow events to S3 buckets (graphrag/callbacks/s3_workflow_callbacks.py)
  • Add S3 prompt loading capability for retrieving prompts directly from S3 buckets (graphrag/config/prompt_getter.py)
  • Add configuration support for S3 across all storage components (input, output, cache, reporting)
  • Add comprehensive documentation covering configuration, authentication options, and troubleshooting (docs/config/s3.md)
  • Add unit tests with mocked AWS services for all S3 components
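
As a rough illustration of the prompt-loading and mocked-test items above, the sketch below reads a prompt from a bucket and exercises the code with a stand-in client. It is not the code in this PR: the helper names `load_prompt_from_s3` and `make_fake_client`, the bucket name, and the object key are all hypothetical. The only real API assumed is the boto3 S3 client's `get_object(Bucket=..., Key=...)`, whose response carries a readable `Body`.

```python
import io
from typing import Any
from unittest.mock import MagicMock


def load_prompt_from_s3(client: Any, bucket: str, key: str, encoding: str = "utf-8") -> str:
    """Fetch an object and decode its body as prompt text (hypothetical helper)."""
    response = client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode(encoding)


def make_fake_client(objects: dict[str, bytes]) -> MagicMock:
    """Build a mock standing in for an S3 client; no network or AWS account is needed."""
    fake = MagicMock()
    # Mimic the shape of a get_object response: a dict with a file-like "Body".
    fake.get_object.side_effect = lambda Bucket, Key: {"Body": io.BytesIO(objects[Key])}
    return fake


# Hypothetical bucket and key, used only for this illustration.
fake_client = make_fake_client(
    {"prompts/extract_graph.txt": b"Extract entities from: {input_text}"}
)
prompt = load_prompt_from_s3(fake_client, "example-bucket", "prompts/extract_graph.txt")
```

Passing the client in as an argument keeps the helper testable without patching module globals; a real test suite might instead use a library that emulates AWS services.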

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

  • Supports multiple authentication methods: explicit credentials, environment variables, AWS credential chain, and IAM roles
  • Compatible with S3-compatible storage services via configurable endpoint URLs
  • Implements lazy loading of S3 clients for improved performance
  • Includes proper error handling and logging for S3 operations
  • Storage paths are configurable via environment variables or YAML configuration
  • All S3 operations are thoroughly tested with mocked AWS services
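
The lazy-loading and endpoint-URL notes above could look roughly like the following. This is a minimal sketch under stated assumptions, not the implementation in this PR: the class name `LazyS3Client`, the `client_factory` parameter, and the helper `_boto3_s3_client` are hypothetical. The keyword arguments `endpoint_url`, `region_name`, `aws_access_key_id`, and `aws_secret_access_key` are real `boto3.client` parameters; when they are left unset, boto3 resolves credentials through its default chain (environment variables, shared config, IAM role).

```python
from typing import Any, Callable, Optional


def _boto3_s3_client(**kwargs: Any) -> Any:
    # Imported here so that constructing the wrapper does not itself require boto3.
    import boto3

    return boto3.client("s3", **kwargs)


class LazyS3Client:
    """Hypothetical wrapper that builds the S3 client on first use and caches it."""

    def __init__(
        self,
        endpoint_url: Optional[str] = None,
        region_name: Optional[str] = None,
        aws_access_key_id: Optional[str] = None,
        aws_secret_access_key: Optional[str] = None,
        client_factory: Callable[..., Any] = _boto3_s3_client,
    ) -> None:
        self._options = {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
            "aws_access_key_id": aws_access_key_id,
            "aws_secret_access_key": aws_secret_access_key,
        }
        self._factory = client_factory
        self._client: Any = None

    @property
    def client(self) -> Any:
        if self._client is None:
            # Drop unset options so the factory (boto3 by default) can fall back
            # to its own defaults, including the default credential chain.
            kwargs = {k: v for k, v in self._options.items() if v is not None}
            self._client = self._factory(**kwargs)
        return self._client
```

With this shape, something like `LazyS3Client(endpoint_url="http://localhost:9000")` would target a local S3-compatible service such as MinIO, and no client is created until `.client` is first accessed.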

knguyen1 requested review from a team as code owners on March 20, 2025 14:45
knguyen1 changed the title from "Feat/add s3 support" to "feat(aws): add s3 support to input, storage, output, cache, etc." on Mar 20, 2025
knguyen1 (Author) commented

@microsoft-github-policy-service agree

Sirorororo commented

Can you also add an option to pass the endpoint URL to the boto3 client, so that storage on other platforms such as MinIO is possible through the S3 API?

knguyen1 (Author) commented Apr 9, 2025

Can you also add an option to pass the endpoint URL to the boto3 client, so that storage on other platforms such as MinIO is possible through the S3 API?

Done: f1fd55d

knguyen1 (Author) commented Apr 9, 2025

Please review @natoverse

knguyen1 force-pushed the feat/add-s3-support branch from 30380b4 to 4dcc89d on April 10, 2025 06:44
knguyen1 force-pushed the feat/add-s3-support branch from 6272869 to 4040d4b on April 24, 2025 13:07
knguyen1 (Author) commented

@natoverse @AlonsoGuevara review please?

qcloop commented May 21, 2025

What is the status of this PR? We run our infra on AWS, so it would be great to have this functionality.


4 participants