
feat: Implement support for multiple storage backends #99


Open · wants to merge 7 commits into main

Conversation


@FrithjofWinkelmannQC FrithjofWinkelmannQC commented Jul 17, 2025

Motivation

Closes #87.

This PR introduces fsspec to support different storage backends for IO methods. For now, we assume all relevant access parameters are configured via environment variables.

Changes

  • Added fsspec to the dependencies and modified the IO calls in collection.py to use it instead of the native Python library (see the sketch below)
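
A minimal sketch of the resulting access pattern (illustrative only; the helper name is an assumption for this example, not the exact code in collection.py):

from pathlib import Path

import fsspec


def read_schema_json(directory: str | Path) -> str:
    """Read a collection's schema file through whichever backend the URL points to."""
    # url_to_fs infers the backend from the URL: a plain path or file:// uses the
    # local filesystem, s3://... uses s3fs, and so on. Credentials are picked up
    # from environment variables (e.g. the AWS_* variables for S3).
    fs, path = fsspec.url_to_fs(str(directory))
    with fs.open(f"{path}{fs.sep}schema.json", "r") as f:
        return f.read()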

assert isinstance(fs, fsspec.AbstractFileSystem)
try:
    fs.makedir(directory, create_parents=True)
except FileExistsError:


fs.makedir has no exists_ok parameter.


Did you try fs.makedirs? It seems to provide that parameter.
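
For reference, a hedged sketch of the reviewer's suggestion (assuming a current fsspec version, where AbstractFileSystem.makedirs accepts exist_ok; the target path is hypothetical):

import fsspec

directory = "/tmp/example-collection"  # hypothetical target directory

fs, path = fsspec.url_to_fs(directory)
# makedirs (plural) creates missing parents and, with exist_ok=True, does not
# raise if the directory already exists, so the try/except around makedir
# becomes unnecessary.
fs.makedirs(path, exist_ok=True)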


codecov bot commented Jul 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (fdd8977) to head (60b6b6a).

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #99   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           41        41           
  Lines         2401      2410    +9     
=========================================
+ Hits          2401      2410    +9     


@delsner delsner changed the title from "Fsspec for storage backends" to "feat: Implement support for multiple storage backends" on Jul 17, 2025
@github-actions github-actions bot added the "enhancement" label on Jul 17, 2025

@delsner delsner left a comment


Nice progress, thanks!

Comment on lines 682 to 684
fs, _ = fsspec.url_to_fs(
    directory if isinstance(directory, str) else str(directory)
)

Suggested change
-fs, _ = fsspec.url_to_fs(
-    directory if isinstance(directory, str) else str(directory)
-)
+fs, _ = fsspec.url_to_fs(str(directory))

Isn't this the same?


Yes, thanks for catching that.


except FileExistsError:
    pass

with fs.open(f"{directory}{fs.sep}schema.json", "w") as f:

I think we need to ensure that directory doesn't already end with a separator here.


Done with directory = directory.rstrip(fs.sep)

Comment on lines 870 to 872
if isinstance(directory, Path):
    directory = str(directory)
fs, _ = fsspec.url_to_fs(directory)

Suggested change
-if isinstance(directory, Path):
-    directory = str(directory)
-fs, _ = fsspec.url_to_fs(directory)
+fs, _ = fsspec.url_to_fs(str(directory))

 # First, we check whether the path provides the serialization of the collection.
 # If it does, we check whether it matches this collection. If it does, we assume
 # that the data adheres to the collection and we do not need to run validation.
-if (json_serialization := directory / "schema.json").exists():
-    metadata = json_serialization.read_text()
+if fs.exists(json_serialization := f"{directory}{fs.sep}schema.json"):
@delsner delsner Jul 17, 2025


See above: I think we need to make sure that directory doesn't already have a trailing path separator.


Same as above

pyproject.toml Outdated
Comment on lines 86 to 91
module = ["pyarrow.*", "pytest_benchmark.*", "sklearn.*"]

[[tool.mypy.overrides]]
module = ["fsspec.*"]
follow_untyped_imports = true


Suggested change
-module = ["pyarrow.*", "pytest_benchmark.*", "sklearn.*"]
-[[tool.mypy.overrides]]
-module = ["fsspec.*"]
-follow_untyped_imports = true
+module = ["fsspec.*", "pyarrow.*", "pytest_benchmark.*", "sklearn.*"]


I think we also need to add fsspec in pyproject.toml as a dependency.

-        os: [ubuntu-latest, windows-latest]
+        os: [ubuntu-latest]
         environment: [py310, py311, py312, py313]
     services:


Services are only supported on Linux runners in GitHub CI. I didn't find a way to keep this DRY, so I just created two jobs.


Out of curiosity: did you try using docker compose up minio -d --wait instead of using GitHub actions services? This might also work on Windows.


def check_platform(path_fixture: str) -> None:
    if platform.system() == "Windows" and path_fixture == "s3_path":
        pytest.skip("Skipping because Minio is not set up in Windows CI")

I think we should just skip s3 tests if no s3 envvars are set to be able to run tests locally without starting minio.


That makes more sense, will adjust :)
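
A minimal sketch of that idea (the environment variable names are assumptions for illustration, not necessarily the ones the PR ends up using):

import os

import pytest


def check_s3_configured() -> None:
    """Skip S3-backed tests unless the S3 environment is configured."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINT_URL")
    if not all(os.environ.get(name) for name in required):
        pytest.skip("S3 environment variables are not set; skipping S3 tests")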

@FrithjofWinkelmannQC FrithjofWinkelmannQC marked this pull request as ready for review July 18, 2025 16:04

@delsner delsner left a comment


Thanks, just a few smaller comments.

@@ -32,14 +32,55 @@ jobs:
       - name: pre-commit
         run: pixi run pre-commit-run --color=always --show-diff-on-failure
 
-  unit-tests:
+  unit-tests-linux:
     name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }}

Suggested change
-    name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }}
+    name: Unit Tests (Linux) - ${{ matrix.environment }}


@@ -14,13 +14,47 @@ permissions:
 
 jobs:
   unit-tests:
-    name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }})
+    name: Unit Tests (Linuxs)

Suggested change
-    name: Unit Tests (Linuxs)
+    name: Unit Tests (Linux)

@@ -32,14 +32,55 @@ jobs:
       - name: pre-commit
         run: pixi run pre-commit-run --color=always --show-diff-on-failure
 
-  unit-tests:
+  unit-tests-linux:
     name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }}
     timeout-minutes: 30
     runs-on: ${{ matrix.os }}

Do we need the matrix here at all?

def s3_path() -> str:
    """Fixture to provide a mock S3 path for testing."""
    s3 = boto3.resource("s3")
    bucket_name = "".join(random.choice(string.ascii_lowercase) for _ in range(10))

We could just use a uuid here, wdyt?
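
A possible shape of that change (the "test-" prefix is an assumption; hex digits are already lowercase, as S3 bucket names require):

import uuid

# Unique enough for throwaway test buckets, without the random/string boilerplate.
bucket_name = f"test-{uuid.uuid4().hex[:16]}"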


Could we please also test that everything works if a directory path with trailing / is passed?
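A sketch of what such a test could look like (illustrative only: it exercises the path handling directly through fsspec rather than the collection's actual write/read entry points):

import fsspec
import pytest


@pytest.mark.parametrize("suffix", ["", "/"])
def test_directory_with_optional_trailing_separator(tmp_path, suffix):
    directory = str(tmp_path) + suffix
    fs, _ = fsspec.url_to_fs(directory)
    # Normalize before joining so neither variant produces ".../prefix//schema.json".
    directory = directory.rstrip(fs.sep)
    with fs.open(f"{directory}{fs.sep}schema.json", "w") as f:
        f.write("{}")
    assert fs.exists(f"{directory}{fs.sep}schema.json")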


Does polars support anything else than s3:// paths beyond regular file paths? Should we test for those as well (e.g., file:// or hf://)?

@AndreasAlbertQC (Collaborator)

Hey @FrithjofWinkelmannQC, I wanted to check in with you on how we should proceed with this PR. A little background info: @borchero and I have been working towards integrating dataframely with deltalake, which we think is a generally useful storage backend. deltalake stores tables as parquet files and offers some nice advantages over storing raw parquet files, which we want to take advantage of (also outside of dataframely). To this end, I am currently preparing #109, which would factor the storage specifics out of Schema and Collection and make it easier to add new storage types.

Anyway, my point is: if we have deltalake support, S3 support is included for free. Instead of .write_parquet and .read_parquet, it would then be .write_delta and .read_delta (see e.g. the polars docs). For most use cases, this should be a transparent change compared to the parquet versions. What do you think? Happy to also chat about it offline.
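
A rough sketch of what that would look like from the user's side (requires the deltalake package; the local path is only for illustration, and an s3:// URI would work the same way given credentials in the environment):

import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Writing and reading a Delta table with polars; swapping the local path for an
# s3://bucket/prefix URI is what would give S3 support "for free".
uri = "/tmp/example-delta-table"
df.write_delta(uri, mode="overwrite")
roundtrip = pl.read_delta(uri)
assert roundtrip.shape == df.shape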

Labels
enhancement
Development

Successfully merging this pull request may close these issues.

Extend write_parquet and read/scan_parquet for Collection and Schema beyond filesystem locations