-
Notifications
You must be signed in to change notification settings - Fork 12
feat: Implement support for multiple storage backends #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
dataframely/collection.py
Outdated
assert isinstance(fs, fsspec.AbstractFileSystem) | ||
try: | ||
fs.makedir(directory, create_parents=True) | ||
except FileExistsError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs.makedir
has no exists_ok
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try fs.makedirs
? It seems to provide that parameter.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #99 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 41 41
Lines 2401 2410 +9
=========================================
+ Hits 2401 2410 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress, thanks!
dataframely/collection.py
Outdated
fs, _ = fsspec.url_to_fs( | ||
directory if isinstance(directory, str) else str(directory) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs, _ = fsspec.url_to_fs( | |
directory if isinstance(directory, str) else str(directory) | |
) | |
fs, _ = fsspec.url_to_fs(str(directory)) |
Isn't this the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for catching
dataframely/collection.py
Outdated
assert isinstance(fs, fsspec.AbstractFileSystem) | ||
try: | ||
fs.makedir(directory, create_parents=True) | ||
except FileExistsError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try fs.makedirs
? It seems to provide that parameter.
except FileExistsError: | ||
pass | ||
|
||
with fs.open(f"{directory}{fs.sep}schema.json", "w") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to ensure that directory
doesn't already end with a separator here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with directory = directory.rstrip(fs.sep)
dataframely/collection.py
Outdated
if isinstance(directory, Path): | ||
directory = str(directory) | ||
fs, _ = fsspec.url_to_fs(directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(directory, Path): | |
directory = str(directory) | |
fs, _ = fsspec.url_to_fs(directory) | |
fs, _ = fsspec.url_to_fs(str(directory)) |
# First, we check whether the path provides the serialization of the collection. | ||
# If it does, we check whether it matches this collection. If it does, we assume | ||
# that the data adheres to the collection and we do not need to run validation. | ||
if (json_serialization := directory / "schema.json").exists(): | ||
metadata = json_serialization.read_text() | ||
if fs.exists(json_serialization := f"{directory}{fs.sep}schema.json"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.a. I think we need to make sure that directory
doesn't already have a trailing path separator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
pyproject.toml
Outdated
module = ["pyarrow.*", "pytest_benchmark.*", "sklearn.*"] | ||
|
||
[[tool.mypy.overrides]] | ||
module = ["fsspec.*"] | ||
follow_untyped_imports = true | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module = ["pyarrow.*", "pytest_benchmark.*", "sklearn.*"] | |
[[tool.mypy.overrides]] | |
module = ["fsspec.*"] | |
follow_untyped_imports = true | |
module = ["fsspec.*", "pyarrow.*", "pytest_benchmark.*", "sklearn.*"] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to add fsspec
in pyproject.toml
as a dependency.
os: [ubuntu-latest, windows-latest] | ||
os: [ubuntu-latest] | ||
environment: [py310, py311, py312, py313] | ||
services: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Services are only supported on linux OS in Github CI. I didnt find a way to dry, so I just created two jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: did you try using docker compose up minio -d --wait
instead of using GitHub actions services
? This might also work on Windows.
|
||
def check_platform(path_fixture: str) -> None: | ||
if platform.system() == "Windows" and path_fixture == "s3_path": | ||
pytest.skip("Skipping because Minio is not set up in Windows CI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just skip s3 tests if no s3 envvars are set to be able to run tests locally without starting minio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes more sense, will adjust :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a few smaller comments.
@@ -32,14 +32,55 @@ jobs: | |||
- name: pre-commit | |||
run: pixi run pre-commit-run --color=always --show-diff-on-failure | |||
|
|||
unit-tests: | |||
unit-tests-linux: | |||
name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }} | |
name: Unit Tests (Linux) - ${{ matrix.environment }} |
os: [ubuntu-latest, windows-latest] | ||
os: [ubuntu-latest] | ||
environment: [py310, py311, py312, py313] | ||
services: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: did you try using docker compose up minio -d --wait
instead of using GitHub actions services
? This might also work on Windows.
@@ -14,13 +14,47 @@ permissions: | |||
|
|||
jobs: | |||
unit-tests: | |||
name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) | |||
name: Unit Tests (Linuxs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: Unit Tests (Linuxs) | |
name: Unit Tests (Linux) |
@@ -32,14 +32,55 @@ jobs: | |||
- name: pre-commit | |||
run: pixi run pre-commit-run --color=always --show-diff-on-failure | |||
|
|||
unit-tests: | |||
unit-tests-linux: | |||
name: Unit Tests (${{ matrix.os == 'ubuntu-latest' && 'Linux' || 'Windows' }}) - ${{ matrix.environment }} | |||
timeout-minutes: 30 | |||
runs-on: ${{ matrix.os }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the matrix here at all?
def s3_path() -> str: | ||
"""Fixture to provide a mock S3 path for testing.""" | ||
s3 = boto3.resource("s3") | ||
bucket_name = "".join(random.choice(string.ascii_lowercase) for _ in range(10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just use a uuid
here, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please also test that everything works if a directory path with trailing /
is passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does polars support anything else than s3://
paths beyond regular file paths? Should we test for those as well (e.g., file://
or hf://
)?
Hey @FrithjofWinkelmannQC, I wanted to check in with you on how we should proceed with this PR. A little background info: @borchero and I have been working towards integrating dataframely with Anyway, my point is: If we have |
Motivation
Closes #87.
This PR introduces
fsspec
to support different storage backends for IO methods. For now, we assume all relevant access parameters are configured via environment variables.Changes
fsspec
to dependencies and modified IO calls incollection.py
to use that instead of the native python library