
[FEAT] Overwrite mode for write parquet/csv #3108

Merged 16 commits into main from colin/overwrite-writes on Nov 6, 2024

Conversation

colin-ho (Contributor) commented Oct 23, 2024

Addresses: #3112 and #1768

Implements overwrite mode for write_parquet and write_csv.

Upon finishing the write, we are left with a manifest of written file paths. We can use this manifest to delete all files that are not in it (a rough sketch of the flow follows the notes below), by:

  1. Do an `ls` to find all files currently in the root dir.
  2. Use Daft's built-in `is_in` expression to compute the file paths to delete.
  3. Delete them.

Notes:

  • Relies on fsspec for the `ls` and `rm` functionality. This is favored over the PyArrow filesystem because fsspec's `rm` is a bulk-delete method, i.e. the delete can be done in a single API call; the PyArrow filesystem does not have bulk deletes.
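
Below is a minimal sketch of the flow, assuming an fsspec filesystem and a hypothetical `written_file_paths` list produced by the write; it illustrates the approach rather than the exact code in this PR:

```python
import daft
import fsspec


def overwrite_stale_files(root_dir: str, written_file_paths: list[str]) -> None:
    # Infer an fsspec filesystem for the root directory.
    fs, _, _ = fsspec.get_fs_token_paths(root_dir)

    # 1. List all files currently under the root dir.
    current_files = fs.find(root_dir)

    # 2. Use Daft's is_in expression to keep only the paths NOT in the manifest.
    df = daft.from_pydict({"path": current_files})
    to_delete = df.where(~daft.col("path").is_in(written_file_paths)).to_pydict()["path"]

    # 3. Bulk-delete the stale files in a single rm call.
    if to_delete:
        fs.rm(to_delete)
```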

@github-actions github-actions bot added the enhancement New feature or request label Oct 23, 2024
codspeed-hq bot commented Oct 23, 2024

CodSpeed Performance Report

Merging #3108 will not alter performance

Comparing colin/overwrite-writes (5fdc9d9) with main (5b450fb)

Summary

✅ 17 untouched benchmarks

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 78.63%. Comparing base (c69ee3f) to head (5fdc9d9).
Report is 41 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| daft/dataframe/dataframe.py | 77.77% | 2 Missing ⚠️ |
| daft/filesystem.py | 93.33% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3108      +/-   ##
==========================================
- Coverage   78.65%   78.63%   -0.02%     
==========================================
  Files         618      621       +3     
  Lines       73192    74150     +958     
==========================================
+ Hits        57568    58310     +742     
- Misses      15624    15840     +216     
| Files with missing lines | Coverage | Δ |
|---|---|---|
| daft/filesystem.py | 70.62% <93.33%> | +1.90% ⬆️ |
| daft/dataframe/dataframe.py | 86.48% <77.77%> | -0.09% ⬇️ |

... and 34 files with indirect coverage changes

@colin-ho colin-ho requested review from jaychia and samster25 October 24, 2024 19:58
@@ -513,6 +514,7 @@ def write_parquet(
self,
root_dir: Union[str, pathlib.Path],
compression: str = "snappy",
write_mode: str = "append",
Contributor

Can we use Union[Literal["append"], Literal["overwrite"]]?
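
As a small illustration of the suggestion, `Literal["append", "overwrite"]` is the shorthand equivalent of that Union; the stub function below is hypothetical and only demonstrates the annotation:

```python
from typing import Literal


def write_parquet_stub(write_mode: Literal["append", "overwrite"] = "append") -> None:
    # A type checker such as mypy will flag calls like
    # write_parquet_stub(write_mode="overwite"), which a plain `str` hint would not.
    ...
```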

else:
raise NotImplementedError(f"Cannot infer Fsspec filesystem for protocol {protocol}: please file an issue!")

return fs
Contributor

I would avoid fsspec if possible here so we can avoid taking a dependency on it.

I'm also not sure if the bulk delete API has any performance benefit over a serial delete call... I guess for the services (e.g. S3) we can parallelize DELETE requests over the wire.

Ideally we can support DELETE on our own IO clients, but in the absence of that shall we use PyArrow instead and just naively delete one-by-one?

Contributor Author

Yeah, we can use PyArrow; it should be easier to implement anyway since we already have the machinery to infer a PyArrow filesystem. I can also run some tests to see if parallelizing the deletes makes sense.
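
A minimal sketch of what PyArrow-based deletion could look like, assuming a `pyarrow.fs.FileSystem` has already been inferred; the `delete_paths` helper and the thread-pool parallelism are illustrative, not the code from this PR:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.fs as pafs


def delete_paths(fs: pafs.FileSystem, paths: list[str], max_workers: int = 8) -> None:
    # PyArrow filesystems have no bulk-delete API, so each path gets its own
    # delete_file call; a thread pool overlaps per-request latency for remote
    # stores such as S3.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fs.delete_file, paths))
```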

@@ -353,3 +358,22 @@ def join_path(fs: pafs.FileSystem, base_path: str, *sub_paths: str) -> str:
return os.path.join(base_path, *sub_paths)
else:
return f"{base_path.rstrip('/')}/{'/'.join(sub_paths)}"


def overwrite_files(
Member

It doesn't look like we're supporting partition overwrite mode here, where we only delete files in a partition if we wrote a new file into that partition.

Could we leverage what @desmondcheongzx is working on for hive-style reads to be able to do this?

Contributor Author

Probably, but we could also just match on the directory paths and delete accordingly. Will make a separate PR for this.
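
A hedged sketch of the "match on the directory paths" idea: a file is treated as stale only if its partition directory received new data in this write. The helper name and inputs below are illustrative:

```python
import os


def stale_partition_files(existing_files: list[str], written_files: list[str]) -> list[str]:
    written = set(written_files)
    # Partitions touched by this write, identified by their directory paths.
    touched_partitions = {os.path.dirname(p) for p in written_files}
    # A file is stale only if its partition was rewritten and it is not part
    # of the new manifest.
    return [
        p for p in existing_files
        if os.path.dirname(p) in touched_partitions and p not in written
    ]
```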

@colin-ho colin-ho merged commit 0d669ca into main Nov 6, 2024
42 checks passed
@colin-ho colin-ho deleted the colin/overwrite-writes branch November 6, 2024 17:14