Add gzip support to fwrite #3278

philippechataignon · 2019-01-12T22:27:06Z

This is a first attempt to implement gzipped csv output in fwrite (issue #2016).

It uses the zlib library and replaces the open/write/close calls used for the csv file with gzopen/gzwrite/gzclose for gzipped csv. A second buffer, about 10% larger than the main buffer, is allocated for each thread. zlib is thread-safe and the gzip compression uses all the available threads.

Option compress="gzip" is added to fwrite and is automatically set when file ends with .gz. Default is compress = "none".

On my system (Debian Linux), r-base-dev installs zlib1g-dev, which is needed to compile. I don't need to add -lz for gcc to compile, but that may not be true on other systems, or on Windows or Mac. I guess that file cc.R must be modified to declare the zlib dependency, but that's too hard for me.

I've added 2 tests, but it seems difficult to test binary output such as a gzipped csv. The tests use the zcat command and run only on Unix platforms.

Please feel free to test.

Use zlib and the gzopen/gzwrite/gzclose functions to write the buffer directly to a gzipped csv file. zlib is thread-safe and the gzip compression uses the fwrite threads. Option compress="gzip" is added to fwrite and is automatically set when file ends with ".gz"

codecov · 2019-01-12T22:38:53Z

Codecov Report

Merging #3278 into master will decrease coverage by 0.13%.
The diff coverage is 47.82%.

@@            Coverage Diff             @@
##           master    #3278      +/-   ##
==========================================
- Coverage   94.81%   94.68%   -0.14%     
==========================================
  Files          65       65              
  Lines       12094    12125      +31     
==========================================
+ Hits        11467    11480      +13     
- Misses        627      645      +18

Impacted Files	Coverage Δ
R/fwrite.R	`96.55% <100%> (+0.18%)`	⬆️
src/fwriteR.c	`94.95% <100%> (+0.04%)`	⬆️
src/fwrite.c	`87.78% <41.46%> (-3.23%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7d2acb5...041bb4a.

codecov · 2019-01-12T22:38:54Z

Codecov Report

Merging #3278 into master will decrease coverage by 0.08%.
The diff coverage is 60.86%.

@@            Coverage Diff             @@
##           master    #3278      +/-   ##
==========================================
- Coverage   94.81%   94.72%   -0.09%     
==========================================
  Files          65       65              
  Lines       12094    12125      +31     
==========================================
+ Hits        11467    11486      +19     
- Misses        627      639      +12

Impacted Files	Coverage Δ
R/fwrite.R	`96.55% <100%> (+0.18%)`	⬆️
src/fwriteR.c	`94.95% <100%> (+0.04%)`	⬆️
src/fwrite.c	`89.02% <56.09%> (-1.99%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7d2acb5...67a9c36.

jangorecki

Very nice contribution. I reviewed mostly the R code and UI. It would also be better to avoid formatting-only changes, such as removing empty lines.

jangorecki · 2019-01-13T02:33:30Z

R/fwrite.R

    isLOGICAL(col.names), isLOGICAL(append), isLOGICAL(row.names),
    isLOGICAL(verbose), isLOGICAL(showProgress), isLOGICAL(logical01),
    length(na) == 1L, #1725, handles NULL or character(0) input
    is.character(file) && length(file)==1L && !is.na(file),
    length(buffMB)==1L && !is.na(buffMB) && 1L<=buffMB && buffMB<=1024,
    length(nThread)==1L && !is.na(nThread) && nThread>=1L
    )
+
+  is_gzip <- compress == "gzip" || grepl("\\.gz$", file)


Nice to recognise the filename, but we could warn if the user explicitly sets compress to none and uses a .gz filename.

philippechataignon · 2019-01-13

In commit 75af89e, I propose this approach. In fwrite, compress now has 3 options:

* default: gzip if file ends with .gz, else csv
* none: force csv
* gzip: force gzip

jangorecki · 2019-01-13

"auto" might be better than "default".
Anyway, a new argument is not required, as we might do:

is_gzip <- compress == "gzip" || (missing(compress) && grepl("\\.gz$", file))
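The three-option behaviour discussed in this thread (with the value named "auto", as in the later commit 3fb7ecf) can be sketched as a small decision function. This is only an illustration written in C, the language of src/fwrite.c; in the PR the check lives in R/fwrite.R, and the function names `ends_with_gz` and `use_gzip` are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `file` ends with ".gz", like grepl("\\.gz$", file) in R. */
bool ends_with_gz(const char *file) {
    size_t n = strlen(file);
    return n >= 3 && strcmp(file + n - 3, ".gz") == 0;
}

/* Hypothetical resolver for the compress option:
   "gzip" forces gzip, "none" forces plain csv,
   anything else (i.e. "auto") follows the file extension. */
bool use_gzip(const char *compress, const char *file) {
    if (strcmp(compress, "gzip") == 0) return true;
    if (strcmp(compress, "none") == 0) return false;
    return ends_with_gz(file);
}
```

With this shape, an explicit "none" combined with a .gz filename yields plain csv, which is the case where the review suggests a warning could be raised.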


jangorecki · 2019-01-13T11:14:49Z

inst/tests/tests.Rraw

+if (.Platform$OS.type=="unix") {
+  f <- tempfile()
+  fwrite(data.table(a=c(1:3), b=c(1:3)), file=f, compress="gzip")
+  test(1658.38, system(paste("zcat", f), intern=T), output='[1] "a,b" "1,1" "2,2" "3,3"')


There is a manual check that will make these lines (and the next test too) fail, because they use T instead of TRUE; see data.table/CRAN_Release.cmd, line 61 in 7d2acb5:

# No T or F symbols in tests.Rraw. 24 valid F (quoted, column name or in data) and 1 valid T at the time of writing

philippechataignon · 2019-01-14T08:08:17Z

Thanks for the reviews. I will close this PR for now to improve this enhancement. I saw 2 problems:

* append is currently not implemented for gzip: I don't know if it makes sense to append to a gzip file.
* the approach with gzwrite creates a bottleneck: only one thread compresses. Compressing in each thread would be a better solution, but it requires modifying the current buffer management.

st-pasha · 2019-01-14T09:01:45Z

There is a program called pigz which implements parallel writing of gzip archives. It works on top of zlib: each thread compresses its own block of data and then issues a Z_SYNC_FLUSH, which allows the compressed blocks to be concatenated. The same strategy can be used here.

Philippe Chataignon added 3 commits January 12, 2019 21:56

Add gzip support to fwrite

f1a9bc6

Use zlib and the gzopen/gzwrite/gzclose functions to write the buffer directly to a gzipped csv file. zlib is thread-safe and the gzip compression uses the fwrite threads. Option compress="gzip" is added to fwrite and is automatically set when file ends with ".gz"

Add compress= option in fwrite documentation

6053b7c

Add tests for fwrite with compress="gzip" option

73947c6

philippechataignon added enhancement fwrite labels Jan 12, 2019

Rewrite test 1658.12

041bb4a

jangorecki reviewed Jan 13, 2019

Philippe Chataignon added 4 commits January 13, 2019 10:19

Add default option in compress

75af89e

In fwrite, compress has now 3 options : * default : gzip if file ends with .gz, else csv * none : force csv * gzip : force gzip

Adapt fwrite compress option documentation

b3c2ae9

Replace 'default' by 'auto' in fwrite compress option

3fb7ecf

Tests for gzip compression in fwrite and restore tests 1658.11,12

da79731

jangorecki reviewed Jan 13, 2019

#endif was in wrong place

67a9c36

philippechataignon closed this Jan 14, 2019

philippechataignon mentioned this pull request Jan 16, 2019

Fwrite gzip #3288

Merged
