-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add gzip support to fwrite #3278
Add gzip support to fwrite #3278
Conversation
Use zlib and gzopen/gzwrite/gzclose function to write buffer directly in a gzipped csv file. zlib is thread-safe and the gzip compression use the fwrite threads. Option compress="gzip" is added to fwrite et is automatically set when file ends with ".gz"
Codecov Report
@@ Coverage Diff @@
## master #3278 +/- ##
==========================================
- Coverage 94.81% 94.68% -0.14%
==========================================
Files 65 65
Lines 12094 12125 +31
==========================================
+ Hits 11467 11480 +13
- Misses 627 645 +18
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #3278 +/- ##
==========================================
- Coverage 94.81% 94.72% -0.09%
==========================================
Files 65 65
Lines 12094 12125 +31
==========================================
+ Hits 11467 11486 +19
- Misses 627 639 +12
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice contribution. Reviewed mostly R code and UI. Also would be better to avoid formatting code like removing empty newlines.
R/fwrite.R
Outdated
isLOGICAL(col.names), isLOGICAL(append), isLOGICAL(row.names), | ||
isLOGICAL(verbose), isLOGICAL(showProgress), isLOGICAL(logical01), | ||
length(na) == 1L, #1725, handles NULL or character(0) input | ||
is.character(file) && length(file)==1L && !is.na(file), | ||
length(buffMB)==1L && !is.na(buffMB) && 1L<=buffMB && buffMB<=1024, | ||
length(nThread)==1L && !is.na(nThread) && nThread>=1L | ||
) | ||
|
||
is_gzip <- compress == "gzip" || grepl("\\.gz$", file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to recognise filename but we could warn if user explicitly set compress to none and uses gz filename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit 75af89e, I propose this approch :
In fwrite, compress has now 3 options :
- default : gzip if file ends with .gz, else csv
- none : force csv
- gzip : force gzip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better "auto" instead of "default".
Anyway new argument is not required as we might do:
is_gzip <- compress == "gzip" || (missing(compress) && grepl("\\.gz$", file))
In fwrite, compress has now 3 options : * default : gzip if file ends with .gz, else csv * none : force csv * gzip : force gzip
if (.Platform$OS.type=="unix") { | ||
f <- tempfile() | ||
fwrite(data.table(a=c(1:3), b=c(1:3)), file=f, compress="gzip") | ||
test(1658.38, system(paste("zcat", f), intern=T), output='[1] "a,b" "1,1" "2,2" "3,3"') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is manual check that will make those line (and next test too) fail, related to T
instead of TRUE
, see
Line 61 in 7d2acb5
# No T or F symbols in tests.Rraw. 24 valid F (quoted, column name or in data) and 1 valid T at the time of writing |
Thanks for reviews. I will close this PR for now to improve this enhancement. I saw 2 problems :
|
There is a program called |
This is a first attempt to implement gzipped csv output in fwrite (issue #2016).
It uses zlib library and replaces open/write/close used for csv file by gzopen/gzwrite/gzclose for gzipped csv. A second buffer, which size is around 10% bigger than main buffer, is allocated for each thread. zlib is thread-safe and the gzip compression use all the available threads.
Option
compress="gzip"
is added to fwrite and is automatically set when file ends with.gz
. Default iscompress = "none"
.On my system (Debian Linux), r-base-dev install zlib1g-dev, which is needed to compile. I don't need to add a -lz for gcc to compile but that may be not true for others systems and on Windows or Mac. I guess that file
cc.R
must be modified to indicate the zlib dependence but it's too hard for me.I've added 2 tests but it seems to be difficult to test a binary output like a gzipped csv. Test uses command
zcat
and runs only on unix platforms.Please feel free to test.