Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gzip support to fwrite #3278

Closed

Conversation

philippechataignon
Copy link
Contributor

@philippechataignon philippechataignon commented Jan 12, 2019

This is a first attempt to implement gzipped csv output in fwrite (issue #2016).

It uses zlib library and replaces open/write/close used for csv file by gzopen/gzwrite/gzclose for gzipped csv. A second buffer, which size is around 10% bigger than main buffer, is allocated for each thread. zlib is thread-safe and the gzip compression use all the available threads.

Option compress="gzip" is added to fwrite and is automatically set when file ends with .gz. Default is compress = "none".

On my system (Debian Linux), r-base-dev install zlib1g-dev, which is needed to compile. I don't need to add a -lz for gcc to compile but that may be not true for others systems and on Windows or Mac. I guess that file cc.R must be modified to indicate the zlib dependence but it's too hard for me.

I've added 2 tests but it seems to be difficult to test a binary output like a gzipped csv. Test uses command zcat and runs only on unix platforms.

Please feel free to test.

Philippe Chataignon added 3 commits January 12, 2019 21:56
Use zlib and gzopen/gzwrite/gzclose function to write buffer directly in
a gzipped csv file.

zlib is thread-safe and the gzip compression use the fwrite threads.

Option compress="gzip" is added to fwrite et is automatically set when file ends
with ".gz"
@codecov
Copy link

codecov bot commented Jan 12, 2019

Codecov Report

Merging #3278 into master will decrease coverage by 0.13%.
The diff coverage is 47.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3278      +/-   ##
==========================================
- Coverage   94.81%   94.68%   -0.14%     
==========================================
  Files          65       65              
  Lines       12094    12125      +31     
==========================================
+ Hits        11467    11480      +13     
- Misses        627      645      +18
Impacted Files Coverage Δ
R/fwrite.R 96.55% <100%> (+0.18%) ⬆️
src/fwriteR.c 94.95% <100%> (+0.04%) ⬆️
src/fwrite.c 87.78% <41.46%> (-3.23%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2acb5...041bb4a. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 12, 2019

Codecov Report

Merging #3278 into master will decrease coverage by 0.08%.
The diff coverage is 60.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3278      +/-   ##
==========================================
- Coverage   94.81%   94.72%   -0.09%     
==========================================
  Files          65       65              
  Lines       12094    12125      +31     
==========================================
+ Hits        11467    11486      +19     
- Misses        627      639      +12
Impacted Files Coverage Δ
R/fwrite.R 96.55% <100%> (+0.18%) ⬆️
src/fwriteR.c 94.95% <100%> (+0.04%) ⬆️
src/fwrite.c 89.02% <56.09%> (-1.99%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2acb5...67a9c36. Read the comment docs.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice contribution. Reviewed mostly R code and UI. Also would be better to avoid formatting code like removing empty newlines.

R/fwrite.R Outdated
isLOGICAL(col.names), isLOGICAL(append), isLOGICAL(row.names),
isLOGICAL(verbose), isLOGICAL(showProgress), isLOGICAL(logical01),
length(na) == 1L, #1725, handles NULL or character(0) input
is.character(file) && length(file)==1L && !is.na(file),
length(buffMB)==1L && !is.na(buffMB) && 1L<=buffMB && buffMB<=1024,
length(nThread)==1L && !is.na(nThread) && nThread>=1L
)

is_gzip <- compress == "gzip" || grepl("\\.gz$", file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to recognise filename but we could warn if user explicitly set compress to none and uses gz filename

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit 75af89e, I propose this approch :

In fwrite, compress has now 3 options :

  • default : gzip if file ends with .gz, else csv
  • none : force csv
  • gzip : force gzip

Copy link
Member

@jangorecki jangorecki Jan 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better "auto" instead of "default".
Anyway new argument is not required as we might do:

is_gzip <- compress == "gzip" || (missing(compress) && grepl("\\.gz$", file))

inst/tests/tests.Rraw Outdated Show resolved Hide resolved
inst/tests/tests.Rraw Outdated Show resolved Hide resolved
man/fwrite.Rd Outdated Show resolved Hide resolved
Philippe Chataignon added 4 commits January 13, 2019 10:19
if (.Platform$OS.type=="unix") {
f <- tempfile()
fwrite(data.table(a=c(1:3), b=c(1:3)), file=f, compress="gzip")
test(1658.38, system(paste("zcat", f), intern=T), output='[1] "a,b" "1,1" "2,2" "3,3"')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is manual check that will make those line (and next test too) fail, related to T instead of TRUE, see

# No T or F symbols in tests.Rraw. 24 valid F (quoted, column name or in data) and 1 valid T at the time of writing

@philippechataignon
Copy link
Contributor Author

philippechataignon commented Jan 14, 2019

Thanks for reviews. I will close this PR for now to improve this enhancement. I saw 2 problems :

  • actually append is not implemented for gzip : I don't know if it has a sense to append a gzip.
  • the approach with gzwrite create a bootleneck : only one thread compress. Gzip in thread would be a better solution but it requires to modify the actual buffer gestion.

@st-pasha
Copy link
Contributor

There is a program called pigz which implements parallel writing of gzip archives. It works on top of zlib: each thread creates compresses its own block of data, and then sends ZLIB_SYNC_FLUSH command, which allows compressed blocks be concatenated. Same strategy can be used here.

@philippechataignon philippechataignon mentioned this pull request Jan 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants