fread for directories #2582
Comments
My thinking was that whenever However I don't think that automatically |
good feedback. actually it's a good point, as some users may want to use
the forthcoming `cbindlist` as well, or to `Reduce(merge)` the items.
|
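To illustrate why a list of tables is a flexible intermediate result, here is a minimal sketch using made-up tables in place of per-file reads. The table contents and the join column `id` are assumptions for illustration only; `rbindlist` and `merge` are existing data.table functions.

```r
library(data.table)

# Made-up tables standing in for the per-file results (illustrative only).
tables <- list(
  a = data.table(id = 1:3, x = c(10, 20, 30)),
  b = data.table(id = 1:3, y = c("u", "v", "w")),
  c = data.table(id = 2:3, z = c(TRUE, FALSE))
)

# Option 1: stack by row, keeping the list names in a source column.
# fill = TRUE pads columns that are missing from some tables with NA.
stacked <- rbindlist(tables, idcol = "source", fill = TRUE)

# Option 2: join the tables pairwise on the shared `id` column (inner join).
merged <- Reduce(function(x, y) merge(x, y, by = "id"), tables)
```

Keeping the list around lets the caller pick whichever combination step fits the data, rather than having the reader decide up front.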
One use-case where I've found R wanting is when the directory contains a very large number of small files (i.e. 100,000 to 1,000,000 files of 1-10 kB). In such cases, |
Interesting use case, did you try to investigate where the bottleneck is? |
It looks something like this, which is not as bad as I remember (Matt, can you stop improving the package? It's ruining my anecdotes.)
[System.IO.Directory]::GetFiles("address", "*.*").Count
463716
system.time(list.files(path = "address",
pattern = "\\.csv$",
full.names = TRUE))
# user system elapsed
# 7.66 1.32 8.99
Files <- list.files(path = "address",
pattern = "\\.csv$",
full.names = TRUE)
system.time(lapply(Files[1:100], fread, fill = TRUE))
# user system elapsed
# 2.38 0.08 2.59
system.time(lapply(Files[1:1e2], fread, sep = ",", colClasses = "character", fill = TRUE))
# user system elapsed
# 1.58 0.10 1.67
system.time(lapply(Files[1:1e4], fread, sep = ",", colClasses = "character", fill = TRUE))
# user system elapsed
# 6.97 5.00 22.67 |
@st-pasha I assume it's in also, relating to your first comment, returning the source attributes in the names would highlight the utility of #1948 as well for manipulating these objects in post. |
Another use case I'm running into (.. not sure how different it is from the preceding): I wrote a helper function to read a csv inside tar.gz like
but now I have tar.gz containing multiple csvs (that should have identical column names and classes), and it seems I'll need to go another way (I guess: run the 7z call then lapply fread on the files it drops > confirm columns match > rbindlist). |
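The extract, read, check, bind sequence described above could be sketched roughly as follows. This is not the helper from the comment (which is not shown here); `fread_tar` and its arguments are hypothetical names, and it swaps the 7z call for base R's `utils::untar`, which is assumed to handle the archive in question.

```r
library(data.table)

# Hypothetical helper (not part of data.table): extract every CSV from a tar
# archive into a temporary directory, check that the columns agree, then stack.
fread_tar <- function(archive, pattern = "\\.csv$", ...) {
  exdir <- tempfile("fread_tar_")
  dir.create(exdir)
  # Remove the extracted files when the function exits.
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)
  utils::untar(archive, exdir = exdir)

  files <- list.files(exdir, pattern = pattern, full.names = TRUE,
                      recursive = TRUE)
  tables <- lapply(files, fread, ...)

  # Column names and (first) classes of a table, as a named character vector.
  signature <- function(dt) vapply(dt, function(col) class(col)[1L], character(1))
  # Compare every table against the first one before binding.
  ok <- vapply(tables, function(dt) identical(signature(dt), signature(tables[[1L]])),
               logical(1))
  if (!all(ok)) {
    stop("columns differ from the first file in: ",
         toString(basename(files[!ok])))
  }
  rbindlist(tables)
}
```

Failing loudly on a mismatch is a design choice for this sketch; passing `fill = TRUE` to `rbindlist` instead would silently pad missing columns.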
Just to have some thoughts written down: There's a pretty simple version of this where we just wrap there's also a substantially more involved version where directory-level Definitely the first version should use the simple approach, but it almost surely won't be faster (for many use cases) than using the terminal to |
IMO it is not good if |
A simple |
Idle musing -- if |
Unless fill=T expected |
Some file I/O APIs I've worked with have a simple idiom for reading full directories:
would be read as, e.g. in `spark`,
A basic idiom has developed for `fread` to do this by adding bells and whistles to the following:
It would be simple enough to wedge directory reading into the `fread` API by changing:
to (pseudocode around the `match.call()` part)
However, it might be nice to build in some flexibility to this, e.g. allowing the `list.files` to optionally be recursive, implementing some API for automatic source naming (if there are subdirectories, and the names of the subdirectories contain information, the manual version of this allows a bit more flexibility), specifying `idcol` or `fill`, etc.
So there's two questions here
`fread`? Just add `...` and post-process if the `is.dir` branch is reached? Separate function call altogether?
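As a rough illustration of the idiom this issue describes (list the files, `fread` each one, bind the results), here is a minimal sketch written as a separate function. `fread_dir` and its argument names are hypothetical and are not part of data.table; the sketch only combines `list.files`, `fread`, and `rbindlist`, and says nothing about how a built-in version would be designed.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read every matching file in a
# directory and stack the results, recording the source file in an id column.
fread_dir <- function(path, pattern = "\\.csv$", recursive = FALSE,
                      idcol = "source", fill = FALSE, ...) {
  stopifnot(dir.exists(path))
  files <- list.files(path, pattern = pattern, full.names = TRUE,
                      recursive = recursive)
  # Name each file by its path relative to `path`, so that subdirectory names
  # end up in the id column when recursive = TRUE.
  names(files) <- list.files(path, pattern = pattern, recursive = recursive)
  # `...` is forwarded to fread(); `fill` here applies to rbindlist() only.
  tables <- lapply(files, fread, ...)
  rbindlist(tables, idcol = idcol, fill = fill)
}

# Usage with two small made-up files in a temporary directory.
demo_dir <- tempfile("fread_dir_demo_")
dir.create(demo_dir)
fwrite(data.table(x = 1:2, y = c("a", "b")), file.path(demo_dir, "a.csv"))
fwrite(data.table(x = 3:4, y = c("c", "d")), file.path(demo_dir, "b.csv"))
res <- fread_dir(demo_dir)
```

A standalone wrapper like this sidesteps the question of changing `fread`'s return type, at the cost of one more exported name.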