Optimise dcast.data.table to reduce peak memory usage (currently: reaches 4x table size) #1069
Comments
I've added this to the list, but note that:

```r
system.time(ans <- melt(dt, id=1:3, measure=list(4:11, 12:19), value=c("A_dd", "B_dd")))
#   user  system elapsed
#  0.035   0.012   0.047
head(ans)
#           IDCol1     IDCol2 IDCol3 variable  A_dd  B_dd
# 1:       Indiana New Jersey      1        1  5.06 21.02
# 2:         Maine  Tennessee      2        1 21.17 22.40
# 3: New Hampshire  Louisiana      3        1 22.14 13.83
# 4:      Virginia     Hawaii      4        1  3.66 14.95
# 5:        Hawaii    Montana      5        1  2.93  8.00
# 6:       Vermont     Hawaii      6        1 18.77 13.70
```

You can order it using `setorder(ans, IDCol1, IDCol2, IDCol3, variable)`.

PS: I really appreciate the time and effort you take to file a report.
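The `dt` being melted here is not shown in this extract; a tiny table with the assumed shape (three id columns, then measure columns 4:11 and 12:19) might look like:

```r
library(data.table)
set.seed(1)
# hypothetical shape only: 3 id columns, then A_1..A_8 and B_1..B_8
dt <- data.table(IDCol1 = sample(state.name, 6),
                 IDCol2 = sample(state.name, 6),
                 IDCol3 = 1:6)
for (col in c(paste0("A_", 1:8), paste0("B_", 1:8)))
  dt[, (col) := round(runif(6, 0, 25), 2)]
```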
You're most welcome, Arun! And I am very excited that you've taken this up so quickly!

From my little experiment above, it appears that melting is the more memory-efficient route. Among the many discussions about R on SO (not just for data.table), memory usage is a recurring concern. Unfortunately, there is a physical limit to how much RAM a machine can handle: my local machine's motherboard maxes out at 16GB, while the office's shared workstation has 96GB of RAM (not sure what its maximum capacity is, though).
On memory, I think we've made attempts in numerous cases to optimise for both speed and memory. I've detailed some of what the syntax offers in this SO post. That is also the reason we implemented things the way we have.
By golly, Arun, I've finally gotten the new 'multi-column' melt working! Back to the memory-efficiency issue for 'melt-process-cast': you're right, it's certainly more memory-efficient to melt first.

Excited, I proceeded to use the feature to handle the case where the measure columns have 2 levels of encoding in their names, but it didn't work (see error message below). Is this possible with the new feature and, if yes, how can I code this? In my dataset I would need to tackle this too, so I hope you can advise.

New code with 2 levels of encoding in column names:
Error Message:
After 2 weeks of being stuck on this data manipulation issue, there's some light... thank you so much!
You're melting into 4 different columns, but providing only two value names.
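The original snippet and data are not preserved in this extract; purely as an illustration of the mismatch described here, with hypothetical 2-level-encoded column names such as `GDP_1980_H1`, a melt that produces four value columns needs four entries in `value.name`:

```r
library(data.table)
set.seed(1)
# hypothetical wide table with a 2-level encoding in the measure column names
dt <- data.table(id = 1:3,
                 GDP_1980_H1 = rnorm(3), GDP_1980_H2 = rnorm(3),
                 GNP_1980_H1 = rnorm(3), GNP_1980_H2 = rnorm(3),
                 GDP_1981_H1 = rnorm(3), GDP_1981_H2 = rnorm(3),
                 GNP_1981_H1 = rnorm(3), GNP_1981_H2 = rnorm(3))

# four measure groups -> four value columns, so value.name needs four entries
long <- melt(dt, id.vars = "id",
             measure.vars = list(grep("^GDP_.*_H1$", names(dt)),
                                 grep("^GDP_.*_H2$", names(dt)),
                                 grep("^GNP_.*_H1$", names(dt)),
                                 grep("^GNP_.*_H2$", names(dt))),
             value.name = c("GDP_H1", "GDP_H2", "GNP_H1", "GNP_H2"))
# 'variable' then indexes the position within each group (1 = 1980, 2 = 1981)
```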
Notes: With the recent update to data.table:

```r
# Using data.table to re-cast L_dt to W_dt
W_dt1 <- dcast.data.table(L_dt, IDCol1 + IDCol2 + IDCol3 + NewCol1 ~ variable,
                          value.var = "value")
gc()
#             used  (Mb) gc trigger  (Mb)  max used  (Mb)
# Ncells    399610  21.4     741108  39.6    411115  22.0
# Vcells  60667030 462.9  108295982 826.3 102697779 783.6
```

and with `reshape2`:

```r
# Using reshape2 to re-cast L_dt to W_dt
W_dt2 <- dcast(L_dt, IDCol1 + IDCol2 + IDCol3 + NewCol1 ~ variable,
               value.var = "value")
gc()
#             used  (Mb) gc trigger   (Mb)  max used   (Mb)
# Ncells    402958  21.6    3170326  169.4   4903021  261.9
# Vcells  64797629 494.4  160766414 1226.6 160267965 1222.8
```

But much better improvements are possible by not having to use this route at all.
Arun, I have implemented my data transformation using the multi-column melt, melting and casting twice along the way.
I don't fully understand why you'd need to melt-cast twice. If you could show an example, that'd help me identify if things could be improved. I'm keeping the FR open as I see further improvements possible for dcast.
Glad to help, and I'm always open to possible ways to improve memory efficiency and my procedure. I needed to melt-cast twice in order to handle wide tables whose column names contain two levels of encoding. Below is a minimal example:
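The original example was not preserved in this extract; a rough sketch of what such data and one possible transformation might look like (made-up column names, and not necessarily the poster's exact two-pass procedure):

```r
library(data.table)
set.seed(1)

# Hypothetical wide table whose measure-column names carry two extra levels
# of encoding (year and half-year) on top of the variable name:
W <- data.table(Country     = c("A", "B", "C"),
                GDP_1980_H1 = rnorm(3), GDP_1980_H2 = rnorm(3),
                GDP_1981_H1 = rnorm(3), GDP_1981_H2 = rnorm(3),
                GNP_1980_H1 = rnorm(3), GNP_1980_H2 = rnorm(3),
                GNP_1981_H1 = rnorm(3), GNP_1981_H2 = rnorm(3))

# 1. melt everything to fully long form
L <- melt(W, id.vars = "Country")

# 2. split the composite column name into its components
L[, c("var", "year", "half") := tstrsplit(as.character(variable), "_")]

# 3. cast back so GDP and GNP become columns, keyed by year and half
tidy <- dcast(L, Country + year + half ~ var, value.var = "value")
```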
In case you're wondering whether such 2-level encoded data exists, here is a theoretical but entirely possible example: column names like the GDP/GNP ones in the sketch above, where GDP and GNP are the 'variables'. If I want the years and H1/H2 to be separate measures, I would need to melt-cast twice.

Data comes in various shapes and sizes, and how to transform it depends on the analyst's requirements. IMHO, your current 'multi-column' melt in v1.9.5 is flexible enough to handle different situations while not being overly complicated in terms of syntax. This is why I suggested previously that no further enhancements in this regard are necessary (of course, if there is a way I can melt just once for my example, please let me know).
Thanks, I see now. It requires more than a single straightforward melt to handle that kind of column-name encoding.
Yes, I agree. Will close this FR after figuring out if it is possible to get the peak memory usage down further.
Will take care not to use it in the meantime.
Hello guys. I had the same problem two months ago, and this was my solution: instead of trying to convert the whole data.table at once, I broke the operation into a loop, reading a few lines at a time and saving the result to a file.
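The poster's loop itself isn't shown in this extract; a rough sketch of that kind of chunked, file-based approach (file names, chunk size and cast formula are made up for illustration) might look like:

```r
library(data.table)

# Assumes rows belonging to one output row are contiguous in the input file,
# and that every chunk contains the same set of 'variable' levels.
infile     <- "long_format.csv"   # hypothetical input
outfile    <- "wide_format.csv"   # hypothetical output
chunk_size <- 1e6L

header  <- names(fread(infile, nrows = 0L))   # column names only
n_total <- nrow(fread(infile, select = 1L))   # cheap row count (one column)
skip    <- 0L

while (skip < n_total) {
  chunk <- fread(infile, skip = skip + 1L, header = FALSE, col.names = header,
                 nrows = min(chunk_size, n_total - skip))
  wide <- dcast(chunk, IDCol1 + IDCol2 + IDCol3 ~ variable, value.var = "value")
  fwrite(wide, outfile, append = skip > 0L, col.names = skip == 0L)
  skip <- skip + nrow(chunk)
  rm(chunk, wide); gc()   # free memory between chunks
}
```

This way the full wide table never has to sit in memory at once.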
On current master:
i.e. similar to Arun's post on 2015-03-16.
I am running into this issue repeatedly, with the R session terminating due to memory exhaustion. I have started using this function as a replacement (a trade-off with speed).

Hoping to see a fast solution.
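The replacement function itself isn't preserved in this extract; purely as an illustration of the speed-for-memory trade-off described, one possible batched cast (assumed id/variable/value column names) could be:

```r
library(data.table)

# Not the poster's original helper: cast the long table one batch of id values
# at a time, which lowers dcast's working-memory peak at the cost of extra
# passes and a final rbindlist over all the pieces.
dcast_in_batches <- function(DT, n_batches = 10L) {
  keys    <- unique(DT$IDCol1)
  batches <- split(keys, rep(seq_len(n_batches), length.out = length(keys)))
  parts <- lapply(batches, function(k)
    dcast(DT[IDCol1 %in% k], IDCol1 + IDCol2 + IDCol3 ~ variable,
          value.var = "value"))
  rbindlist(parts, use.names = TRUE, fill = TRUE)
}

# W <- dcast_in_batches(L_dt)   # hypothetical long-format table L_dt
```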
I am using data.table's `melt` and `dcast` functions to melt a wide table, do some transformation and cast it back to a tidy format.

The Issue

I tested with a small file of ~1GB and was very surprised when R ran out of memory on my i7, 16GB machine. I investigated further and realised that `dcast.data.table` hit a peak memory of ~4x the long-format file's size. Running my sample code below, the memory requirements for the process are as follows:

I added `reshape2`'s `dcast` for comparison:

My Request

Can memory usage for `dcast.data.table` be optimised further? `dcast.data.table` is certainly faster than its counterpart, but it appears there is some memory overhead. The 1GB file I tested originally is only ~30% of the full dataset, so a `dcast.data.table` that is more frugal is very much appreciated =)

Sample Data and Code
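The original sample data and code are not preserved in this extract; a sketch of the kind of round trip described, with assumed sizes and column names based on the melt call quoted earlier in the thread:

```r
library(data.table)
set.seed(1)

n <- 1e6L   # scale n up towards the ~1GB case described above
W_dt <- data.table(IDCol1 = sample(state.name, n, TRUE),
                   IDCol2 = sample(state.name, n, TRUE),
                   IDCol3 = seq_len(n))
for (col in c(paste0("A_", 1:8), paste0("B_", 1:8)))
  W_dt[, (col) := round(runif(n, 0, 25), 2)]

# melt -> some transformation -> cast back
L_dt <- melt(W_dt, id.vars = 1:3, measure.vars = list(4:11, 12:19),
             value.name = c("A_dd", "B_dd"))
L_dt[, `:=`(A_dd = A_dd * 2, B_dd = B_dd / 2)]   # placeholder transformation

# recent data.table versions allow multiple value.var columns
W_dt2 <- dcast(L_dt, IDCol1 + IDCol2 + IDCol3 ~ variable,
               value.var = c("A_dd", "B_dd"))
gc()   # 'max used' shows the peak relative to the long table's size
```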