
Adding 1 column to data table nearly doubles peak memory used #1062

Closed
NoviceProg opened this issue Mar 4, 2015 · 8 comments

@NoviceProg

I have a wide CSV with thousands of columns that I fread into R for further transformation. When I added an empty column filled with NAs using :=, R crashed with an "out-of-memory" error. I was initially perplexed, as I have 16GB of RAM and the data.table was only ~9GB in memory before the new column was added.

To investigate further, I created a mock table of 2,500 columns by 200,000 rows. I noticed that peak memory nearly doubles, from 3.8GB to 7.6GB, when a new column is added; only upon running gc() did memory return to 3.8GB. Both ver. 1.9.4 and the latest 1.9.5 exhibit this issue (please refer to the two screenshots attached).
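A scaled-down sketch of this kind of experiment (hypothetical code, not the exact script from the SO post; sizes reduced so it runs quickly) might look like:

```r
library(data.table)

# Build a wide mock table: 50 numeric columns x 1,000 rows
# (the original report used 2,500 columns x 200,000 rows).
ncols <- 50; nrows <- 1000
cols <- setNames(replicate(ncols, rnorm(nrows), simplify = FALSE),
                 paste0("V", seq_len(ncols)))
dt <- as.data.table(cols)

# Add one empty column by reference: := is documented to modify
# the table in place rather than copy it.
dt[, newcol := NA]
```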

I raised the question on Stack Overflow (URL below), as I believe this should not be happening. After some discussion, Arun encouraged me to file this report.

My system: Intel i7-4700 (4-core/8-thread); 16GB DDR3-12800 RAM; Windows 8.1 64-bit; 500GB 7200rpm HDD; 64-bit R; data.table ver. 1.9.4 and 1.9.5.

The code I used to create the mock wide dataset is available at my SO question (nothing special, not repeated here for brevity).

data.table has been an invaluable tool in my R project, and I would like to thank Matthew, Arun and the other contributors for this remarkable package! I hope this issue report can be my little contribution.

https://stackoverflow.com/questions/28347305/r-why-adding-1-column-to-data-table-nearly-doubles-peak-memory-used

(Screenshots attached: 150304 prntscrn for ver. 1.9.5; 150304 prntscrn for ver. 1.9.4)

@NoviceProg
Author

Hi Arun, may I check whether you've had a chance to look at this issue? While working on issue #1069, I ran the latest ver. 1.9.5 on Linux Mint, and the issue persists, so it is not a problem isolated to Windows.

@arunsrinivasan
Member

I know where the issue is, just not sure how to fix it yet :-(. It's in print.data.table():

if (.global$print != "" && address(x) == .global$print) {
    SYS <- sys.calls()
    if ((length(last(SYS)) >= 2L && typeof(last(SYS)[[2L]]) %chin% c("list", "promise")) ||
        (length(SYS) > 3L && SYS[[length(SYS) - 3L]][[1L]] == "knit_print.default")) {
        .global$print = ""
        return(invisible())
    }
}

The line SYS <- sys.calls() seems to be the issue.

@NoviceProg
Author

I see; thanks for the reply! Would you be able to tag it with a label to keep this issue in view? data.table's := update-by-reference concept is invaluable, and I believe many users would, like me, be grateful to see this resolved.

@NoviceProg
Author

Arun, I just tested deleting a column using := NULL, as well as replacing a column with a same-length vector V using := V, with=FALSE. Both displayed the same phenomenon, i.e. peak memory doubled. Would this be due to the same suspected bug you mentioned? Should I file a new issue report?
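For reference, the two operations described above look roughly like this (a minimal sketch with made-up column names, not the code actually tested):

```r
library(data.table)

dt <- data.table(a = 1:5, b = letters[1:5], x = rnorm(5))

dt[, x := NULL]    # delete column x by reference
V <- runif(5)      # replacement vector, same length as nrow(dt)
dt[, b := V]       # replace column b wholesale by reference
```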

@arunsrinivasan
Member

with=FALSE returns a copy, so that's understandable. There's no need to file another issue; the := issue will be gone once this one is fixed. I can't guarantee when we'll get to it.

@NoviceProg
Author

Understand. One can't rush inspiration. It's more important to have a good fix than to rush it.

@arunsrinivasan
Member

@NoviceProg this seems to have been fixed in R v3.2, IIUC, with this item from NEWS:

Auto-printing no longer duplicates objects when printing is dispatched to a method.

Yay! Testing with the data from your SO post shows no rise in memory usage. Could you please test (and close if solved)? Thanks.
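One way to sanity-check this locally is with gc()'s peak-memory counters (a sketch; the table size here is illustrative, and gc(reset = TRUE) resets R's "max used" figures):

```r
library(data.table)

# Build a modest table, reset the peak-memory counters, then add a
# column by reference and re-read the counters.
dt <- as.data.table(replicate(50, rnorm(1e4), simplify = FALSE))
invisible(gc(reset = TRUE))   # reset the "max used" columns
dt[, newcol := NA]            # update by reference
peak <- gc()[, "max used"]    # Ncells / Vcells peaks since the reset
# On R >= 3.2 the peak should stay close to the table's own footprint,
# rather than roughly doubling as it did under older R.
```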

@NoviceProg
Author

Sorry for the delay in replying, Arun. Was traveling a bit for work.

Yes, I confirm that, after upgrading to R v3.2, the issue is fixed. Thanks!
