Memory mapping/nrows #3526

Closed

parayamelo opened this issue Apr 28, 2019 · 14 comments

@parayamelo commented Apr 28, 2019

Hello,

I am using data.table v1.12.2 and trying to read a ~170 GB file with fread. It is a Windows server machine, and it reports 45.3 GB available out of 64 GB of memory. I get the error:
"Opened 168.7GB file ok but could not memory map it. This is a 32bit process. Please upgrade to 64bit."

sessionInfo tells me I am running R v3.5.3 on a 64bit platform. I tried to read fewer rows, i.e., nrows = 100000, but I always get the same error: it maps the entire file to memory, as if it were not recognizing the nrows option. I also tried with fewer threads (the machine has a maximum of 8), but I get the same result. I also tried reading 1 column and fewer rows, but the problem is the same. The calls look roughly like the sketch below.
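A minimal sketch of the attempted calls (the file name is hypothetical; nThread is fread's thread-count argument):

```r
library(data.table)  # v1.12.2

DT <- fread("bigfile.csv", nrows = 100000)                # same error
DT <- fread("bigfile.csv", nrows = 100000, nThread = 4)   # fewer threads: same error
DT <- fread("bigfile.csv", select = 1, nrows = 100000)    # one column, fewer rows: same error
```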

Is there a way around this? Due to server policies, I cannot install anything new, so my hands are tied to what is already installed on the server.

Thank you.

@mattdowle (Member) commented Apr 30, 2019

This part of the error message is reliable: "This is a 32bit process."

> sessionInfo tells me I am running R v3.5.3 on a 64bit platform.

Then either i) wherever you are running sessionInfo() is not the same place you're seeing fread() produce "this is a 32bit process", or ii) you're interpreting the output of sessionInfo() incorrectly. Please post the output of sessionInfo() as the issue template asks; we ask for that so we can help.

Here are two examples from sessionInfo() on Windows:

```
Platform: x86_64-w64-mingw32 (64-bit)
Platform: i386-w64-mingw32/i386 (32-bit)
```

It's the part in brackets at the end that's significant. You need to use the 64bit version of R on your server. Please confirm this solves it.
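To verify from inside the session, a quick check using base R (run it in the same session where fread fails):

```r
sessionInfo()$platform     # should end in "(64-bit)"
.Machine$sizeof.pointer    # 8 in a 64bit process, 4 in a 32bit one
```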

@parayamelo (Author)

Hi Matt,

Thank you for your message.

sessionInfo() indicates:

```
Platform: x86_64-w64-mingw32 (64-bit)
```

@mattdowle (Member) commented May 1, 2019

To double-check, please post the output of .Machine.
Can you produce the fread message that states "This is a 32bit process" in the same R session in which sessionInfo() reports 64bit and .Machine$sizeof.pointer returns 8? I suspect you are running R manually to get sessionInfo(), but the fread call is happening in another process: the two R processes are not being started in the same way, and that's causing one to be 32bit and one to be 64bit. Could that possibly be the case?
Also, please run test.data.table() and paste the full output.
Please also post the full output of sessionInfo().

You could also try .dynLibs() to find the location of the datatable.dll being loaded into your 64bit Windows R session and make sure it is a 64bit DLL. Maybe there's an installation problem on your server.
To give you an idea, here's some output for me below on Linux (it will be different for you on Windows). You would have to research how to find out on Windows whether a .dll is 32bit or 64bit. But I didn't think it was possible to load a 32bit DLL into a 64bit process on Windows.

```
> .dynLibs()
                                                   Filename Dynamic.Lookup
1                /usr/lib/R/library/methods/libs/methods.so          FALSE
2                    /usr/lib/R/library/utils/libs/utils.so          FALSE
3                    /usr/lib/R/library/tools/libs/tools.so          FALSE
4            /usr/lib/R/library/grDevices/libs/grDevices.so          FALSE
5              /usr/lib/R/library/graphics/libs/graphics.so          FALSE
6                    /usr/lib/R/library/stats/libs/stats.so          FALSE
7 /home/mdowle/build/revdeplib/data.table/libs/datatable.so          FALSE
> system("file /home/mdowle/build/revdeplib/data.table/libs/datatable.so")
/home/mdowle/build/revdeplib/data.table/libs/datatable.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=de6181181b58a2decc364521566d77b20a3d37d7, with debug_info, not stripped
>
```
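On Windows, a rough equivalent from within R (a sketch; getLoadedDLLs() is base R, and data.table's DLL is registered under the name "datatable"):

```r
library(data.table)
dll <- getLoadedDLLs()[["datatable"]]
dll[["path"]]
# On a 64bit Windows install the path typically ends in .../libs/x64/datatable.dll;
# a 32bit build would live under .../libs/i386/ instead.
```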

@parayamelo (Author)

I closed everything and opened it again in another R session. I am now getting:
"Opened 168.7GB file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available."
The error is the same if I try to fread the file with 1 column (select=c(1)) or a small number of rows (nrows=10000). I guess there is not enough virtual memory, but why is the issue the same when trying to read only a small number of rows?
I saw someone else posting the same issue:
#2321

.Machine$sizeof.pointer = 8

There is an issue with copy/paste from the server, so I have to post outputs manually.

@jangorecki (Member) commented May 1, 2019

So the initial issue related to 32bit was most likely the result of a different R process running in each case, as Matt suggested. For example, you might have one R set up in your PATH env var and a different R in your shortcut/alias.
So the problem narrows down to the inability to open just part of the file, since opening the whole file couldn't work due to the memory limitation. As you noticed, it is the same issue as #2321, which should already be resolved. Ultimately, providing the file would allow us to investigate this case further.
What you could try is to confirm that fread("file.csv", nrows=10000) raises the error while fread("head -n 10001 file.csv") does not (see the sketch below). Of course using 64bit; I suggest avoiding 32bit R if possible.
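Spelled out, the comparison looks like this ("file.csv" is a placeholder; the cmd= form makes the shell call explicit, and head needs a Unix-like shell on PATH):

```r
library(data.table)
DT1 <- fread("file.csv", nrows = 10000)        # memory maps the whole 170GB file first
DT2 <- fread(cmd = "head -n 10001 file.csv")   # the shell streams only the first 10001 lines
```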

@parayamelo (Author)

Yes, I guess there was an issue with different R processes.
Running a 64bit process, I get the same memory mapping error. When I do fread("file.csv", nrows=10000), I get "Opened 168.7GB file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available."
When I do fread("head -n 10001 file.csv"), I get the error "'head' is not recognized as an internal or external command".

@mattdowle (Member) commented May 1, 2019

Yes, glad you're using 64bit R ok now.
Given the file is 170GB and you have 64GB of RAM, you would probably have to select a subset of columns anyway, even if it did memory map.
You'll either need to split the single 170GB file up into pieces (by date, for example), use a server with more RAM (e.g. EC2 X1 with 1TB), or use a database. It's quite unheard of (and considered bad practice) to have a single file so large. A streaming split is sketched below.
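One way to split without ever holding the file in memory is to stream it through a connection (a minimal sketch; the file name, piece names, and 5-million-line chunk size are all illustrative, and on a Unix-like system the split utility would do the same job):

```r
# Stream "bigfile.csv" into ~5M-line pieces, repeating the header in each,
# so that only one chunk of lines is held in RAM at a time.
con <- file("bigfile.csv", open = "r")
header <- readLines(con, n = 1L)
piece <- 0L
repeat {
  chunk <- readLines(con, n = 5e6L)
  if (length(chunk) == 0L) break
  piece <- piece + 1L
  writeLines(c(header, chunk), sprintf("piece_%03d.csv", piece))
}
close(con)
```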

@parayamelo (Author)

Yes, I understand that I will have to select a subset of columns. When I do that with fread, I get the same error. So my question is: does fread always have to map the entire file to memory?

"It's quite unheard of (and considered bad practice) to have a single file so large." --> Totally agree; unfortunately, I do not generate this file. I am trying to get the file split before any analysis.

@jangorecki (Member)

@parayamelo Start by installing some bash for Windows; it makes life much easier and increases productivity, especially with software like R, and with most open source in general. One of these two should be best: https://stackoverflow.com/questions/771756/what-is-the-difference-between-cygwin-and-mingw

@parayamelo (Author)

Good idea @jangorecki. I will tell the people responsible for the server. I use Linux, so I am more used to bash commands. I am trying to replicate the same error on my Linux machine.

@mattdowle (Member) commented May 1, 2019

> So, my question is, does fread always have to map the entire file to memory?

Yes: virtual memory, though, and it needs to be a contiguous block. The error message correctly states: "There is probably not enough contiguous virtual memory available." It would be technically possible to memory map in chunks, but that would complicate the algorithm considerably. I think our time is better spent elsewhere.

I'd expect a 60-80GB file to memory map ok on your 64GB RAM server, using Windows virtual memory. But 170GB is almost 3x the RAM. There might be some Windows configuration settings you could investigate to increase virtual memory. Then select a small subset of columns, allowing room for the final data.table in RAM too, of course. Unlikely to work, but might be worth a shot; something like the sketch below.
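For the column-subset attempt, a minimal sketch (the column names are hypothetical; note fread still memory maps the whole file, it is only the resulting data.table that shrinks):

```r
DT <- fread("bigfile.csv", select = c("date", "id", "price"))
```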

@parayamelo (Author)

Thanks Matt! I will look for alternative solutions. And thanks for taking the time to look into my issue.

@xiaodaigh

Sorry for hijacking the thread.

@parayamelo perhaps you want to try disk.frame? http://diskframe.com It handles large datasets; please let me know if you run into bugs.

@parayamelo (Author)

Thank you @xiaodaigh. I will take a look at it.
