Memory mapping/nrows #3526

Closed

parayamelo opened this issue Apr 28, 2019 · 14 comments

@parayamelo commented Apr 28, 2019

Hello,

I am using data.table v1.12.2 and trying to read a ~170 GB file with fread. It is a Windows server machine, and it reports 45.3 GB available out of 64 GB of memory. I get the error:
"Opened 168.7GB file ok but could not memory map it. This is a 32bit process. Please upgrade to 64bit."

sessionInfo tells me I am running R v3.5.3 on a 64bit platform. I tried to read fewer rows, i.e., nrows = 100000, but I always get the same error: it maps the entire file to memory, as if it were not recognizing the nrows option. I also tried with fewer threads (the machine has a maximum of 8), but I get the same result. I also tried reading 1 column and fewer rows, but the problem is the same. The calls look roughly like the sketch below.
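A minimal sketch of the attempted calls (the file name is hypothetical; nThread is fread's thread-count argument):

```r
library(data.table)  # v1.12.2

DT <- fread("bigfile.csv", nrows = 100000)                # same error
DT <- fread("bigfile.csv", nrows = 100000, nThread = 4)   # fewer threads: same error
DT <- fread("bigfile.csv", select = 1, nrows = 100000)    # one column, fewer rows: same error
```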

Is there a way around this? Due to server policies, I cannot install anything new, so my hands are tied to what is already installed on the server.

Thank you.

@mattdowle (Member) commented Apr 30, 2019

This part of the error message is reliable: "This is a 32bit process."

> sessionInfo tells me I am running R v3.5.3 on a 64bit platform.

Then either i) wherever you are running sessionInfo() is not the same place you're seeing fread() produce "this is a 32bit process", or ii) you're interpreting the output of sessionInfo() incorrectly. Please post the output of sessionInfo() as the issue template asks; we ask for that so we can help.

Here are two examples from sessionInfo() on Windows:

```
Platform: x86_64-w64-mingw32 (64-bit)
Platform: i386-w64-mingw32/i386 (32-bit)
```

It's the part in brackets at the end that's significant. You need to use the 64bit version of R on your server. Please confirm this solves it.
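To verify from inside the session, a quick check using base R (run it in the same session where fread fails):

```r
sessionInfo()$platform     # should end in "(64-bit)"
.Machine$sizeof.pointer    # 8 in a 64bit process, 4 in a 32bit one
```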

@parayamelo (Author)

Hi Matt,

Thank you for your message.

sessionInfo() indicates:

```
Platform: x86_64-w64-mingw32 (64-bit)
```

@mattdowle (Member) commented May 1, 2019

To double-check, please post the output of .Machine.
Can you produce the fread message that states "This is a 32bit process" in the same R session in which sessionInfo() reports 64bit and .Machine$sizeof.pointer returns 8? I suspect you are running R manually to get sessionInfo(), but the fread call is happening in another process: the two R processes are not being started in the same way, and that's causing one to be 32bit and one to be 64bit. Could that possibly be the case?
Also, please run test.data.table() and paste the full output.
Please also post the full output of sessionInfo().

You could also try .dynLibs() to find the location of the datatable.dll being loaded into your 64bit Windows R session and make sure it is a 64bit DLL. Maybe there's an installation problem on your server.
To give you an idea, here's some output for me below on Linux (it will be different for you on Windows). You would have to research how to find out on Windows whether a .dll is 32bit or 64bit. But I didn't think it was possible to load a 32bit DLL into a 64bit process on Windows.

```
> .dynLibs()
                                                   Filename Dynamic.Lookup
1                /usr/lib/R/library/methods/libs/methods.so          FALSE
2                    /usr/lib/R/library/utils/libs/utils.so          FALSE
3                    /usr/lib/R/library/tools/libs/tools.so          FALSE
4            /usr/lib/R/library/grDevices/libs/grDevices.so          FALSE
5              /usr/lib/R/library/graphics/libs/graphics.so          FALSE
6                    /usr/lib/R/library/stats/libs/stats.so          FALSE
7 /home/mdowle/build/revdeplib/data.table/libs/datatable.so          FALSE
> system("file /home/mdowle/build/revdeplib/data.table/libs/datatable.so")
/home/mdowle/build/revdeplib/data.table/libs/datatable.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=de6181181b58a2decc364521566d77b20a3d37d7, with debug_info, not stripped
>
```
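On Windows, a rough equivalent from within R (a sketch; getLoadedDLLs() is base R, and data.table's DLL is registered under the name "datatable"):

```r
library(data.table)
dll <- getLoadedDLLs()[["datatable"]]
dll[["path"]]
# On a 64bit Windows install the path typically ends in .../libs/x64/datatable.dll;
# a 32bit build would live under .../libs/i386/ instead.
```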

@parayamelo (Author)

I closed everything and opened it again in another R session. I am now getting:
"Opened 168.7GB file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available."
The error is the same if I try to fread the file with 1 column (select=c(1)) or a small number of rows (nrows=10000). I guess there is not enough virtual memory, but why is the issue the same when trying to read only a small number of rows?
I saw someone else posting the same issue:
#2321

.Machine$sizeof.pointer = 8

There is an issue with copy/paste from the server, so I have to post outputs manually.

@jangorecki (Member) commented May 1, 2019

So the initial issue related to 32bit was most likely the result of a different R process running in each case, as Matt suggested. For example, you might have one R set up in your PATH env var and a different R in your shortcut/alias.
So the problem narrows down to the inability to open just part of the file, since opening the whole file couldn't work due to the memory limitation. As you noticed, it is the same issue as #2321, which should already be resolved. Ultimately, providing the file would allow us to investigate this case further.
What you could try is to confirm that fread("file.csv", nrows=10000) raises the error while fread("head -n 10001 file.csv") does not (see the sketch below). Of course using 64bit; I suggest avoiding 32bit R if possible.
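Spelled out, the comparison looks like this ("file.csv" is a placeholder; the cmd= form makes the shell call explicit, and head needs a Unix-like shell on PATH):

```r
library(data.table)
DT1 <- fread("file.csv", nrows = 10000)        # memory maps the whole 170GB file first
DT2 <- fread(cmd = "head -n 10001 file.csv")   # the shell streams only the first 10001 lines
```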

@parayamelo (Author)

Yes, I guess there was an issue with different R processes.
Running a 64bit process, I get the same memory mapping error. When I do fread("file.csv", nrows=10000), I get "Opened 168.7GB file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available."
When I do fread("head -n 10001 file.csv"), I get the error "'head' is not recognized as an internal or external command".

@mattdowle (Member) commented May 1, 2019

Yes, glad you're using 64bit R ok now.
Given the file is 170GB and you have 64GB of RAM, you would probably have to select a subset of columns anyway, even if it did memory map.
You'll either need to split the single 170GB file up into pieces (by date, for example), use a server with more RAM (e.g. EC2 X1 with 1TB), or use a database. It's quite unheard of (and considered bad practice) to have a single file so large. A streaming split is sketched below.
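One way to split without ever holding the file in memory is to stream it through a connection (a minimal sketch; the file name, piece names, and 5-million-line chunk size are all illustrative, and on a Unix-like system the split utility would do the same job):

```r
# Stream "bigfile.csv" into ~5M-line pieces, repeating the header in each,
# so that only one chunk of lines is held in RAM at a time.
con <- file("bigfile.csv", open = "r")
header <- readLines(con, n = 1L)
piece <- 0L
repeat {
  chunk <- readLines(con, n = 5e6L)
  if (length(chunk) == 0L) break
  piece <- piece + 1L
  writeLines(c(header, chunk), sprintf("piece_%03d.csv", piece))
}
close(con)
```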

@parayamelo (Author)

Yes, I understand that I will have to select a subset of columns. When I do that with fread, I get the same error. So my question is: does fread always have to map the entire file to memory?

"It's quite unheard of (and considered bad practice) to have a single file so large." --> Totally agree; unfortunately, I do not generate this file. I am trying to get the file split before any analysis.

@jangorecki (Member)

@parayamelo Start by installing some bash for Windows; it makes life much easier and increases productivity, especially with software like R, and with most open source in general. One of these two should be best: https://stackoverflow.com/questions/771756/what-is-the-difference-between-cygwin-and-mingw

@parayamelo (Author)

Good idea @jangorecki. I will tell the people responsible for the server. I use Linux, so I am more used to bash commands. I am trying to replicate the same error on my Linux machine.

@mattdowle (Member) commented May 1, 2019

> So, my question is, does fread always have to map the entire file to memory?

Yes: virtual memory, though, and it needs to be a contiguous block. The error message correctly states: "There is probably not enough contiguous virtual memory available." It would be technically possible to memory map in chunks, but that would complicate the algorithm considerably. I think our time is better spent elsewhere.

I'd expect a 60-80GB file to memory map ok on your 64GB RAM server, using Windows virtual memory. But 170GB is almost 3x the RAM. There might be some Windows configuration settings you could investigate to increase virtual memory. Then select a small subset of columns, allowing room for the final data.table in RAM too, of course. Unlikely to work, but might be worth a shot; something like the sketch below.
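For the column-subset attempt, a minimal sketch (the column names are hypothetical; note fread still memory maps the whole file, it is only the resulting data.table that shrinks):

```r
DT <- fread("bigfile.csv", select = c("date", "id", "price"))
```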

@parayamelo (Author)

Thanks Matt! I will look for alternative solutions. And thanks for taking the time to look into my issue.

@xiaodaigh

Sorry for hijacking the thread.

@parayamelo perhaps you want to try disk.frame? http://diskframe.com It handles large datasets; please let me know if you run into bugs.

@parayamelo (Author)

Thank you @xiaodaigh. I will take a look at it.
