out_of_core configuration and documentation #3845

Open
Peji-moghimi opened this issue Dec 11, 2021 · 1 comment
Labels
  • documentation 📜: Updates and issues with the documentation
  • External: Pull requests and issues from people who do not regularly contribute to modin
  • Needs more information ❔: Issues that require more information from the reporter
  • P3: Very minor bugs, or features we can hopefully add some day.

Comments

@Peji-moghimi

Peji-moghimi commented Dec 11, 2021

System information

  • OS Distribution: CentOS Linux 7
  • Memory: 376 GB DDR4
  • Modin version: 0.9.1
  • Ray version: 1.1.0
  • Python version: 3.5.1

I'm still unsure as to what the documentation is suggesting here.

Does the line below, as it stands, disable out-of-core, or does it only disable out-of-core when _plasma_directory=None? If it does disable out-of-core as it stands, how does one specify a desired directory for spilling, instead of the default spilling directory (as I cannot use the default)?

ray.init(_plasma_directory="/tmp") # setting to disable out of core in Ray

Currently, the following is my setup:

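import os  # needed for the os.environ assignments below
import ray

def ray_init():
    # Both memory limits are given in bytes (3000000000000 bytes = 3 TB).
    ray.init(_temp_dir="/some/specific/path/ray/tmp/",
             _plasma_directory="/some/specific/path/ray/",
             _memory=3000000000000,
             object_store_memory=3000000000000)
    # Modin settings, set before modin.pandas is imported.
    os.environ["MODIN_ENGINE"] = "ray"
    os.environ["MODIN_OUT_OF_CORE"] = "true"
    os.environ["MODIN_MEMORY"] = "365000000000"
    return None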

ray_init()

import modin.pandas as pd
from modin.config import ProgressBar
ProgressBar.enable()

df_128gb = pd.read_csv('df_128gb.csv', low_memory=True, memory_map=True)

I just want to know whether this is the most memory-efficient setup: one that prevents my program from running out of RAM by spilling to disk, no matter how large the dataframe is (within the bounds of my disk space). Does that also extend to very expensive operations such as merge?

I would really appreciate it if you could settle this for me.

Thanks!
Pej

Originally posted by @Peji-moghimi in #3705 (comment)

@devin-petersohn
Collaborator

Hi @Peji-moghimi, thanks for the email. You are right, we should put more clarity in the docs.

Setting _plasma_directory at all disables the built-in out of core in Ray, but the store is still a memory-mapped file, so it can still use the disk. The operating system pages data to and from memory, which still works fine. We should make the docs clearer on this.
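
To illustrate what "memory mapped file" means here, below is a minimal sketch using only Python's standard mmap module. It is not Ray's implementation, and the helper name roundtrip_through_mmap is hypothetical. It only shows the general mechanism: the data lives in a file inside a directory you choose, and the operating system decides which pages are resident in RAM.

```python
import mmap
import os
import tempfile


def roundtrip_through_mmap(payload: bytes, directory: str) -> bytes:
    """Write payload into a file-backed memory map and read it back.

    Hypothetical helper for illustration only; payload must be non-empty.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # The backing file has to be at least as large as the mapping.
        os.ftruncate(fd, len(payload))
        with mmap.mmap(fd, len(payload)) as mm:
            # Writes land in the page cache; the OS may write them to disk
            # and evict the pages from RAM when memory is tight.
            mm[:] = payload
            mm.flush()
            return bytes(mm[:])
    finally:
        os.close(fd)
        os.remove(path)


with tempfile.TemporaryDirectory() as scratch_dir:
    data = b"x" * 4096
    assert roundtrip_through_mmap(data, scratch_dir) == data
```

How well this performs once the data no longer fits in RAM depends on the operating system and the disk behind the directory.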

I think your ray.init looks fine. Are you running into any specific problem?
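
On the question of choosing a custom spill directory while leaving Ray's built-in out of core enabled (that is, without setting _plasma_directory), one possible sketch follows. It assumes a Ray release that accepts an object_spilling_config entry in _system_config, which Ray's documentation describes for later versions; this may not apply to Ray 1.1.0. The helper name spill_init_kwargs, the path, and the byte count are hypothetical placeholders.

```python
import json


def spill_init_kwargs(spill_dir: str, object_store_bytes: int) -> dict:
    """Build keyword arguments for ray.init that point spilling at spill_dir.

    Hypothetical helper; assumes the installed Ray supports
    `object_spilling_config` inside `_system_config`.
    """
    spilling = {"type": "filesystem", "params": {"directory_path": spill_dir}}
    return {
        "object_store_memory": object_store_bytes,
        # Ray expects the spilling configuration as a JSON string.
        "_system_config": {"object_spilling_config": json.dumps(spilling)},
    }


kwargs = spill_init_kwargs("/some/specific/path/ray/spill", 100 * 10**9)

# Left commented out so the sketch runs without Ray installed:
# import ray
# ray.init(**kwargs)
```

Keeping the argument construction separate from ray.init makes it easy to print and check the configuration before starting a cluster.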

@anmyachev anmyachev added the documentation 📜 Updates and issues with the documentation label Apr 21, 2022
@vnlitvinov vnlitvinov added Needs more information ❔ Issues that require more information from the reporter P3 Very minor bugs, or features we can hopefully add some day. labels Aug 26, 2022
@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023

4 participants