Orca: can't save tensorflow model on yarn cluster #7411

Open
zpeng1898 opened this issue Feb 2, 2023 · 8 comments
@zpeng1898
Contributor

zpeng1898 commented Feb 2, 2023

Problem

When I try to save a TensorFlow model (which is a directory) to a remote HDFS directory on a YARN cluster (cluster_mode="yarn-client") using estimator.save(remote_model_path), it fails with mkdir: Permission denied.
So I used TemporaryDirectory to save the model to a local temporary directory first and then upload it to the remote directory, but the saved model cannot be found in the temporary directory.
However, saving works on my laptop with estimator.save(os.path.join(model_dir, model_name)) and cluster_mode="local".

  • use est.save(remote_model_dir) directly

code:

remote_model_dir = "hdfs://172.16.0.105:8020/user/kai/pzy/data/NCF"
remote_model_path = os.path.join(remote_model_dir, "NCF_model")
estimator.save(remote_model_path)

error message:

(Worker pid=125051, ip=172.16.0.116) mkdir: Permission denied: user=yarn, access=WRITE, inode="/user/kai/pzy/data/NCF":kai:supergroup:drwxr-xr-x
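The error message contains everything needed to see why the write is refused: the workers run as user yarn, while the directory belongs to kai:supergroup with mode drwxr-xr-x. As a rough illustration only, the check can be modelled as below. This is a simplified POSIX-style sketch that ignores ACLs and the HDFS superuser; the function name `can_write` and the group membership of yarn are assumptions, not taken from the thread.

```python
def can_write(user, user_groups, owner, group, mode):
    """Simplified POSIX-style write check for a directory.

    mode is the 10-character string from the error, e.g. "drwxr-xr-x".
    ACLs and the HDFS superuser are deliberately ignored.
    """
    perms = mode[1:]  # drop the leading file-type character
    if user == owner:
        return perms[1] == "w"   # owner write bit
    if group in user_groups:
        return perms[4] == "w"   # group write bit
    return perms[7] == "w"       # "other" write bit


# Values from the error message; the groups of "yarn" are an assumption.
print(can_write("yarn", {"hadoop"}, "kai", "supergroup", "drwxr-xr-x"))
print(can_write("kai", {"supergroup"}, "kai", "supergroup", "drwxr-xr-x"))
```

Under these assumptions yarn falls into the "other" class, which has no write bit, so creating the model directory fails.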

  • use TemporaryDirectory and then put it to remote_model_dir

code:

def save_tf_model(estimator, model_dir="hdfs://172.16.0.105:8020/user/kai/pzy/data/NCF", model_name="NCF_model"):
    if is_local_path(model_dir):  # save the model to local directory
        estimator.save(os.path.join(model_dir, model_name))
    else:  # save the model to remote directory
        with tempfile.TemporaryDirectory() as tmpdirname:
            local_dir = os.path.join(tmpdirname, model_name)
            remote_dir = os.path.join(model_dir)
            estimator.save(local_dir)
            put_local_dir_tree_to_remote(local_dir, remote_dir)

error message: (posted as a screenshot; the image is not included in this text)
The error occurs in the line put_local_dir_tree_to_remote(local_dir, remote_dir) because it can't find the model directory saved after estimator.save(local_dir).
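A guard before the upload step would at least make this failure explicit. The following is a minimal sketch, not part of Orca: `upload_saved_model` and the `upload_fn` parameter are hypothetical names, and the uploader used here is a stand-in that only records its arguments. As explained later in the thread, the directory is missing because the model is written on the rank 0 Ray worker rather than on the driver.

```python
import os
import tempfile


def upload_saved_model(local_dir, remote_dir, upload_fn):
    """Call upload_fn only if local_dir exists on this machine."""
    if not os.path.isdir(local_dir):
        raise FileNotFoundError(
            f"{local_dir} does not exist on this host; the model may have "
            "been written on a different node (for example a remote worker)."
        )
    upload_fn(local_dir, remote_dir)


# Usage with a stand-in uploader that only records its arguments.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "NCF_model")  # never created on this host
    try:
        upload_saved_model(missing, "hdfs://host/path",
                           lambda src, dst: calls.append((src, dst)))
    except FileNotFoundError as err:
        print("skipped upload:", err)
```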

@hkvision hkvision added the orca label Feb 2, 2023
@sgwhat sgwhat self-assigned this Feb 3, 2023
@hkvision
Contributor

hkvision commented Feb 6, 2023

This seems to need fixing. Have we tested this before? @sgwhat

@sgwhat
Contributor

sgwhat commented Feb 6, 2023

> This seems to need fixing. Have we tested this before? @sgwhat

Of course, it has been tested before; the problem needs to be reproduced first.

@sgwhat
Contributor

sgwhat commented Feb 9, 2023

  1. The model is saved on the rank 0 Ray worker, so it is not appropriate to use the following approach

    estimator.save(local_dir)
    put_local_dir_tree_to_remote(local_dir, remote_dir)

    to save a model on the driver and then send it to HDFS. Currently, we don't plan to support saving models to local disk when
    running on YARN.

  2. When you used est.save(remote_model_dir) directly, the error you hit was a permission issue. Try hdfs dfs -chmod -R 777 /path/to/ncf.
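If the permission change needs to be scripted, the command above can be assembled in Python. This is only a sketch: `hdfs_chmod_command` is a hypothetical helper, and actually running the command requires a Hadoop client on the machine and an account that owns the directory or is an HDFS superuser.

```python
import shlex


def hdfs_chmod_command(path, mode="777", recursive=True):
    """Build the argument list for `hdfs dfs -chmod`."""
    cmd = ["hdfs", "dfs", "-chmod"]
    if recursive:
        cmd.append("-R")  # apply the mode to the whole directory tree
    cmd += [mode, path]
    return cmd


cmd = hdfs_chmod_command("/path/to/ncf")
print(shlex.join(cmd))
# To execute on a machine with the Hadoop client installed:
# import subprocess; subprocess.run(cmd, check=True)
```

Mode 777 grants write access to every user; whether a narrower mode is sufficient depends on which account the workers run as.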

@zpeng1898
Contributor Author

Thanks. Yes, the problem has been solved by using hdfs dfs -chmod -R 777 /path/to/ncf, and now the program can run normally with est.save(remote_model_dir).

@sgwhat
Contributor

sgwhat commented Feb 9, 2023

To do (for myself): add this issue to the Orca known issues.

@hkvision
Contributor

  1. If users want to save a TensorFlow SavedModel, which is a directory, to HDFS, they may need to grant write access to the target directory.
    Can this be added to the known issues?
  2. For yarn-client mode, it may be reasonable to support saving the model to the driver:
    • The Spark backend already supports this.
    • With the Ray backend, the model is currently saved only on the rank 0 worker, which users cannot access. Maybe we can fetch the model to the driver and save it there in this case?

However, since users running on a cluster will most likely save to remote storage, this is not a high priority.
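One way to picture the "fetch the model to the driver" idea is to pack the saved directory into bytes on the worker and unpack it on the driver. The sketch below only illustrates that idea and is not Orca's implementation: `pack_dir` and `unpack_dir` are hypothetical helpers, the worker and driver are simulated with two temporary directories, and the file contents are placeholders. With Ray, the bytes would presumably be the return value of a remote task.

```python
import io
import os
import tarfile
import tempfile


def pack_dir(path):
    """Serialize a directory tree into gzip-compressed tar bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(path, arcname=os.path.basename(path))
    return buf.getvalue()


def unpack_dir(data, dest):
    """Recreate the packed directory tree under dest."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(dest)


# Simulate the worker and the driver with two temporary directories.
with tempfile.TemporaryDirectory() as worker, tempfile.TemporaryDirectory() as driver:
    model_dir = os.path.join(worker, "NCF_model")
    os.makedirs(os.path.join(model_dir, "variables"))
    with open(os.path.join(model_dir, "saved_model.pb"), "wb") as f:
        f.write(b"placeholder")
    payload = pack_dir(model_dir)  # would run on the rank 0 worker
    unpack_dir(payload, driver)    # would run on the driver
    restored = sorted(os.listdir(os.path.join(driver, "NCF_model")))
    print(restored)
```

Shipping the whole tree as one byte string keeps the sketch short; for large models a streamed transfer or shared storage would likely be a better fit.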

@sgwhat
Contributor

sgwhat commented Feb 14, 2023

>   1. If users want to save a TensorFlow SavedModel, which is a directory, to HDFS, they may need to grant write access to the target directory.
>     Can this be added to the known issues?
>
>   2. For yarn-client mode, it may be reasonable to support saving the model to the driver:
>
>     • The Spark backend already supports this.
>     • With the Ray backend, the model is currently saved only on the rank 0 worker, which users cannot access. Maybe we can fetch the model to the driver and save it there in this case?
>
> However, since users running on a cluster will most likely save to remote storage, this is not a high priority.

  1. Yes, this needs to be added to the known issues.
  2. For yarn-client, it is possible to save a model on the driver; we could do this as a next step.

@hkvision
Contributor

#7623 changed the TF save format to h5 to avoid the folder permission issue.
