Running Ray on a cluster is still experimental.
Ray can be used in several ways. In addition to running on a single machine, Ray is designed to run on a cluster of machines. This document is about how to use Ray on a cluster.
This section describes how to start a cluster on EC2. These instructions are copied and adapted from https://github.com/amplab/spark-ec2.
- Create an Amazon EC2 key pair for yourself. This can be done by logging into
your Amazon Web Services account through the AWS
console, clicking Key Pairs on the left
sidebar, and creating and downloading a key. Make sure that you set the
permissions for the private key file to
600
(i.e. only you can read and write it) so thatssh
will work. - Whenever you want to use the
ec2.py
script, set the environment variablesAWS_ACCESS_KEY_ID
andAWS_SECRET_ACCESS_KEY
to your Amazon EC2 access key ID and secret access key. These can be obtained from the AWS homepage by clicking Account > Security Credentials > Access Credentials.
-
Go into the
ray/scripts
directory. -
Run
python ec2.py -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
, where<keypair>
is the name of your EC2 key pair (that you gave it when you created it),<key-file>
is the private key file for your key pair,<num-slaves>
is the number of slave nodes to launch (try 1 at first), and<cluster-name>
is the name to give to your cluster.For example:
export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123 python ec2.py --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 launch my-ray-cluster
The following options are worth pointing out:
--instance-type=<instance-type>
can be used to specify an EC2 instance type to use. For now, the script only supports 64-bit instance types, and the default type ism3.large
(which has 2 cores and 7.5 GB RAM).--region=<ec2-region>
specifies an EC2 region in which to launch instances. The default region isus-east-1
.--zone=<ec2-zone>
can be used to specify an EC2 availability zone to launch instances in. Sometimes, you will get an error because there is not enough capacity in one zone, and you should try to launch in another.--spot-price=<price>
will launch the worker nodes as Spot Instances, bidding for the given maximum price (in dollars).
These instructions work on EC2, but they may require some modifications to run
on your own cluster. In particular, on EC2, running sudo
does not require a
password, and we currently don't handle the case where a password is needed.
-
If you launched a cluster using the
ec2.py
script from the previous section, then the fileray/scripts/nodes.txt
will already have been created. Otherwise, create a filenodes.txt
of the IP addresses of the nodes in the cluster. For example12.34.56.789 12.34.567.89
The first node in the file is the "head" node. The scheduler will be started on the head node, and the driver should run on the head node as well.
-
Make sure that the nodes can all communicate with one another. On EC2, this can be done by creating a new security group and adding the inbound rule "all traffic" and adding the outbound rule "all traffic". Then add all of the nodes in your cluster to that security group.
-
Run something like
python scripts/cluster.py --nodes nodes.txt \ --key-file key.pem \ --username ubuntu \ --installation-directory /home/ubuntu/
where you replace nodes.txt
, key.pem
, ubuntu
, and /home/ubuntu/
by the
appropriate values. This assumes that you can connect to each IP address
<ip-address>
in nodes.txt
with the command
ssh -i <key-file> <username>@<ip-address>
4. The previous command should open a Python interpreter. To install Ray on the
cluster, run cluster.install_ray()
in the interpreter. The interpreter should
block until the installation has completed. The standard output from the nodes
will be redirected to your terminal.
5. To check that the installation succeeded, you can ssh to each node, cd into
the directory ray/test/
, and run the tests (e.g., python runtest.py
).
6. Start the cluster (the scheduler, object stores, and workers) with the
command cluster.start_ray("~/example_ray_code")
, where the argument is
the local path to the directory that contains your Python code. This command will
copy this source code to each node and will start the cluster. Each worker that
is started will have a local copy of the ~/example_ray_code directory in their
PYTHONPATH. After completing successfully, you can connect to the ssh to the
head node and attach a shell to the cluster by first running the following code.
cd "$RAY_HOME/../user_source_files/example_ray_code"; source "$RAY_HOME/setup-env.sh";
Then within Python, run the following.
python import ray ray.init(scheduler_address="12.34.56.789:10001", objstore_address="12.34.56.789:20001", driver_address="12.34.56.789:30001")
-
Note that there are several more commands that can be run from within
cluster.py
.cluster.install_ray()
- This pulls the Ray source code on each node, builds all of the third party libraries, and builds the project itself.cluster.start_ray(user_source_directory, num_workers_per_node=10)
- This starts a scheduler process on the head node, and it starts an object store and some workers on each node.cluster.stop_ray()
- This shuts down the cluster (killing all of the processes).cluster.update_ray()
- This pulls the latest Ray source code and builds it.
Once you've started a Ray cluster using the above instructions, to actually use
Ray, ssh to the head node (the first node listed in the nodes.txt
file) and
attach a shell to the already running cluster.