This document provides a walkthrough of the L-BFGS example. To run the application, first install these dependencies.
- SciPy
- TensorFlow
Then, from the directory ray/examples/lbfgs/, run the following.
python driver.py
Optimization is at the heart of many machine learning algorithms. Much of machine learning involves specifying a loss function and finding the parameters that minimize the loss. If we can compute the gradient of the loss function, then we can apply a variety of gradient-based optimization algorithms. L-BFGS is one such algorithm. It is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient manner.
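As a small illustration of the idea, the sketch below runs SciPy's L-BFGS implementation on the two-dimensional Rosenbrock function, which is not part of the original example and is used here only because its minimum at (1, 1) is known. The functions scipy.optimize.rosen and scipy.optimize.rosen_der supply the loss and its gradient; the starting point is an arbitrary choice.

```python
import numpy as np
import scipy.optimize

# Arbitrary starting point (an assumption for this sketch).
theta_start = np.array([-1.2, 1.0])

# L-BFGS only needs the loss and its gradient; it builds its own
# approximation of the inverse Hessian from successive gradients.
theta_opt, loss_opt, info = scipy.optimize.fmin_l_bfgs_b(
    scipy.optimize.rosen, theta_start, fprime=scipy.optimize.rosen_der
)

print("minimizer:", theta_opt)
print("loss at minimizer:", loss_opt)
print("iterations:", info["nit"], "function calls:", info["funcalls"])
```

No second derivatives are passed in; the optimizer works from function values and gradients alone.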
First we load the data in batches. Here, each element in batches is a tuple whose first component is a batch of 100 images and whose second component is a batch of the 100 corresponding labels. For simplicity, we use TensorFlow's built-in methods for loading the data.
# Note: the tensorflow.examples.tutorials module is only available in TensorFlow 1.x.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
batch_size = 100
num_batches = mnist.train.num_examples // batch_size
batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]
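To make the structure of batches concrete without depending on a particular TensorFlow version, here is a sketch that builds a synthetic stand-in with the same layout: flattened 28x28 images and one-hot labels over 10 classes. The random data, the helper make_batch, and the choice of 5 batches are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 100
num_batches = 5          # hypothetical; the real training set yields many more
num_pixels = 28 * 28     # a flattened MNIST image
num_classes = 10


def make_batch():
    # Hypothetical helper: random "images" and random one-hot labels.
    xs = rng.random((batch_size, num_pixels), dtype=np.float32)
    labels = rng.integers(0, num_classes, size=batch_size)
    ys = np.eye(num_classes)[labels]  # one row per example, a single 1 per row
    return xs, ys


batches = [make_batch() for _ in range(num_batches)]

xs, ys = batches[0]
print(xs.shape, ys.shape)  # (100, 784) (100, 10)
```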
Now, suppose we have defined a function which takes a set of model parameters theta and a batch of data (both images and labels) and computes the loss for that choice of model parameters on that batch of data. Similarly, suppose we've also defined a function that takes the same arguments and computes the gradient of the loss for that choice of model parameters.
def loss(theta, xs, ys):
    # compute the loss on a batch of data (model-specific code omitted)
    return batch_loss
def grad(theta, xs, ys):
    # compute the gradient on a batch of data (model-specific code omitted)
    return batch_grad
def full_loss(theta):
    # compute the loss on the full data set by summing over the batches
    return sum([loss(theta, xs, ys) for (xs, ys) in batches])
def full_grad(theta):
    # compute the gradient on the full data set by summing over the batches
    return sum([grad(theta, xs, ys) for (xs, ys) in batches])
Since we are working with a small dataset, we don't actually need to separate these methods into the part that operates on a batch and the part that operates on the full dataset, but doing so will make the distributed version clearer.
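The walkthrough leaves the bodies of loss and grad unspecified. One possible choice, shown here purely as a sketch, is a linear softmax classifier with a summed cross-entropy loss; the model, the small sizes num_features and num_classes, and the random data are all assumptions and not necessarily what driver.py does. The gradient is compared against finite differences with scipy.optimize.check_grad.

```python
import numpy as np
import scipy.optimize

num_features, num_classes = 5, 3  # small hypothetical sizes for a quick check


def loss(theta, xs, ys):
    # Summed cross-entropy of a linear softmax classifier on one batch.
    W = theta.reshape(num_features, num_classes)
    logits = xs @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(ys * log_probs).sum()


def grad(theta, xs, ys):
    # Gradient of the loss above with respect to theta (flattened W).
    W = theta.reshape(num_features, num_classes)
    logits = xs @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return (xs.T @ (probs - ys)).ravel()


rng = np.random.default_rng(0)
xs = rng.normal(size=(20, num_features))
ys = np.eye(num_classes)[rng.integers(0, num_classes, size=20)]  # one-hot labels
theta = 1e-2 * rng.normal(size=num_features * num_classes)

# Norm of the difference between the analytic and finite-difference gradients.
error = scipy.optimize.check_grad(loss, grad, theta, xs, ys)
print("gradient check error:", error)
```

The loss is summed rather than averaged over the batch so that adding the per-batch values gives the loss on the full dataset, which is what full_loss and full_grad rely on.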
Now, if we wish to optimize the loss function using L-BFGS, we simply plug these functions, along with an initial choice of model parameters, into scipy.optimize.fmin_l_bfgs_b.
import numpy as np
import scipy.optimize
theta_init = 1e-2 * np.random.normal(size=dim)  # dim is the number of model parameters
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)
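The following sketch shows the serial pattern end to end on a toy problem. A least-squares model and synthetic, noiseless batches stand in for the MNIST model and data (both are assumptions for illustration), so the optimizer should recover the parameters theta_true that generated the labels. It also shows that fmin_l_bfgs_b returns a tuple of the minimizer, the final loss, and a dictionary of diagnostics.

```python
import numpy as np
import scipy.optimize

rng = np.random.default_rng(0)
dim = 4
theta_true = rng.normal(size=dim)

# Hypothetical stand-in for the MNIST batches: 10 batches of 50 examples each.
batches = []
for _ in range(10):
    xs = rng.normal(size=(50, dim))
    batches.append((xs, xs @ theta_true))


def loss(theta, xs, ys):
    # Summed squared error on one batch.
    residual = xs @ theta - ys
    return 0.5 * residual @ residual


def grad(theta, xs, ys):
    # Gradient of the summed squared error on one batch.
    return xs.T @ (xs @ theta - ys)


def full_loss(theta):
    return sum(loss(theta, xs, ys) for (xs, ys) in batches)


def full_grad(theta):
    return sum(grad(theta, xs, ys) for (xs, ys) in batches)


theta_init = 1e-2 * rng.normal(size=dim)
theta_opt, final_loss, info = scipy.optimize.fmin_l_bfgs_b(
    full_loss, theta_init, fprime=full_grad
)

print("recovered parameters:", theta_opt)
print("final loss:", final_loss)
print("warnflag:", info["warnflag"])  # 0 indicates convergence
```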
In this example, the computation of the gradient itself can be done in parallel on a number of workers or machines.
First, let's turn the data into a collection of remote objects.
batch_ids = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]
We can load the data on the driver and distribute it this way because MNIST easily fits on a single machine. However, for larger data sets, we will need to use remote functions to distribute the loading of the data.
Now, let's turn loss and grad into remote functions.
@ray.remote
def loss(theta, xs, ys):
    # compute the loss (model-specific code omitted)
    return batch_loss
@ray.remote
def grad(theta, xs, ys):
    # compute the gradient (model-specific code omitted)
    return batch_grad
The only difference is that we added the @ray.remote decorator.
Now, it is easy to speed up the computation of the full loss and the full gradient.
def full_loss(theta):
    theta_id = ray.put(theta)
    loss_ids = [loss.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
    return sum(ray.get(loss_ids))
def full_grad(theta):
    theta_id = ray.put(theta)
    grad_ids = [grad.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
    # This conversion is necessary for use with fmin_l_bfgs_b.
    return sum(ray.get(grad_ids)).astype("float64")
Note that we turn theta into a remote object with the line theta_id = ray.put(theta) before passing it into the remote functions. If we had written
[loss.remote(theta, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
instead of
theta_id = ray.put(theta)
[loss.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
then each task that got sent to the scheduler (one for every element of batch_ids) would have had a copy of theta serialized inside of it. Since theta here consists of the parameters of a potentially large model, this is inefficient. Large objects should be passed by object ID to remote functions and not by value.
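A rough back-of-the-envelope sketch of the cost described above: it uses the standard pickle module as a proxy for serialized size, which is an assumption, since Ray's own serialization and argument handling differ in detail. The parameter count and the number of tasks are hypothetical.

```python
import pickle

import numpy as np

theta = np.zeros(100_000)  # hypothetical parameter vector (800 kB of float64)
num_tasks = 550            # hypothetical: one task per batch

bytes_per_copy = len(pickle.dumps(theta))

# By value: every task description carries its own serialized copy of theta.
by_value_total = num_tasks * bytes_per_copy
# By object ID: theta is serialized once and each task carries only a small ID.
by_id_total = bytes_per_copy

print(f"by value: {by_value_total / 1e6:.1f} MB")
print(f"by ID:    {by_id_total / 1e6:.1f} MB (plus one small ID per task)")
```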
We use remote functions and remote objects internally in the implementation of full_loss and full_grad, but the user-facing behavior of these methods is identical to the behavior in the serial version.
We can now optimize the objective with the same function call as before.
theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)