Skip to content

Latest commit

 

History

History
150 lines (114 loc) · 4.99 KB

README.md

File metadata and controls

150 lines (114 loc) · 4.99 KB

Batch L-BFGS

This document provides a walkthrough of the L-BFGS example. To run the application, first install these dependencies.

Then from the directory ray/examples/lbfgs/ run the following.

python driver.py

Optimization is at the heart of many machine learning algorithms. Much of machine learning involves specifying a loss function and finding the parameters that minimize the loss. If we can compute the gradient of the loss function, then we can apply a variety of gradient-based optimization algorithms. L-BFGS is one such algorithm. It is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient manner.

The serial version

First we load the data in batches. Here, each element in batches is a tuple whose first component is a batch of 100 images and whose second component is a batch of the 100 corresponding labels. For simplicity, we use TensorFlow's built in methods for loading the data.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
batch_size = 100
num_batches = mnist.train.num_examples // batch_size
batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]

Now, suppose we have defined a function which takes a set of model parameters theta and a batch of data (both images and labels) and computes the loss for that choice of model parameters on that batch of data. Similarly, suppose we've also defined a function that takes the same arguments and computes the gradient of the loss for that choice of model parameters.

def loss(theta, xs, ys):
  # compute the loss on a batch of data
  return loss

def grad(theta, xs, ys):
  # compute the gradient on a batch of data
  return grad

def full_loss(theta):
  # compute the loss on the full data set
  return sum([loss(theta, xs, ys) for (xs, ys) in batches])

def full_grad(theta):
  # compute the gradient on the full data set
  return sum([grad(theta, xs, ys) for (xs, ys) in batches])

Since we are working with a small dataset, we don't actually need to separate these methods into the part that operates on a batch and the part that operates on the full dataset, but doing so will make the distributed version clearer.

Now, if we wish to optimize the loss function using L-BFGS, we simply plug these functions, along with an initial choice of model parameters, into scipy.optimize.fmin_l_bfgs_b.

theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)

The distributed version

In this example, the computation of the gradient itself can be done in parallel on a number of workers or machines.

First, let's turn the data into a collection of remote objects.

batch_ids = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]

We can load the data on the driver and distribute it this way because MNIST easily fits on a single machine. However, for larger data sets, we will need to use remote functions to distribute the loading of the data.

Now, lets turn loss and grad into remote functions.

@ray.remote
def loss(theta, xs, ys):
  # compute the loss
  return loss

@ray.remote
def grad(theta, xs, ys):
  # compute the gradient
  return grad

The only difference is that we added the @ray.remote decorator.

Now, it is easy to speed up the computation of the full loss and the full gradient.

def full_loss(theta):
  theta_id = ray.put(theta)
  loss_ids = [loss.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
  return sum(ray.get(loss_ids))

def full_grad(theta):
  theta_id = ray.put(theta)
  grad_ids = [grad.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
  return sum(ray.get(grad_ids)).astype("float64") # This conversion is necessary for use with fmin_l_bfgs_b.

Note that we turn theta into a remote object with the line theta_id = ray.put(theta) before passing it into the remote functions. If we had written

[loss.remote(theta, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]

instead of

theta_id = ray.put(theta)
[loss.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]

then each task that got sent to the scheduler (one for every element of batch_ids) would have had a copy of theta serialized inside of it. Since theta here consists of the parameters of a potentially large model, this is inefficient. Large objects should be passed by object ID to remote functions and not by value.

We use remote functions and remote objects internally in the implementation of full_loss and full_grad, but the user-facing behavior of these methods is identical to the behavior in the serial version.

We can now optimize the objective with the same function call as before.

theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)