Skip to content

Latest commit

 

History

History
153 lines (117 loc) · 5.34 KB

README.md

File metadata and controls

153 lines (117 loc) · 5.34 KB

Batch L-BFGS

This document provides a walkthrough of the L-BFGS example. To run the application, first install these dependencies.

Then from the directory ray/examples/lbfgs/ run the following.

source ../../setup-env.sh
python driver.py

Optimization is at the heart of many machine learning algorithms. Much of machine learning involves specifying a loss function and finding the parameters that minimize the loss. If we can compute the gradient of the loss function, then we can apply a variety of gradient-based optimization algorithms. L-BFGS is one such algorithm. It is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient manner.

The serial version

First we load the data in batches. Here, each element in batches is a tuple whose first component is a batch of 100 images and whose second component is a batch of the 100 corresponding labels. For simplicity, we use TensorFlow's built in methods for loading the data.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
batch_size = 100
num_batches = mnist.train.num_examples / batch_size
batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]

Now, suppose we have defined a function which takes a set of model parameters theta and a batch of data (both images and labels) and computes the loss for that choice of model parameters on that batch of data. Similarly, suppose we've also defined a function that takes the same arguments and computes the gradient of the loss for that choice of model parameters.

def loss(theta, xs, ys):
  # compute the loss on a batch of data
  return loss

def grad(theta, xs, ys):
  # compute the gradient on a batch of data
  return grad

def full_loss(theta):
  # compute the loss on the full data set
  return sum([loss(theta, xs, ys) for (xs, ys) in batches])

def full_grad(theta):
  # compute the gradient on the full data set
  return sum([grad(theta, xs, ys) for (xs, ys) in batches])

Since we are working with a small dataset, we don't actually need to separate these methods into the part that operates on a batch and the part that operates on the full dataset, but doing so will make the distributed version clearer.

Now, if we wish to optimize the loss function using L-BFGS, we simply plug these functions, along with an initial choice of model parameters, into scipy.optimize.fmin_l_bfgs_b.

theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)

The distributed version

In this example, the computation of the gradient itself can be done in parallel on a number of workers or machines.

First, let's turn the data into a collection of remote objects.

batch_refs = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]

We can load the data on the driver and distribute it this way because MNIST easily fits on a single machine. However, for larger data sets, we will need to use remote functions to distribute the loading of the data.

Now, lets turn loss and grad into remote functions.

@ray.remote([np.ndarray, np.ndarray, np.ndarray], [float])
def loss(theta, xs, ys):
  # compute the loss
  return loss

@ray.remote([np.ndarray, np.ndarray, np.ndarray], [np.ndarray])
def grad(theta, xs, ys):
  # compute the gradient
  return grad

The only difference is that we added the @ray.remote decorator specifying a little bit of type information (the inputs consist of numpy arrays, loss returns a float, and grad returns a numpy array).

Now, it is easy to speed up the computation of the full loss and the full gradient.

def full_loss(theta):
  theta_ref = ray.put(theta)
  loss_refs = [loss.remote(theta_ref, xs_ref, ys_ref) for (xs_ref, ys_ref) in batch_refs]
  return sum([ray.get(loss_ref) for loss_ref in loss_refs])

def full_grad(theta):
  theta_ref = ray.put(theta)
  grad_refs = [grad.remote(theta_ref, xs_ref, ys_ref) for (xs_ref, ys_ref) in batch_refs]
  return sum([ray.get(grad_ref) for grad_ref in grad_refs]).astype("float64") # This conversion is necessary for use with fmin_l_bfgs_b.

Note that we turn theta into a remote object with the line theta_ref = ray.put(theta) before passing it into the remote functions. If we had written

[loss.remote(theta, xs_ref, ys_ref) for (xs_ref, ys_ref) in batch_refs]

instead of

theta_ref = ray.put(theta)
[loss.remote(theta_ref, xs_ref, ys_ref) for (xs_ref, ys_ref) in batch_refs]

then each task that got sent to the scheduler (one for every element of batch_refs) would have had a copy of theta serialized inside of it. Since theta here consists of the parameters of a potentially large model, this is inefficient. Large objects should be passed by object reference to remote functions and not by value.

We use remote functions and remote objects internally in the implementation of full_loss and full_grad, but the user-facing behavior of these methods is identical to the behavior in the serial version.

We can now optimize the objective with the same function call as before.

theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)