An implementation of the imperative learning to search framework [1] in pytorch, compatible with automatic differentiation, for deep learning-based structured prediction and reinforcement learning.
[1] http://hal3.name/docs/daume16compiler.pdf
The basic structure is:
macarico/
base.py defines the abstract classes used for maçarico,
such as Env, Policy, Features, Learner, Attention
annealing.py tools for annealing, useful for e.g. DAgger
util.py basic utility functions
tasks/ example tasks, such as: sequence_labeler,
dependency_parser, sequence2sequence, etc. all of
these define an Env that can be run
features/ contains example static features and dynamic features
sequence.py defines two types of static features: RNNFeatures
(obtained by running an RNN over the input) and
BOWFeatures (simple bag of words). also defines
useful attention models over sequences.
actor.py defines two types of dynamic features: TransitionRNN
(which is an actor that has an RNN-like hidden state),
and TransitionBOW (which has no hidden state and
instead just conditions on the previous actions)
policies/ currently only implements a linear policy
lts/ various learning to search algorithms, such as:
maximum_likelihood, dagger, reinforce,
aggrevate and LOLS
tests/
run_tests.sh run all (or some) of the tests, compare the outputs
to previous versions to make sure you didn't botch
anything. (*please* run this before pushing changes.)
test_util.py some utilities for running tests, such as train/eval
loops, printing outputs, etc.
nlp_data.py generate or load data for various natural language
processing tasks. (requires external data.)
test_X.py various tests for different parts of maçarico. if you
develop something new, please create a test!
output/ outputs from previous runs
Take a look at existing tasks.
Create a new mytask.py file in macarico/tasks that defines:

- an `Example` class, which contains labeled examples for your task. This class must define a `mk_env` function that returns an environment (`Env`) particular to this task.
- an `Env` class, which defines how your environment works. It must provide `run_episode` and `loss` at the minimum. For some learning algorithms, it must also provide `rewind` (mostly for efficiency) and/or `reference`.
- if none of the existing attention models make sense for your task (as is the case for, e.g., `DependencyParser`), define your own `Attention` mechanism.
- make a test case in tests/test_mytask.py that tests it.
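Concretely, a new task module might look roughly like the following sketch. This is illustrative only: the class and method names follow the conventions above, but the exact macarico base-class signatures may differ, and the toy environment (one action per token, Hamming loss) is invented for this example.

```python
# Hypothetical skeleton of a new task module (macarico/tasks/mytask.py).
# Not the real macarico API: a sketch of the Example/Env contract above.

class Example:
    """A labeled example: an input sequence and its gold labels."""
    def __init__(self, tokens, labels, n_labels):
        self.tokens = tokens
        self.labels = labels
        self.n_labels = n_labels

    def mk_env(self):
        # return an environment particular to this example
        return MyTaskEnv(self)

class MyTaskEnv:
    """One left-to-right pass over the input; one action per token."""
    def __init__(self, example):
        self.example = example
        self.T = len(example.tokens)   # maximum time step

    def run_episode(self, policy):
        self.output = []
        for self.t in range(self.T):
            self.n = self.t            # current position, used by attention
            self.output.append(policy(self))
        return self.output

    def loss(self):
        # Hamming loss: number of mispredicted labels
        return sum(y != p for y, p in zip(self.example.labels, self.output))
```

A policy here is just a callable from states to actions, so `ex.mk_env().run_episode(policy)` runs one episode and `env.loss()` scores it.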
There are two types of features: static features (things that can be precomputed on the input before the environment starts running) and dynamic features (things that depend on the status of the environment).
For static features (like `RNNFeatures`), create a class that derives
from `macarico.Features` (and probably also from `nn.Module` if it has
any of its own parameters). At a minimum, this must define its
dimensionality and give a name to itself (called the `field`). This
`field` can then be referenced either by other features or by
`Attention` modules. The class should define a `_forward` method that
computes the static features, and whose result will be cached
automatically for you. It should return a tensor of dimension (M, dim),
where M is arbitrary (but must be compatible with the `Attention` being
used) and where dim is the pre-declared dimensionality.
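As a dependency-free illustration of this contract (not the real `macarico.Features` API — a real implementation subclasses it and returns torch tensors), here is a bag-of-words-style sketch in which nested lists stand in for an (M, dim) tensor:

```python
# Illustrative sketch of a static feature class. Real implementations
# derive from macarico.Features (and nn.Module); here plain nested
# lists stand in for an (M, dim) tensor to keep the sketch runnable.

class BOWFeaturesSketch:
    def __init__(self, n_types, field='tokens_feats'):
        self.dim = n_types    # pre-declared dimensionality
        self.field = field    # name other modules use to find these features
        self._cached = None

    def _forward(self, env):
        # one one-hot row per input token: shape (M, dim), M = len(tokens)
        rows = []
        for tok in env.example.tokens:
            row = [0.0] * self.dim
            row[tok] = 1.0
            rows.append(row)
        return rows

    def __call__(self, env):
        # static features: computed once per episode and cached
        # (cache reset between episodes is omitted in this sketch)
        if self._cached is None:
            self._cached = self._forward(env)
        return self._cached
```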
For dynamic features (like `TransitionRNN`), create a class as before.
However, instead of defining the static `_forward` function, you must
define your own dynamic `forward` function. This can peek at `state.t`
and `state.T` to get the current and maximum time steps. It should
return features /just/ for the current time step, `state.t`.
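The per-timestep contract can be sketched as follows, in the spirit of `TransitionBOW` (conditioning on the previous action). The `state.output` list of past actions is a hypothetical field used only for illustration:

```python
# Sketch of a dynamic feature class: each call returns features for the
# current time step state.t only, here a one-hot encoding of the
# previous action. `state.output` (past actions) is hypothetical.

class PrevActionFeatures:
    def __init__(self, n_actions):
        self.dim = n_actions

    def forward(self, state):
        # a (1, dim) row of features for time step state.t only
        row = [0.0] * self.dim
        if state.t > 0:
            row[state.output[state.t - 1]] = 1.0
        return [row]
```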
An `Attention` mechanism tells a dynamic model where to look to access
its features. There are two types: hard attention and soft attention.
A hard attention mechanism defines its field (which features it is
attending to) and its arity (how many feature vectors it attends to at
any given time). Then, at runtime, given a `state`, it must return
indices into the corresponding fields based on the state, where the
number of indices is exactly equal to its arity.
A soft attention mechanism still defines its field but declares its
arity to be `None`. This means that instead of returning an /index/
into its input, it must return a /distribution/ over its input as a
torch Variable tensor.
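A hard attention mechanism can be sketched like this (a simplified stand-in in the spirit of `AttendAt`, not the exact macarico `Attention` API):

```python
# Sketch of a hard attention mechanism: it declares the field it reads
# and its arity, and at runtime maps the state to exactly `arity`
# indices into that field. Simplified; not the real macarico API.

class HardAttentionSketch:
    def __init__(self, field='tokens_feats',
                 get_position=lambda state: state.n):
        self.field = field
        self.arity = 1    # hard attention: a fixed number of indices
        self.get_position = get_position

    def __call__(self, state):
        # return exactly `arity` indices into the attended field
        return [self.get_position(state)]
```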
The most basic type of `Learner` behaves like a `Policy`, but
additionally provides an `update` function that, for instance, does
backprop.
Perhaps the simplest example is `MaximumLikelihood`, which just behaves
according to a reference policy, but accumulates an objective function
that is the sum of the losses on the individual predictions. At
`update` time, it runs backprop on this objective.
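Stripped of autograd, the control flow of such a learner looks roughly like this (the `predict` and `loss_for` method names are hypothetical stand-ins for the real policy API):

```python
# Sketch of a MaximumLikelihood-style learner (simplified; no autograd).
# It acts like the reference policy while summing a per-step objective,
# which update() would backprop through in the real implementation.

class MaxLikelihoodSketch:
    def __init__(self, reference, policy):
        self.reference = reference
        self.policy = policy
        self.objective = 0.0

    def __call__(self, state):
        # always run the underlying policy, even though we act according
        # to the reference: the policy may carry hidden state
        self.policy.predict(state)
        a = self.reference(state)
        self.objective += self.policy.loss_for(state, a)  # hypothetical helper
        return a

    def update(self, loss):
        # real version: backprop on self.objective
        pass
```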
One very important property of Learners: even if they do not use the
return value of their underlying policy, they must call the underlying
policy every time they run. Why? Because the underlying policy may
accumulate state (as in the case of `TransitionRNN`), and if it is
"skipped" the policy will become very confused, because it will have
missed some input.
A slightly different example is `Reinforce`, which implements the
REINFORCE RL algorithm. This Learner does not explicitly accumulate an
objective that it then backprops on; instead, it uses the fact that
stochastic choices can be backpropped through automatically using
torch's `.reinforce` function.
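To make the underlying idea concrete outside of torch, here is a minimal numeric sketch of the score-function (REINFORCE) gradient estimator: the gradient of the expected loss is estimated by `loss * grad-log-prob` of the sampled action. The setup (a one-parameter Bernoulli policy and 0/1 loss) is invented purely for illustration:

```python
# Minimal numeric sketch of the REINFORCE estimator (no torch).
# Purely illustrative; macarico delegates this to torch's machinery.
import math
import random

def reinforce_grad(theta, episodes, rng=random.random):
    """One-parameter Bernoulli policy p = sigmoid(theta).
    Estimates d E[loss] / d theta, with loss = 1 unless action 1 is taken."""
    p = 1.0 / (1.0 + math.exp(-theta))
    grad = 0.0
    for _ in range(episodes):
        a = 1 if rng() < p else 0
        loss = 0.0 if a == 1 else 1.0
        # d log pi(a) / d theta for a Bernoulli(sigmoid(theta)) policy
        dlogp = (1.0 - p) if a == 1 else -p
        grad += loss * dlogp
    return grad / episodes
```

Descending this gradient estimate increases the probability of the low-loss action, which is exactly the behavior REINFORCE-based Learners rely on.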
Because we have designed maçarico to be as modular as possible, there are some places where the different pieces need to "talk" to each other.
Let's take `test_sequence_labeler.py` as an example. In `test_demo`,
we have code that looks like:

```python
data = [Example(x, y, n_labels) for x, y in ...]
```
This constructs `sequence_labeler.Example` data structures and calls
them `data`. If you look at the `Example` data structure, you find it
has two main components: `tokens` and `labels`, corresponding to the
`x` and `y` above, respectively.
Next, we build some static features:

```python
features = RNNFeatures(n_types,
                       input_field='tokens',
                       output_field='tokens_feats')
```
This constructs a biLSTM over the inputs. Where does it look? It looks
in `tokens` because that's the specified input field. And it stores
the features generated by the biLSTM in `tokens_feats`. You can
therefore think of `features` as something that maps from `tokens` to
`tokens_feats`. (Note: those two arguments are the defaults and could
have been left off for convenience, but here we're trying to make
everything explicit.)
Next, we need an actor. The actor is the thing that takes a state of
the world and produces a feature representation. (This feature
representation will later be consumed by the `Policy` to predict an
action.) However, the actor needs to attend somewhere when making
predictions. In this case, when the environment (the sequence labeler)
is predicting the label of the `n`th word, the actor should look at
that word! This can be done with the `AttendAt` attention mechanism.

```python
attention = AttendAt(field='tokens_feats',
                     get_position=lambda state: state.n)
```
This constructs an attention mechanism that essentially returns
`tokens_feats[state.n]` when the environment state is on word `n`.
Note that this hinges on the fact that we know that the environment
stores the "current position" in `state.n`. (Again, these arguments
are the defaults and could be left off.)
Next, we can construct the actor. The actor is itself an RNN (not
bidirectional this time), which uses the biLSTM features we built
above, together with the simple attention mechanism.

```python
actor = TransitionRNN([features],
                      [attention],
                      n_labels)
```
Finally, we can construct the policy. In this case, it's just a linear
function that maps from the `actor`'s feature representation to one of
`n_labels` actions:

```python
policy = LinearPolicy(actor, n_labels)
```
Tracing back through this: the policy maps from a state feature
representation to an action. This mapping is done by `LinearPolicy`.
But where does the state feature representation come from? It comes
from the `actor`, which, when labeling word `n`, asks its attention
model(s) what features to use. In this case, the attention tells it to
look at `tokens_feats[n]`, where `tokens_feats[n]` is the output of
the biLSTM.
We will now train this model with DAgger. In order to do this, we need to anneal the degree to which rollin is done according to the reference policy versus the learned policy:
```python
p_rollin_ref = stochastic(ExponentialAnnealing(0.99))
```
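The `ExponentialAnnealing`/`stochastic` pair can be pictured with this simplified stand-in (the names and API here are approximations for illustration, not the real macarico classes): the annealed probability decays as 0.99^t, and at each step a coin is flipped with that probability to decide whether to roll in with the reference.

```python
# Simplified stand-in for ExponentialAnnealing + stochastic (not the
# real macarico API): value decays as alpha**t; step() advances the
# schedule; use_reference flips a coin with the current probability.
import random

class ExponentialAnnealingSketch:
    def __init__(self, alpha):
        self.alpha = alpha
        self.t = 0

    def value(self):
        return self.alpha ** self.t

    def step(self):
        self.t += 1

def use_reference(anneal, rng=random.random):
    # True -> this step rolls in with the reference policy
    return rng() < anneal.value()
```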
Next, we construct an optimizer. This is exactly like you would do in
pytorch, extracting parameters from the `policy`:

```python
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)
```
And now we can train:

```python
for epoch in range(5):
    # train on each example, one at a time
    for ex in data:
        optimizer.zero_grad()
        learner = DAgger(ref, policy, p_rollin_ref)
        env = ex.mk_env()
        env.run_episode(learner)
        learner.update(env.loss())
        optimizer.step()
        p_rollin_ref.step()

# now make some predictions
for ex in data:
    env = ex.mk_env()
    out = env.run_episode(policy)
    print('prediction = %s' % out)
```