:label:sec_sequence
Imagine that you are watching movies on Netflix. As a good Netflix user, you decide to rate each of the movies religiously. After all, a good movie is a good movie, and you want to watch more of them, right? As it turns out, things are not quite so simple. People's opinions on movies can change quite significantly over time. In fact, psychologists even have names for some of the effects:
- There is anchoring, based on someone else's opinion. For instance, after the Oscar awards, ratings for the corresponding movie go up, even though it is still the same movie. This effect persists for a few months until the award is forgotten. It has been shown that the effect lifts ratings by over half a point :cite:Wu.Ahmed.Beutel.ea.2017.
- There is hedonic adaptation, where humans quickly adapt to accept an improved or a worsened situation as the new normal. For instance, after watching many good movies, the expectation that the next movie will be equally good or better is high. Hence, even an average movie might be considered bad after many great ones have been watched.
- There is seasonality. Very few viewers like to watch a Santa Claus movie in August.
- In some cases, movies become unpopular due to the misbehaviors of directors or actors in the production.
- Some movies become cult movies, because they were almost comically bad. Plan 9 from Outer Space and Troll 2 achieved a high degree of notoriety for this reason.
In short, movie ratings are anything but stationary. Thus, using temporal dynamics led to more accurate movie recommendations :cite:Koren.2009.
Of course, sequence data are not just about movie ratings. Here are a few more illustrations:
- Many users have highly particular behavior when it comes to the time when they open apps. For instance, social media apps are much more popular with students after school. Stock market trading apps are more commonly used when the markets are open.
- It is much harder to predict tomorrow's stock prices than to fill in the blanks for a stock price we missed yesterday, even though both are just a matter of estimating one number. After all, foresight is so much harder than hindsight. In statistics, the former (predicting beyond the known observations) is called extrapolation whereas the latter (estimating between the existing observations) is called interpolation.
- Music, speech, text, and videos are all sequential in nature. If we were to permute them they would make little sense. The headline "dog bites man" is much less surprising than "man bites dog", even though the words are identical.
- Earthquakes are strongly correlated, i.e., after a massive earthquake there are very likely several smaller aftershocks, much more so than without the strong quake. In fact, earthquakes are spatiotemporally correlated, i.e., the aftershocks typically occur within a short time span and in close proximity.
- Humans interact with each other in a sequential nature, as can be seen in Twitter fights, dance patterns, and debates.
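The interpolation-versus-extrapolation point above can be made concrete with a small sketch. The cubic polynomial and the specific points below are hypothetical choices for illustration only: we fit the polynomial to a sine wave observed on a limited range and compare its error inside versus outside that range.

```python
import numpy as np

# Observe a sine wave only on the range [0, 4]
x_obs = np.linspace(0, 4, 50)
y_obs = np.sin(x_obs)

# Fit a cubic polynomial to the observed span
poly = np.poly1d(np.polyfit(x_obs, y_obs, deg=3))

# Interpolation: estimate at a point inside the observed range
interp_err = abs(poly(2.0) - np.sin(2.0))
# Extrapolation: predict at a point well beyond the observed range
extrap_err = abs(poly(8.0) - np.sin(8.0))

print(f'interpolation error: {interp_err:.3f}')
print(f'extrapolation error: {extrap_err:.3f}')
```

The fit is decent between existing observations but diverges quickly beyond them, which is exactly why hindsight is easier than foresight.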
We need statistical tools and new deep neural network architectures to deal with sequence data. To keep things simple, we use the stock price (FTSE 100 index) illustrated in :numref:fig_ftse100 as an example.
Let us denote the prices by $x_t$, i.e., at time step $t$ we observe price $x_t$. For a trader to do well in the stock market on day $t$, they will want to predict $x_t$ via $x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)$.
In order to achieve this, our trader could use a regression model such as the one that we trained in :numref:sec_linear_concise.
There is just one major problem: the number of inputs, $x_{t-1}, \ldots, x_1$, varies with $t$. That is, the number of inputs increases with the amount of data that we encounter, and we will need an approximation to make this computationally tractable. Much of what follows in this chapter will revolve around how to estimate $P(x_t \mid x_{t-1}, \ldots, x_1)$ efficiently. In a nutshell, it boils down to two strategies.
First, assume that the potentially rather long sequence $x_{t-1}, \ldots, x_1$ is not really necessary. In this case we might content ourselves with some timespan of length $\tau$ and only use observations $x_{t-1}, \ldots, x_{t-\tau}$. The immediate benefit is that now the number of arguments is always the same, at least for $t > \tau$. This allows us to train a deep network as indicated above. Such models will be called autoregressive models, as they quite literally perform regression on themselves.
The second strategy, shown in :numref:fig_sequence-model, is to keep some summary $h_t$ of the past observations, and at the same time update $h_t$ in addition to the prediction $\hat{x}_t$. This leads to models that estimate $x_t$ with $\hat{x}_t = P(x_t \mid h_t)$ and updates of the form $h_t = g(h_{t-1}, x_{t-1})$. Since $h_t$ is never observed, these models are also called latent autoregressive models.
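As a minimal sketch of this latent-state idea, the choices of $g$ and $f$ below are hypothetical, made only for illustration: the summary is an exponential moving average of past observations, and the prediction is the summary itself. The point is that the state has a fixed size no matter how long the history grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(0.1 * np.arange(100)) + 0.1 * rng.standard_normal(100)

# Hypothetical choices for the sketch: h_t is an exponential moving
# average of past observations, and the prediction is h_t itself
alpha = 0.5
h = 0.0
preds = []
for t in range(1, len(x)):
    h = alpha * h + (1 - alpha) * x[t - 1]  # h_t = g(h_{t-1}, x_{t-1})
    preds.append(h)                         # prediction of x_t from h_t

preds = np.array(preds)
mse = np.mean((preds - x[1:]) ** 2)
print(f'one-step MSE of the latent-state sketch: {mse:.4f}')
```

Deep latent autoregressive models replace these hand-picked update and prediction functions with learned ones, but the bookkeeping is the same.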
Both cases raise the obvious question of how to generate training data. One typically uses historical observations to predict the next observation given the ones up to right now. Obviously we do not expect time to stand still. However, a common assumption is that while the specific values of $x_t$ may change, the dynamics of the sequence itself will not. Statisticians call dynamics that do not change stationary. Regardless of what we do, we will thus get an estimate of the entire sequence via $$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}, \ldots, x_1).$$
Note that the above considerations still hold if we deal with discrete objects, such as words, rather than continuous numbers. The only difference is that in such a situation we need to use a classifier rather than a regression model to estimate $P(x_t \mid x_{t-1}, \ldots, x_1)$.
Recall the approximation that in an autoregressive model we use only $x_{t-1}, \ldots, x_{t-\tau}$ instead of $x_{t-1}, \ldots, x_1$ to estimate $x_t$. Whenever this approximation is accurate we say that the sequence satisfies a Markov condition. In particular, if $\tau = 1$, we have a first-order Markov model and $P(x)$ is given by $$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}) \text{ where } P(x_1 \mid x_0) = P(x_1).$$
Such models are particularly nice whenever $x_t$ assumes only a discrete value, since in this case dynamic programming can be used to compute values along the chain exactly. For instance, we can compute $P(x_{t+1} \mid x_{t-1})$ efficiently by using the fact that we only need to take into account a very short history of past observations: $$P(x_{t+1} \mid x_{t-1}) = \sum_{x_t} P(x_{t+1} \mid x_t) P(x_t \mid x_{t-1}).$$
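For a discrete first-order Markov chain, marginalizing out $x_t$ in this way is just a matrix product: if $P[i, j] = P(x_{t+1} = j \mid x_t = i)$, the two-step transition probabilities are the entries of $P^2$. A small sketch with a made-up two-state transition matrix:

```python
import numpy as np

# Hypothetical two-state transition matrix: P[i, j] = P(x_{t+1}=j | x_t=i)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Two-step transition: P(x_{t+1} | x_{t-1}) = sum_{x_t} P(x_{t+1}|x_t) P(x_t|x_{t-1})
P2 = P @ P

print(P2)
# Each row of P2 is still a probability distribution over the next state
print(P2.sum(axis=1))
```

The same summation, applied repeatedly, gives $k$-step transition probabilities as $P^k$, which is the dynamic-programming shortcut the text alludes to.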
In principle, there is nothing wrong with unfolding $P(x_1, \ldots, x_T)$ in reverse order. After all, by conditioning we can always write it via $$P(x_1, \ldots, x_T) = \prod_{t=T}^1 P(x_t \mid x_{t+1}, \ldots, x_T).$$
In fact, if we have a Markov model, we can obtain a reverse conditional probability distribution, too. In many cases, however, there exists a natural direction for the data, namely going forward in time. It is clear that future events cannot influence the past. Hence, if we change $x_t$, we may be able to influence what happens for $x_{t+1}$ going forward but not the converse. For instance, it has been shown that in some cases we can find $x_{t+1} = f(x_t) + \epsilon$ for some additive noise $\epsilon$, whereas the converse is not true :cite:Hoyer.Janzing.Mooij.ea.2009. This is great news, since it is typically the forward direction that we are interested in estimating.
The book by Peters et al. explains more on this topic :cite:Peters.Janzing.Scholkopf.2017; we are barely scratching the surface of it.
After reviewing so many statistical tools, let us try this out in practice. We begin by generating some data. To keep things simple, we generate our sequence data by using a sine function with some additive noise for time steps $1, 2, \ldots, 1000$.
```python
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import nn

npx.set_np()
```
```python
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch
from torch import nn
```
```python
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf
```
```python
#@tab mxnet, pytorch
T = 1000  # Generate a total of 1000 points
time = d2l.arange(1, T + 1, dtype=d2l.float32)
x = d2l.sin(0.01 * time) + d2l.normal(0, 0.2, (T,))
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))
```
```python
#@tab tensorflow
T = 1000  # Generate a total of 1000 points
time = d2l.arange(1, T + 1, dtype=d2l.float32)
x = d2l.sin(0.01 * time) + d2l.normal([T], 0, 0.2)
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))
```
Next, we need to turn such a sequence into features and labels that our model can train on. Based on the embedding dimension $\tau$ we map the data into pairs $y_t = x_t$ and $\mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]$. Note that this gives us $\tau$ fewer data examples, since we do not have sufficient history for the first $\tau$ of them; here we simply discard those few terms.
```python
#@tab mxnet, pytorch
tau = 4
features = d2l.zeros((T - tau, tau))
for i in range(tau):
    features[:, i] = x[i: T - tau + i]
labels = d2l.reshape(x[tau:], (-1, 1))
```
```python
#@tab tensorflow
tau = 4
features = tf.Variable(d2l.zeros((T - tau, tau)))
for i in range(tau):
    features[:, i].assign(x[i: T - tau + i])
labels = d2l.reshape(x[tau:], (-1, 1))
```
```python
#@tab all
batch_size, n_train = 16, 600
# Only the first `n_train` examples are used for training
train_iter = d2l.load_array((features[:n_train], labels[:n_train]),
                            batch_size, is_train=True)
```
Here we keep the architecture fairly simple: just an MLP with two fully-connected layers, ReLU activation, and squared loss.
```python
# A simple MLP
def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(10, activation='relu'),
            nn.Dense(1))
    net.initialize(init.Xavier())
    return net

# Square loss
loss = gluon.loss.L2Loss()
```
```python
#@tab pytorch
# Function for initializing the weights of the network
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

# A simple MLP
def get_net():
    net = nn.Sequential(nn.Linear(4, 10),
                        nn.ReLU(),
                        nn.Linear(10, 1))
    net.apply(init_weights)
    return net

# Note: `MSELoss` computes squared error without the 1/2 factor
loss = nn.MSELoss(reduction='none')
```
```python
#@tab tensorflow
# Vanilla MLP architecture
def get_net():
    net = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='relu'),
                               tf.keras.layers.Dense(1)])
    return net

# Note: `MeanSquaredError` computes squared error without the 1/2 factor
loss = tf.keras.losses.MeanSquaredError()
```
Now we are ready to train the model. The code below is essentially identical to the training loop in previous sections, such as :numref:sec_linear_concise, so we will not delve into much detail.
```python
def train(net, train_iter, loss, epochs, lr):
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    for epoch in range(epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        print(f'epoch {epoch + 1}, '
              f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')

net = get_net()
train(net, train_iter, loss, 5, 0.01)
```
```python
#@tab pytorch
def train(net, train_iter, loss, epochs, lr):
    trainer = torch.optim.Adam(net.parameters(), lr)
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.sum().backward()
            trainer.step()
        print(f'epoch {epoch + 1}, '
              f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')

net = get_net()
train(net, train_iter, loss, 5, 0.01)
```
```python
#@tab tensorflow
def train(net, train_iter, loss, epochs, lr):
    trainer = tf.keras.optimizers.Adam()
    for epoch in range(epochs):
        for X, y in train_iter:
            with tf.GradientTape() as g:
                out = net(X)
                l = loss(y, out)
                params = net.trainable_variables
                grads = g.gradient(l, params)
            trainer.apply_gradients(zip(grads, params))
        print(f'epoch {epoch + 1}, '
              f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')

net = get_net()
train(net, train_iter, loss, 5, 0.01)
```
Since the training loss is small, we would expect our model to work well. Let us see what this means in practice. The first thing to check is how well the model is able to predict what happens just in the next time step, namely the one-step-ahead prediction.
```python
#@tab all
onestep_preds = net(features)
d2l.plot([time, time[tau:]], [d2l.numpy(x), d2l.numpy(onestep_preds)], 'time',
         'x', legend=['data', '1-step preds'], xlim=[1, 1000], figsize=(6, 3))
```
The one-step-ahead predictions look nice, just as we expected.
Even beyond 604 (`n_train + tau`) observations the predictions still look trustworthy. However, there is just one little problem: if we observe sequence data only until time step 604, we cannot hope to receive the inputs for all the future one-step-ahead predictions.
Instead, we need to work our way forward one step at a time:
$$\hat{x}_{605} = f(x_{601}, x_{602}, x_{603}, x_{604}), \\
\hat{x}_{606} = f(x_{602}, x_{603}, x_{604}, \hat{x}_{605}), \\
\hat{x}_{607} = f(x_{603}, x_{604}, \hat{x}_{605}, \hat{x}_{606}), \\
\hat{x}_{608} = f(x_{604}, \hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}), \\
\hat{x}_{609} = f(\hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}, \hat{x}_{608}), \\
\ldots$$
Generally, for an observed sequence up to $x_t$, its predicted output $\hat{x}_{t+k}$ at time step $t+k$ is called the $k$-step-ahead prediction. Since we have observed up to $x_{604}$, its $k$-step-ahead prediction is $\hat{x}_{604+k}$. In other words, we will have to use our own predictions to make multistep-ahead predictions. Let us see how well this goes.
```python
#@tab mxnet, pytorch
multistep_preds = d2l.zeros(T)
multistep_preds[: n_train + tau] = x[: n_train + tau]
for i in range(n_train + tau, T):
    multistep_preds[i] = net(
        d2l.reshape(multistep_preds[i - tau: i], (1, -1)))
```
```python
#@tab tensorflow
multistep_preds = tf.Variable(d2l.zeros(T))
multistep_preds[:n_train + tau].assign(x[:n_train + tau])
for i in range(n_train + tau, T):
    multistep_preds[i].assign(d2l.reshape(net(
        d2l.reshape(multistep_preds[i - tau: i], (1, -1))), ()))
```
```python
#@tab all
d2l.plot([time, time[tau:], time[n_train + tau:]],
         [d2l.numpy(x), d2l.numpy(onestep_preds),
          d2l.numpy(multistep_preds[n_train + tau:])], 'time',
         'x', legend=['data', '1-step preds', 'multistep preds'],
         xlim=[1, 1000], figsize=(6, 3))
```
As the above example shows, this is a spectacular failure. The predictions decay to a constant pretty quickly after a few prediction steps.
Why did the algorithm work so poorly?
This is ultimately due to the fact that the errors build up.
Let us say that after step 1 we have some error $\epsilon_1 = \bar\epsilon$. Now the input for step 2 is perturbed by $\epsilon_1$, hence we suffer some error in the order of $\epsilon_2 = \bar\epsilon + c \epsilon_1$ for some constant $c$, and so on. The errors can thus diverge rather rapidly from the true observations. This is a common phenomenon: for instance, weather forecasts for the next 24 hours tend to be pretty accurate, but beyond that their accuracy declines rapidly.
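The compounding described by this recursion can be sketched directly. The constants $\bar\epsilon$ and $c$ below are made up purely for illustration; the point is that whenever $c > 1$, the accumulated error grows geometrically with the number of prediction steps.

```python
# Hypothetical constants for illustration: base error per step and
# the factor by which a perturbed input inflates the next error
eps_bar, c = 0.01, 1.5

errors = [eps_bar]           # error after step 1
for _ in range(19):          # steps 2 through 20
    errors.append(eps_bar + c * errors[-1])

print(f'error after  1 step : {errors[0]:.4f}')
print(f'error after  5 steps: {errors[4]:.4f}')
print(f'error after 20 steps: {errors[-1]:.2f}')
```

After 20 steps the error is several thousand times the per-step error, which mirrors the collapse of the multistep predictions below.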
Let us take a closer look at the difficulties in $k$-step-ahead predictions by computing predictions on the entire sequence for $k = 1, 4, 16, 64$.
```python
#@tab all
max_steps = 64
```
```python
#@tab mxnet, pytorch
features = d2l.zeros((T - tau - max_steps + 1, tau + max_steps))
# Column `i` (`i` < `tau`) are observations from `x` for time steps from
# `i + 1` to `i + T - tau - max_steps + 1`
for i in range(tau):
    features[:, i] = x[i: i + T - tau - max_steps + 1]

# Column `i` (`i` >= `tau`) are the (`i - tau + 1`)-step-ahead predictions for
# time steps from `i + 1` to `i + T - tau - max_steps + 1`
for i in range(tau, tau + max_steps):
    features[:, i] = d2l.reshape(net(features[:, i - tau: i]), -1)
```
```python
#@tab tensorflow
features = tf.Variable(d2l.zeros((T - tau - max_steps + 1, tau + max_steps)))
# Column `i` (`i` < `tau`) are observations from `x` for time steps from
# `i + 1` to `i + T - tau - max_steps + 1`
for i in range(tau):
    features[:, i].assign(x[i: i + T - tau - max_steps + 1].numpy())

# Column `i` (`i` >= `tau`) are the (`i - tau + 1`)-step-ahead predictions for
# time steps from `i + 1` to `i + T - tau - max_steps + 1`
for i in range(tau, tau + max_steps):
    features[:, i].assign(d2l.reshape(net((features[:, i - tau: i])), -1))
```
```python
#@tab all
steps = (1, 4, 16, 64)
d2l.plot([time[tau + i - 1: T - max_steps + i] for i in steps],
         [d2l.numpy(features[:, tau + i - 1]) for i in steps], 'time', 'x',
         legend=[f'{i}-step preds' for i in steps], xlim=[5, 1000],
         figsize=(6, 3))
```
This clearly illustrates how the quality of the prediction changes as we try to predict further into the future. While the 4-step-ahead predictions still look good, anything beyond that is almost useless.
- There is quite a difference in difficulty between interpolation and extrapolation. Consequently, if you have a sequence, always respect the temporal order of the data when training, i.e., never train on future data.
- Sequence models require specialized statistical tools for estimation. Two popular choices are autoregressive models and latent-variable autoregressive models.
- For causal models (e.g., time going forward), estimating the forward direction is typically a lot easier than the reverse direction.
- For an observed sequence up to time step $t$, its predicted output at time step $t+k$ is the $k$-step-ahead prediction. As we predict further in time by increasing $k$, the errors accumulate and the quality of the prediction degrades, often dramatically.
- Improve the model in the experiment of this section.
  - Incorporate more than the past 4 observations. How many do you really need?
  - How many past observations would you need if there was no noise? Hint: you can write $\sin$ and $\cos$ as a differential equation.
  - Can you incorporate older observations while keeping the total number of features constant? Does this improve accuracy? Why?
  - Change the neural network architecture and evaluate the performance.
- An investor wants to find a good security to buy. They look at past returns to decide which one is likely to do well. What could possibly go wrong with this strategy?
- Does causality also apply to text? To which extent?
- Give an example for when a latent autoregressive model might be needed to capture the dynamic of the data.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: