[ML] Improve time series decomposition in the presence of change points #198

Conversation

@tveasey (Contributor) commented Sep 4, 2018

The problems related to #159 showed up because of a fundamental issue with the way we decompose a time series in the presence of change points. In particular, we don't consider that the window of values we sketch might include the common types of change point which we've handled in the rest of our modelling since #92. There are two potential issues as a result:

  1. We aren't able to find significant components of the time series because they are distorted by the changes.
  2. If we do detect a periodic component, we can get poorly initialised components as a result. (The unit test disabled by #159, "[ML] do not clear periodicity test after detecting a changepoint", is an example of this problem; this change re-enables it.)

This PR tackles the problem head-on by allowing for the common types of change point we handle, linear scaling and piecewise linear trend, when testing for and initialising new components.

The bulk of this change is the code to perform and test top-down segmentation of a window of values, CTimeSeriesSegmentation, allowing for either a piecewise linear trend or piecewise constant modulation of a seasonal component. We use our standard approach of doing model selection based on the statistical significance of the explained variance to choose the segmentation. (We can do this relatively easily because each candidate partition creates a nested hypothesis.) Although bottom-up partitioning is generally considered preferable, this code is primarily concerned with large change events, which a top-down approach will identify. Top-down segmentation is also much more compute efficient, which matters because we run it multiple times for the different hypotheses we test to find the best decomposition.
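To give a flavour of the approach, here is a minimal, self-contained sketch of top-down segmentation for the piecewise linear trend case. It is not the PR's CTimeSeriesSegmentation: the function names, the SIGNIFICANT ratio and the MIN_SEGMENT guard are all stand-ins for the proper significance test and nested model selection described above. A candidate split is proposed greedily and kept only if it explains substantially more variance than the unsplit fit.

#include <cstddef>
#include <iostream>
#include <vector>

namespace {

// Residual variance of the least squares linear fit to values[a, b).
double residualVariance(const std::vector<double>& values, std::size_t a, std::size_t b) {
    double n = static_cast<double>(b - a);
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = a; i < b; ++i) {
        double x = static_cast<double>(i);
        double y = values[i];
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double denom = n * sxx - sx * sx;
    double slope = denom != 0.0 ? (n * sxy - sx * sy) / denom : 0.0;
    double intercept = (sy - slope * sx) / n;
    double ss = 0.0;
    for (std::size_t i = a; i < b; ++i) {
        double r = values[i] - (intercept + slope * static_cast<double>(i));
        ss += r * r;
    }
    return ss / n;
}

const double SIGNIFICANT = 4.0;     // crude stand-in for a proper significance test
const std::size_t MIN_SEGMENT = 8;  // don't propose very short segments

// Recursively split values[a, b) at the point which most reduces the residual
// variance, keeping the split only if the reduction is "significant".
void topDownSplit(const std::vector<double>& values, std::size_t a, std::size_t b,
                  std::vector<std::size_t>& splits) {
    double v0 = residualVariance(values, a, b);
    double bestV = v0;
    std::size_t bestSplit = 0;
    for (std::size_t s = a + MIN_SEGMENT; s + MIN_SEGMENT <= b; ++s) {
        double nl = static_cast<double>(s - a);
        double nr = static_cast<double>(b - s);
        double v = (nl * residualVariance(values, a, s) +
                    nr * residualVariance(values, s, b)) / (nl + nr);
        if (v < bestV) {
            bestV = v;
            bestSplit = s;
        }
    }
    // The one and two segment models are nested, so their explained variances
    // are directly comparable. The absolute guard avoids splitting on noise.
    if (bestSplit != 0 && v0 > SIGNIFICANT * bestV && v0 - bestV > 1e-6) {
        topDownSplit(values, a, bestSplit, splits);
        splits.push_back(bestSplit);
        topDownSplit(values, bestSplit, b, splits);
    }
}
}

int main() {
    // A piecewise linear trend whose slope changes at t = 50.
    std::vector<double> values;
    for (std::size_t t = 0; t < 100; ++t) {
        values.push_back(t < 50 ? 0.1 * static_cast<double>(t)
                                : 5.0 + static_cast<double>(t - 50));
    }
    std::vector<std::size_t> splits;
    topDownSplit(values, 0, values.size(), splits);
    for (auto s : splits) {
        std::cout << "change point at " << s << '\n';
    }
    return 0;
}

On this toy series the sketch reports a single change point at t = 50 and then stops, because further splits do not significantly increase the explained variance.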

times.push_back(time);
trend.push_back(weight * daily);
}
for (auto offset : {0 * DAY, 5 * DAY}) {
@tveasey (Contributor Author) commented:
This is just the same test performed for different offsets of the start time: the bulk of the changes are indentation.


stats.s_V0 = CBasicStatistics::variance(moments);
stats.s_DF0 = b + static_cast<double>(stats.s_Segmentation.size() - 2);
} else {
@tveasey (Contributor Author) commented:
The old function was this branch.

if (stats.s_Segmentation.size() > 0) {
buckets = CTimeSeriesSegmentation::removePiecewiseLinearScaledPeriodic(
buckets, stats.s_Segmentation, stats.s_T0[0], stats.s_Scales);
} else {
@tveasey (Contributor Author) commented:
The old function was this branch.

return this->testVariance(window, values, period_, df1, v1, stats, R,
meanRepeats, pVariance) ||
this->testAmplitude(window, values, period_, b, v, R, meanRepeats,
pVariance, stats);
@tveasey (Contributor Author) commented:
I factored out testVariance and testAmplitude from this function (for reuse in CPeriodicityHypothesisTests::testPeriodWithScaling). These are what is being compared to the following code.
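To illustrate the shape of that refactor, the following sketch shows the shared checks as standalone helpers which both the plain period test and the scaled period test call. The signatures and bodies are simplified stand-ins, not the PR's code: testVariance here just compares the residual variance about the mean repeating pattern with the total variance, and testAmplitude is left as a placeholder.

#include <cstddef>
#include <numeric>
#include <vector>

namespace {
using TDoubleVec = std::vector<double>;

double variance(const TDoubleVec& values) {
    double mean = std::accumulate(values.begin(), values.end(), 0.0) /
                  static_cast<double>(values.size());
    double ss = 0.0;
    for (double v : values) { ss += (v - mean) * (v - mean); }
    return ss / static_cast<double>(values.size());
}

// Variance of the residuals about the mean repeating pattern with the
// candidate period.
double residualVariance(const TDoubleVec& values, std::size_t period) {
    TDoubleVec pattern(period, 0.0);
    std::vector<std::size_t> counts(period, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        pattern[i % period] += values[i];
        ++counts[i % period];
    }
    for (std::size_t j = 0; j < period; ++j) {
        pattern[j] /= static_cast<double>(counts[j]);
    }
    double ss = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        double r = values[i] - pattern[i % period];
        ss += r * r;
    }
    return ss / static_cast<double>(values.size());
}

bool testVariance(const TDoubleVec& values, std::size_t period) {
    // Accept if the periodic pattern explains most of the variance. The fixed
    // ratio stands in for the proper significance test.
    return residualVariance(values, period) < 0.5 * variance(values);
}

bool testAmplitude(const TDoubleVec& values, std::size_t period) {
    // Placeholder for the spike / amplitude check.
    (void)values; (void)period;
    return false;
}

bool testPeriod(const TDoubleVec& values, std::size_t period) {
    // Accept the period if either check passes.
    return testVariance(values, period) || testAmplitude(values, period);
}

bool testPeriodWithScaling(TDoubleVec values, std::size_t period) {
    // Remove the piecewise constant scaling first (elided here), then reuse
    // exactly the same checks on the rescaled window.
    return testVariance(values, period) || testAmplitude(values, period);
}
}

int main() {
    TDoubleVec window;
    for (std::size_t i = 0; i < 96; ++i) {
        window.push_back(i % 24 < 12 ? 1.0 : -1.0); // square wave, period 24
    }
    return testPeriod(window, 24) && testPeriodWithScaling(window, 24) ? 0 : 1;
}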

@tveasey requested a review from edsavage on September 28, 2018 08:29
@edsavage (Contributor) left a comment:
This looks good to me Tom, just a few questions here and there...


private:
using THypothesisVec = std::vector<CNestedHypotheses>;

private:
//! The test.
TTestFunc m_Test;
//! True if the hypothesis used a piecewise linear trend.
std::size_t m_TrendSegments;
@edsavage (Contributor) commented:
I'm not sure that the comments here quite match the type of the member variable

@tveasey (Contributor Author) commented:
Agreed. This used to be a bool, but we actually need the number of segments in the calling code, so this changed. I'll update.
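For illustration only, the updated member might read along these lines (a guess at the eventual wording, not the actual commit):

//! The number of segments of the piecewise linear trend used by the hypothesis.
std::size_t m_TrendSegments;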

@tveasey (Contributor Author) commented:
See this commit.

@@ -281,7 +286,7 @@ void project(const TFloatMeanAccumulatorVec& values,
core_t::TTime bucketLength,
TFloatMeanAccumulatorVec& result) {
result.clear();
-    if (!values.empty()) {
+    if (values.size() > 0) {
@edsavage (Contributor) commented:
I'm curious about the change here (and elsewhere)... why not prefer empty over checking size?

@tveasey (Contributor Author) commented:
Our standards now say not to use ! but instead == false, to match Elastic Java style. I personally find size() > 0 easier to read at a glance than empty() == false. I changed this to be more compliant with our style guidelines because I was already making quite a few changes in this file.
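To make the style point concrete (illustrative only, not code from this PR):

#include <vector>

void example(const std::vector<double>& values) {
    // Disallowed by the style guide:  if (!values.empty()) { ... }
    // Compliant, matching Elastic Java style:
    if (values.empty() == false) { /* ... */ }
    // Also compliant and, arguably, easier to scan:
    if (values.size() > 0) { /* ... */ }
}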

double B{static_cast<double>(std::count_if(
// Compute the number of non-empty buckets. This needs to account for
// projecting onto windows.
stats.s_B = static_cast<double>(std::count_if(
@edsavage (Contributor) commented:
It might be useful to take the opportunity to rename the data members of STestStats to something more descriptive?

@tveasey (Contributor Author) commented:
See this commit.

@edsavage (Contributor) commented:
Lovely, thanks Tom

rng.generateNormalSamples(0.0, 1.0, trend.size(), noise);

core_t::TTime startTesting{pad + 28 * HOUR};
//TDoubleVec thresholds[]{TDoubleVec{0.08, 0.08}, TDoubleVec{0.18, 0.15}};
@edsavage (Contributor) commented:
nitpick: delete this line rather than comment out?

@tveasey merged commit a8e7cb4 into elastic:master Oct 3, 2018
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Oct 3, 2018
tveasey added a commit that referenced this pull request Oct 4, 2018
@tveasey deleted the enhancement/improve-timeseries-decomposition-with-change-points branch May 1, 2019 14:58