[ML] Improve change point detection for long bucket lengths #95
Conversation
double p{CTools::logisticFunction(x[0], 0.05, 1.0) *
         CTools::logisticFunction(x[1], 0.1, 1.0) *
         (x[2] < 0.0 ? 1.0 : CTools::logisticFunction(x[2], 0.2, 1.0)) *
         CTools::logisticFunction(x[3], 0.2, 0.5)};
LOG_TRACE("p(" << (*candidates[0].second)->change()->print() << ") = " << p
          << " | x = " << core::CContainerPrinter::print(x));
LOG_TRACE(<< "df(" << (*candidates[0].second)->change()->print() << ") = " << p / 0.03125
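The score above is the product of four logistic gates, one per feature of the candidate change, so p lies in (0, 1] and every feature must agree before a change is accepted. A minimal self-contained sketch of that shape, assuming CTools::logisticFunction behaves like a standard sigmoid parameterised by width and offset (an assumption about its behaviour, not its exact ml-cpp implementation):

```cpp
#include <cassert>
#include <cmath>

// Assumed sigmoid: rises from 0 to 1 around `offset` over scale `width`.
double logistic(double x, double width, double offset) {
    return 1.0 / (1.0 + std::exp(-(x - offset) / width));
}

// Mirrors the product-of-gates structure from the diff; each factor
// suppresses the score unless its feature clears the corresponding offset.
double decisionFunction(const double (&x)[4]) {
    return logistic(x[0], 0.05, 1.0) *
           logistic(x[1], 0.1, 1.0) *
           (x[2] < 0.0 ? 1.0 : logistic(x[2], 0.2, 1.0)) *
           logistic(x[3], 0.2, 0.5);
}
```

With every gated feature sitting exactly at its offset, each active factor evaluates to 0.5, so the product is 0.5 * 0.5 * 1.0 * 0.5 = 0.125 when x[2] is negative.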
It would make it clearer to the reader to use MAXIMUM_DECISION_FUNCTION * p rather than p / 0.03125, even though they evaluate to the same thing.
Agreed.
m_Windows[test]->initialize(time);

void CTimeSeriesDecompositionDetail::CPeriodicityTest::maybeClear(core_t::TTime time,
                                                                  double shift) {
    for (auto& test : {E_Short, E_Long}) {
Could it just be auto for iterating simple enum values?
This was an oversight from originally iterating over the windows rather than the enum values.
        values.push_back(CBasicStatistics::mean(value));
    }
}
if (shift > 1.4826 * CBasicStatistics::mad(values)) {
The magic number 1.4826 could be a symbolic constant, or at least commented.
This corrects for bias when estimating the standard deviation (for normal data). I've pulled out the constant as you suggest.
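For context, 1.4826 is the standard consistency constant relating the median absolute deviation (MAD) to the standard deviation of normally distributed data, since 1 / Phi^{-1}(3/4) is approximately 1.4826. A minimal self-contained sketch of the robust scale estimate, with hypothetical names (the real code uses CBasicStatistics::mad):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// For Gaussian data, sigma ~= 1.4826 * MAD; naming the constant documents
// why it appears next to the MAD.
const double NORMAL_MAD_TO_SD{1.4826};

double median(std::vector<double> v) {
    std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
    double m{v[v.size() / 2]};
    if (v.size() % 2 == 0) {
        std::nth_element(v.begin(), v.begin() + v.size() / 2 - 1, v.end());
        m = 0.5 * (m + v[v.size() / 2 - 1]);
    }
    return m;
}

// Robust standard deviation estimate: MAD rescaled to be consistent with
// the standard deviation under normality.
double robustSd(const std::vector<double>& values) {
    double m{median(values)};
    std::vector<double> deviations;
    deviations.reserve(values.size());
    for (double x : values) {
        deviations.push_back(std::fabs(x - m));
    }
    return NORMAL_MAD_TO_SD * median(deviations);
}
```

For {1, 2, 3, 4, 5} the median is 3, the absolute deviations are {2, 1, 0, 1, 2} with median 1, so the robust estimate is exactly 1.4826; unlike the sample standard deviation, it barely moves if 5 is replaced by an outlier like 500.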
LGTM
do not clear periodicity test after detecting a changepoint: internal QA testing revealed a problem with deleting the periodicity test after detecting a changepoint, causing a regression on an internal dataset. This fix removes the clearing step from changepoint detection (only a small part of change point detection, see #95), effectively restoring the old behaviour.
This PR does some hardening to change detection based on further testing, particularly for longer bucket lengths. By way of example, the following is the current behaviour on a problematic data set:
[Screenshots: current behaviour at 30m, 1h and 2h bucketing]
After these changes we get:
[Screenshots: behaviour after these changes at 30m, 1h and 2h bucketing]
I have committed these changes incrementally in logical groups, so it may be easier to review them a commit at a time.
The following are made in the last commit:
Making these changes also uncovered an improvement to make to periodicity testing following on from #90. Specifically, we should have been weighting the samples when computing the level for the amplitude test to better deal with outliers. This is made in the first commit.
This can affect results for data with change points, particularly when using longer bucket lengths. Marking as a non-issue since this is a refinement to an unreleased version.