[ML] Improve partition analysis memory usage #97


Merged: 6 commits merged into elastic:master from the enhancement/improve-partition-memory branch on May 22, 2018

Conversation

@tveasey (Contributor) commented May 17, 2018

When we partition an analysis using "partition" rather than "by", we pay up to a 100% increase in memory usage per partition of the data. A major contributor to this is the cost of the interim bucket corrector.

We estimate the count we expect in a bucket and correct its statistics for interim results if the count is less than expected (to account for missing data). For by analysis we use one estimate for the bucket as a whole; for partition analysis we use one estimate for each partition. This ends up contributing around 60% of the extra memory cost, since the size of the model for the expected count is relatively large.
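To make the mechanism concrete, here is a rough sketch, deliberately simplified and not the actual CInterimBucketCorrector code (a running mean stands in for the expected count model), of how an expected bucket count yields a completeness estimate that can be used to scale up an interim bucket's statistics:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch only: the real corrector models the expected count with
// a proper time series model and corrects each feature appropriately. Here a
// running mean stands in for the expected count.
class InterimCorrectorSketch {
public:
    // Update the expected count model with a finalised bucket's count.
    void update(std::size_t bucketCount) {
        m_MeanCount += (static_cast<double>(bucketCount) - m_MeanCount) /
                       static_cast<double>(++m_NumberBuckets);
    }

    // Fraction of the bucket we believe we have seen so far, in [0, 1].
    double completeness(std::size_t observedCount) const {
        return m_MeanCount > 0.0
                   ? std::min(static_cast<double>(observedCount) / m_MeanCount, 1.0)
                   : 1.0;
    }

    // Correct an additive statistic (e.g. a partial sum) of an interim bucket
    // up to the value we expect once the bucket completes.
    double correct(double partialSum, std::size_t observedCount) const {
        double c = this->completeness(observedCount);
        return c > 0.0 ? partialSum / c : partialSum;
    }

private:
    double m_MeanCount = 0.0;
    std::size_t m_NumberBuckets = 0;
};
```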

I think, from a QoR perspective, estimating the overall bucket count is actually likely to be more effective, since partitions are likely to have significantly greater variance in their counts, so any prediction of their expected count would carry significantly larger uncertainty. However, the real win is the reduced memory usage. Since any correction we perform is approximate (we don't know what data we'll subsequently receive), I think we must change our strategy here.

I get a pretty consistent relative memory improvement for a range of partition cardinalities and data characteristics:

| # partitions | Old model memory | New model memory | % improvement |
|---:|---:|---:|---:|
| 10 | 1.4MB | 1MB | 28.6 |
| 20 | 1MB | 0.72MB | 28 |
| 78 | 4.1MB | 3MB | 26.8 |
| 16925 | 559MB | 397MB | 28.9 |

The basic strategy is to move to one shared interim bucket corrector, which is only updated by the (single) counting model. (Note that there is a very small error in the attributed model memory, since there is a contribution of 1 to the reference count from the model config class which is not accounted for in the model memory calculation. In the case where the estimate is important, i.e. many partitions of the data, this error is small, so I prefer to keep the implementation clean.) I don't think it is worthwhile trying to upgrade this part of the model state: it is good enough to reinitialise the interim bucket corrector state when the model is upgraded, so I haven't bothered to make CInterimBucketCorrector restore backwards compatible, as we won't try to restore these objects.
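As a rough illustration of the intended arrangement, here is a heavily simplified sketch with illustrative names rather than the real ml-cpp classes: one corrector instance for the whole job, written only by the counting model and read by everything else.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// The single shared corrector; its internals are irrelevant here.
class SharedInterimCorrector {
public:
    void update(std::size_t /*bucketCount*/) { /* train the expected count model */ }
    double completeness() const { return 1.0; /* placeholder estimate */ }
};

class Model {
public:
    explicit Model(std::shared_ptr<const SharedInterimCorrector> corrector)
        : m_Corrector{std::move(corrector)} {}
    virtual ~Model() = default;

    virtual void sampleBucketStatistics() {
        // Ordinary models only read the shared completeness estimate when
        // correcting their interim feature values.
        double completeness = m_Corrector->completeness();
        (void)completeness;
    }

private:
    std::shared_ptr<const SharedInterimCorrector> m_Corrector;
};

class CountingModel : public Model {
public:
    explicit CountingModel(std::shared_ptr<SharedInterimCorrector> corrector)
        : Model{corrector}, m_Writable{std::move(corrector)} {}

    void sampleBucketStatistics() override {
        // The (single) counting model is the only writer.
        m_Writable->update(m_BucketCount);
    }

private:
    std::shared_ptr<SharedInterimCorrector> m_Writable;
    std::size_t m_BucketCount = 0;
};

int main() {
    // In the PR the model config owns the corrector and the factories hand
    // out references to it when creating models.
    auto corrector = std::make_shared<SharedInterimCorrector>();
    std::vector<std::unique_ptr<Model>> models;
    models.push_back(std::make_unique<CountingModel>(corrector));
    models.push_back(std::make_unique<Model>(corrector));
    for (auto& model : models) {
        model->sampleBucketStatistics();
    }
}
```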

This will affect interim results for partition analysis. Also, since it changes the model memory significantly, I would expect to get different results on jobs which hit the hard memory limit. Otherwise, the results should be unchanged.

I split the change into two commits, functional changes and test changes, since the mechanical changes needed to the tests were reasonably large. Also, since I was forced to update quite a lot of code anyway, I took the opportunity to try and streamline model creation in the affected unit tests. There is still quite a lot of copy-paste boilerplate code here, which would benefit from further refactoring, but that can be the subject of a separate PR.


//! Updates the model of the bucket count with a new measurement
void update(core_t::TTime time, std::size_t bucketCount);
//! Get an estimate of the current bucket completeness.

This comment was marked as resolved.

@@ -280,6 +285,10 @@ void CModelFactory::swap(CModelFactory& other) {
m_InfluenceCalculatorCache.swap(other.m_InfluenceCalculatorCache);
}

CModelFactory::TInterimBucketCorrectorPtr CModelFactory::interimBucketCorrector() const {
return TInterimBucketCorrectorPtr{m_InterimBucketCorrector};

throws if m_InterimBucketCorrector (weak ptr) is gone

@tveasey (Contributor, Author) replied:

As below, but maybe at least warrants a comment.

//! A reference to the singleton interim bucket correction calculator.
//!
//! \note That this is stored by weak pointer since we don't want this
//! to update the reference count or our memory accounting will be wrong.

I have bad feelings about the usage of weak pointers here, just for the purpose of working around memory accounting problems. I know this is a difficult problem, but we should rather work on the root cause than introduce such a workaround. (Although I guess that for this case it's benign: I guess the logic ensures that the pointer never becomes invalid before this class gets destructed.)

@tveasey (Contributor, Author) commented May 18, 2018:

The ownership model of this object is actually very simple. The model config object, which really "owns it", is created early in main and is never deleted. So there is no chance that this goes away. All references to this could just as well be bare pointers, including those in the models. However, the reference counting is a convenient mechanism for dividing up the attribution of the memory of the interim bucket corrector among the different models. This means in memory debug we get sensible attribution of the memory to each individual model.

We don't have much precedent for globally owned state like this. One example is the model params, which are owned by the model factories, which are themselves owned by the model config and held by reference in the individual models, so it would be equally dangerous if that object were deleted. I could equally hold this by reference to the single instance of the shared pointer here, but I don't feel that is an improvement. I thought about different ownership strategies, but didn't come up with a better one.

I guess as a minimum I should document this better in the code, including the assumption of any code which uses the factories that they own the interim bucket corrector. But definitely willing to consider alternative suggestions.
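For illustration, a minimal sketch of the accounting argument (illustrative names and sizes, not the real classes): because each model divides the corrector's footprint by the shared pointer's use count, the factory's own reference must not contribute to that count, which is what holding it by weak pointer achieves.

```cpp
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

struct Corrector {
    std::size_t memoryUsage() const { return 4096; } // pretend footprint in bytes
};

struct Model {
    std::shared_ptr<const Corrector> s_Corrector;
    std::size_t memoryUsage() const {
        // This model's share of the shared corrector's memory.
        return s_Corrector->memoryUsage() /
               static_cast<std::size_t>(s_Corrector.use_count());
    }
};

int main() {
    // The model config's owning copy: this is the deliberately ignored
    // off-by-one in the attribution mentioned in the PR description.
    auto corrector = std::make_shared<const Corrector>();

    // The factory keeps a non-owning reference so it doesn't affect use_count().
    std::weak_ptr<const Corrector> factoryReference{corrector};

    std::vector<Model> models(4, Model{corrector});
    for (const auto& model : models) {
        // 4096 / 5: four models plus the config's copy share the footprint.
        std::cout << model.memoryUsage() << '\n';
    }
    std::cout << std::boolalpha << factoryReference.expired() << '\n'; // false
}
```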

@tveasey (Contributor, Author) added:

An alternative, though I'm not sure I prefer it, would be to make something like a non-owning counted reference object. We understand that such an object doesn't imply shared ownership, so it doesn't interfere with the referred-to object's lifecycle, but it does maintain a count of the references which have been created. We'd need a weak version of this as well with similar semantics, but which doesn't throw when upgrading to a full reference. (This is awfully like a shared_ptr/weak_ptr, so I'm not sure I want to go to the effort of writing this class.)
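Purely to illustrate the idea, a hypothetical sketch of such a non-owning counted reference might look like the following (nothing like this exists in the codebase, per the comment above):

```cpp
#include <memory>

// Hypothetical: never deletes the referee, so it can't interfere with the
// referee's lifecycle, but copies of it are counted so the referee's memory
// can still be apportioned among the holders.
template<typename T>
class CNonOwningCountedRef {
public:
    explicit CNonOwningCountedRef(T& referee)
        : m_Referee{&referee}, m_Count{std::make_shared<char>()} {}

    T& get() const { return *m_Referee; }

    // Number of live copies of this reference, usable for dividing up the
    // referee's memory in accounting, much like shared_ptr::use_count().
    long referenceCount() const { return m_Count.use_count(); }

private:
    T* m_Referee;                   // non-owning: never deleted by us
    std::shared_ptr<char> m_Count;  // exists purely to count copies
};
```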

@dimitris-athanasiou (Contributor) left a comment:

This is a great one, Tom! I just have a clarification question.

//! Updates the model of the bucket count with a new measurement
void update(core_t::TTime time, std::size_t bucketCount);
//! Get an estimate of the current bucket completeness.
double completeness() const;

In the case of latency, this will return the completeness of the last bucket for which estimateBucketCompleteness was called, right?

@tveasey (Contributor, Author) replied:

That's right. This is just a cache of the result of the completeness calculation, which is populated in CCountingModel::sampleBucketStatistics. We should call all these sample methods before using the completeness estimate, and they are always called when getting interim results, so this is really just to avoid calling estimateBucketCompleteness for each individual model only to compute the same value.

This is actually a convenient optimisation, because I noticed that evaluating the trend cropped up high when profiling recently, i.e. #73.
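Roughly, the caching works like the following sketch (names assumed from the quoted header, not the exact ml-cpp code): the counting model refreshes the cache once per bucket, and every other model then reads the same value via the cheap accessor.

```cpp
#include <algorithm>
#include <cstddef>

// With latency, completeness() therefore reflects the last bucket for which
// estimateBucketCompleteness was called.
class CorrectorWithCachedCompleteness {
public:
    // Potentially expensive: consults the expected bucket count model.
    double estimateBucketCompleteness(std::size_t observedCount) {
        // Pretend expected-count model: assume 100 events per bucket.
        m_CachedCompleteness =
            std::min(1.0, static_cast<double>(observedCount) / 100.0);
        return m_CachedCompleteness;
    }

    // Cheap cached read used by all the non-counting models.
    double completeness() const { return m_CachedCompleteness; }

private:
    double m_CachedCompleteness = 1.0;
};
```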

@tveasey force-pushed the enhancement/improve-partition-memory branch from d89df48 to a580ae4 on May 18, 2018 09:50
@tveasey force-pushed the enhancement/improve-partition-memory branch from a580ae4 to 52868ba on May 18, 2018 12:48
@tveasey force-pushed the enhancement/improve-partition-memory branch from a3c3451 to a2a41af on May 18, 2018 14:01
@hendrikmuhs left a comment:

LGTM

@tveasey force-pushed the enhancement/improve-partition-memory branch from f7eb8c4 to 5e5d9b5 on May 22, 2018 10:18
@dimitris-athanasiou (Contributor) left a comment:

LGTM2

@tveasey removed the request for review from droberts195 on May 22, 2018 11:05
@tveasey merged commit e0d176e into elastic:master on May 22, 2018
tveasey added a commit that referenced this pull request May 22, 2018
Move to one interim bucket corrector, which is only updated by the (single) counting model,
and is shared by all models.
@sophiec20 added :ml and removed :ml labels on Jun 12, 2018