[ML] Improve partition analysis memory usage #97


Merged: 6 commits merged into elastic:master from the enhancement/improve-partition-memory branch on May 22, 2018

Conversation

@tveasey (Contributor) commented May 17, 2018

When we partition an analysis using "partition" rather than "by", we pay up to a 100% increase in memory usage per partition of the data. A major contributor to this is the cost of the interim bucket corrector.

We estimate the count we expect in a bucket and correct its statistics for interim results if the count is less than expected (to account for missing data). For by analysis we use one estimate for the bucket as a whole; for partition analysis we use one estimate for each partition. This ends up contributing around 60% of the extra memory cost, since the size of the model for the expected count is relatively large.
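To make the mechanism concrete, here is a rough sketch, deliberately simplified and not the actual CInterimBucketCorrector code (a running mean stands in for the expected count model), of how an expected bucket count yields a completeness estimate that can be used to scale up an interim bucket's statistics:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch only: the real corrector models the expected count with
// a proper time series model and corrects each feature appropriately. Here a
// running mean stands in for the expected count.
class InterimCorrectorSketch {
public:
    // Update the expected count model with a finalised bucket's count.
    void update(std::size_t bucketCount) {
        m_MeanCount += (static_cast<double>(bucketCount) - m_MeanCount) /
                       static_cast<double>(++m_NumberBuckets);
    }

    // Fraction of the bucket we believe we have seen so far, in [0, 1].
    double completeness(std::size_t observedCount) const {
        return m_MeanCount > 0.0
                   ? std::min(static_cast<double>(observedCount) / m_MeanCount, 1.0)
                   : 1.0;
    }

    // Correct an additive statistic (e.g. a partial sum) of an interim bucket
    // up to the value we expect once the bucket completes.
    double correct(double partialSum, std::size_t observedCount) const {
        double c = this->completeness(observedCount);
        return c > 0.0 ? partialSum / c : partialSum;
    }

private:
    double m_MeanCount = 0.0;
    std::size_t m_NumberBuckets = 0;
};
```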

I think, from a QoR perspective, estimating the overall bucket count is actually likely to be more effective, since partitions are likely to have significantly greater variance in their counts, so any prediction of their expected count would carry significantly larger uncertainty. However, the real win is the reduced memory usage. Since any correction we perform is approximate (we don't know what data we'll subsequently receive), I think we must change our strategy here.

I get a pretty consistent relative memory improvement for a range of partition cardinalities and data characteristics:

| # partitions | Old model memory | New model memory | % improvement |
|---:|---:|---:|---:|
| 10 | 1.4MB | 1MB | 28.6 |
| 20 | 1MB | 0.72MB | 28 |
| 78 | 4.1MB | 3MB | 26.8 |
| 16925 | 559MB | 397MB | 28.9 |

The basic strategy is to move to one shared interim bucket corrector, which is only updated by the (single) counting model. (Note that there is a very small error in the attributed model memory, since there is a contribution of 1 to the reference count from the model config class which is not accounted for in the model memory calculation. In the case where the estimate is important, i.e. many partitions of the data, this error is small, so I prefer to keep the implementation clean.) I don't think it is worthwhile trying to upgrade this part of the model state: it is good enough to reinitialise the interim bucket corrector state when the model is upgraded, so I haven't bothered to make CInterimBucketCorrector restore backwards compatible, as we won't try to restore these objects.
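As a rough illustration of the intended arrangement, here is a heavily simplified sketch with illustrative names rather than the real ml-cpp classes: one corrector instance for the whole job, written only by the counting model and read by everything else.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// The single shared corrector; its internals are irrelevant here.
class SharedInterimCorrector {
public:
    void update(std::size_t /*bucketCount*/) { /* train the expected count model */ }
    double completeness() const { return 1.0; /* placeholder estimate */ }
};

class Model {
public:
    explicit Model(std::shared_ptr<const SharedInterimCorrector> corrector)
        : m_Corrector{std::move(corrector)} {}
    virtual ~Model() = default;

    virtual void sampleBucketStatistics() {
        // Ordinary models only read the shared completeness estimate when
        // correcting their interim feature values.
        double completeness = m_Corrector->completeness();
        (void)completeness;
    }

private:
    std::shared_ptr<const SharedInterimCorrector> m_Corrector;
};

class CountingModel : public Model {
public:
    explicit CountingModel(std::shared_ptr<SharedInterimCorrector> corrector)
        : Model{corrector}, m_Writable{std::move(corrector)} {}

    void sampleBucketStatistics() override {
        // The (single) counting model is the only writer.
        m_Writable->update(m_BucketCount);
    }

private:
    std::shared_ptr<SharedInterimCorrector> m_Writable;
    std::size_t m_BucketCount = 0;
};

int main() {
    // In the PR the model config owns the corrector and the factories hand
    // out references to it when creating models.
    auto corrector = std::make_shared<SharedInterimCorrector>();
    std::vector<std::unique_ptr<Model>> models;
    models.push_back(std::make_unique<CountingModel>(corrector));
    models.push_back(std::make_unique<Model>(corrector));
    for (auto& model : models) {
        model->sampleBucketStatistics();
    }
}
```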

This will affect interim results for partition analysis. Also, since it changes the model memory significantly, I would expect to get different results on jobs which hit the hard memory limit. Otherwise, the results should be unchanged.

I split the change into two commits, functional changes and test changes, since the mechanical changes needed to the tests were reasonably large. Also, since I was forced to update quite a lot of code anyway, I took the opportunity to try and streamline model creation in the affected unit tests. There is still quite a lot of copy-paste boilerplate code here, which would benefit from further refactoring, but that can be the subject of a separate PR.


//! Updates the model of the bucket count with a new measurement
void update(core_t::TTime time, std::size_t bucketCount);
//! Get an estimate of the current bucket completeness.

This comment was marked as resolved.

@@ -280,6 +285,10 @@ void CModelFactory::swap(CModelFactory& other) {
m_InfluenceCalculatorCache.swap(other.m_InfluenceCalculatorCache);
}

CModelFactory::TInterimBucketCorrectorPtr CModelFactory::interimBucketCorrector() const {
return TInterimBucketCorrectorPtr{m_InterimBucketCorrector};

throws if m_InterimBucketCorrector (weak ptr) is gone

@tveasey (Contributor, Author) replied:

As below, but maybe at least warrants a comment.

//! A reference to the singleton interim bucket correction calculator.
//!
//! \note That this is stored by weak pointer since we don't want this
//! to update the reference count or our memory accounting will be wrong.

I have bad feelings about the usage of weak pointers here, just for the purpose of working around memory accounting problems. I know this is a difficult problem, but we should rather work on the root cause than introduce such a workaround. (Although I guess that for this case it's benign: I guess the logic ensures that the pointer never becomes invalid before this class gets destructed.)

@tveasey (Contributor, Author) commented May 18, 2018:

The ownership model of this object is actually very simple. The model config object, which really "owns it", is created early in main and is never deleted. So there is no chance that this goes away. All references to this could just as well be bare pointers, including those in the models. However, the reference counting is a convenient mechanism for dividing up the attribution of the memory of the interim bucket corrector among the different models. This means in memory debug we get sensible attribution of the memory to each individual model.

We don't have much precedent for globally owned state like this. One example is the model params, which are owned by the model factories, which are themselves owned by the model config and held by reference in the individual models, so it would be equally dangerous if that object were deleted. I could equally hold this by reference to the single instance of the shared pointer here, but I don't feel that is an improvement. I thought about different ownership strategies, but didn't come up with a better one.

I guess as a minimum I should document this better in the code, including the assumption of any code which uses the factories that they own the interim bucket corrector. But definitely willing to consider alternative suggestions.
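For illustration, a minimal sketch of the accounting argument (illustrative names and sizes, not the real classes): because each model divides the corrector's footprint by the shared pointer's use count, the factory's own reference must not contribute to that count, which is what holding it by weak pointer achieves.

```cpp
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

struct Corrector {
    std::size_t memoryUsage() const { return 4096; } // pretend footprint in bytes
};

struct Model {
    std::shared_ptr<const Corrector> s_Corrector;
    std::size_t memoryUsage() const {
        // This model's share of the shared corrector's memory.
        return s_Corrector->memoryUsage() /
               static_cast<std::size_t>(s_Corrector.use_count());
    }
};

int main() {
    // The model config's owning copy: this is the deliberately ignored
    // off-by-one in the attribution mentioned in the PR description.
    auto corrector = std::make_shared<const Corrector>();

    // The factory keeps a non-owning reference so it doesn't affect use_count().
    std::weak_ptr<const Corrector> factoryReference{corrector};

    std::vector<Model> models(4, Model{corrector});
    for (const auto& model : models) {
        // 4096 / 5: four models plus the config's copy share the footprint.
        std::cout << model.memoryUsage() << '\n';
    }
    std::cout << std::boolalpha << factoryReference.expired() << '\n'; // false
}
```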

@tveasey (Contributor, Author) added:

An alternative, though I'm not sure I prefer it, would be to make something like a non-owning counted reference object. We understand that such an object doesn't imply shared ownership, so it doesn't interfere with the referred-to object's lifecycle, but it does maintain a count of the references which have been created. We'd need a weak version of this as well with similar semantics, but which doesn't throw when upgrading to a full reference. (This is awfully like a shared_ptr/weak_ptr, so I'm not sure I want to go to the effort of writing this class.)
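Purely to illustrate the idea, a hypothetical sketch of such a non-owning counted reference might look like the following (nothing like this exists in the codebase, per the comment above):

```cpp
#include <memory>

// Hypothetical: never deletes the referee, so it can't interfere with the
// referee's lifecycle, but copies of it are counted so the referee's memory
// can still be apportioned among the holders.
template<typename T>
class CNonOwningCountedRef {
public:
    explicit CNonOwningCountedRef(T& referee)
        : m_Referee{&referee}, m_Count{std::make_shared<char>()} {}

    T& get() const { return *m_Referee; }

    // Number of live copies of this reference, usable for dividing up the
    // referee's memory in accounting, much like shared_ptr::use_count().
    long referenceCount() const { return m_Count.use_count(); }

private:
    T* m_Referee;                   // non-owning: never deleted by us
    std::shared_ptr<char> m_Count;  // exists purely to count copies
};
```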

@dimitris-athanasiou (Contributor) left a comment:

This is a great one, Tom! I just have a clarification question.

//! Updates the model of the bucket count with a new measurement
void update(core_t::TTime time, std::size_t bucketCount);
//! Get an estimate of the current bucket completeness.
double completeness() const;

In the case of latency, this will return the completeness of the last bucket for which estimateBucketCompleteness was called, right?

@tveasey (Contributor, Author) replied:

That's right. This is just a cache of the result of the completeness calculation, which is populated in CCountingModel::sampleBucketStatistics. We should call all these sample methods before using the completeness estimate, and they are always called when getting interim results, so this is really just to avoid calling estimateBucketCompleteness for each individual model only to compute the same value.

This is actually a convenient optimisation, because I noticed that evaluating the trend cropped up high when profiling recently, i.e. #73.
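Roughly, the caching works like the following sketch (names assumed from the quoted header, not the exact ml-cpp code): the counting model refreshes the cache once per bucket, and every other model then reads the same value via the cheap accessor.

```cpp
#include <algorithm>
#include <cstddef>

// With latency, completeness() therefore reflects the last bucket for which
// estimateBucketCompleteness was called.
class CorrectorWithCachedCompleteness {
public:
    // Potentially expensive: consults the expected bucket count model.
    double estimateBucketCompleteness(std::size_t observedCount) {
        // Pretend expected-count model: assume 100 events per bucket.
        m_CachedCompleteness =
            std::min(1.0, static_cast<double>(observedCount) / 100.0);
        return m_CachedCompleteness;
    }

    // Cheap cached read used by all the non-counting models.
    double completeness() const { return m_CachedCompleteness; }

private:
    double m_CachedCompleteness = 1.0;
};
```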

@tveasey force-pushed the enhancement/improve-partition-memory branch from d89df48 to a580ae4 on May 18, 2018 09:50
@tveasey force-pushed the enhancement/improve-partition-memory branch from a580ae4 to 52868ba on May 18, 2018 12:48
@tveasey force-pushed the enhancement/improve-partition-memory branch from a3c3451 to a2a41af on May 18, 2018 14:01
@hendrikmuhs left a comment:

LGTM

@tveasey force-pushed the enhancement/improve-partition-memory branch from f7eb8c4 to 5e5d9b5 on May 22, 2018 10:18
@dimitris-athanasiou (Contributor) left a comment:

LGTM2

@tveasey removed the request for review from droberts195 on May 22, 2018 11:05
@tveasey merged commit e0d176e into elastic:master on May 22, 2018
tveasey added a commit that referenced this pull request May 22, 2018
Move to one interim bucket corrector, which is only updated by the (single) counting model,
and is shared by all models.
@sophiec20 added :ml and removed :ml labels on Jun 12, 2018