Speed-up JavaStreamQueryEngine with better stream utilization #224

Closed
pjeli opened this issue May 21, 2019 · 7 comments · Fixed by #228

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone: 1.6.3 Release

pjeli commented May 21, 2019

On a particularly large cluster (over 400M files), we reached a point where an NNA installation was unable to keep up with the paired NameNode(s).

2019-05-20T20:37:34.787-0700: 16666.457: [JNI Weak Reference, 0.0015219 secs] (promotion failed): 23321242K->20987827K(23592960K), 33.2502769 secs]2019-05-20T20:38:04.924-0700: 16696.594: [CMS2019-05-20T20:38:43.761-0700: 16735.433: [CMS-concurrent-sweep: 287.298/354.271 secs] [Times: user=3363.42 sys=378.42, real=354.28 secs] (concurrent mode failure)
2019-05-20T20:58:25.248-0700: 17916.919: [JNI Weak Reference, 0.0023017 secs]: 214848412K->180770026K(232783872K), 1804.9779061 secs] 238169655K->180770026K(256376832K), [Metaspace: 46020K->46020K(49152K)], 1838.2525178 secs] [Times: user=1211.75 sys=331.76, real=1838.25 secs]

This NNA installation was able to digest the FsImage and stay up to date, but upon starting its background analysis threads it would run into massive full GCs that would take the NNA instance out of service for nearly 30 minutes.

The main culprit here is that the analysis sequence performed by the SuggestionEngine class keeps large result HashMaps in memory for the entire duration of the analysis. I initially calculated these maps to have a small memory footprint, but with the analyses that have been added over time, and the rather expensive directory maps in particular, they have finally overwhelmed the NNA instance.

There are several things we can do to mitigate this:
(1) - Put the resulting maps from queries directly into a temporary file cache (via the CacheManager class). During the sync-switch we can simply flip one file cache for another; a sketch of this idea follows below.
(2) - Put the resulting maps from queries directly into the main suggestion cache, thereby overwriting the current values immediately. This could cause numbers to not line up neatly until the entire processing run is done.

Option 1 would likely provide the best assurance that the numbers add up correctly. Option 2 would free the most memory and disk space. I am open to other suggestions as well.
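
A minimal, self-contained sketch of the cache-flip idea behind option (1). The FlippableCache class below is only an illustration and not the real CacheManager API (which would back the temporary map with a file):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.atomic.AtomicReference;

  // Stand-in for the idea behind option (1): queries build their results into a
  // fresh map off to the side, and the sync-switch simply flips the reference,
  // so readers never observe a half-built result and the old map can be freed.
  public class FlippableCache {
    private final AtomicReference<Map<String, Long>> live =
        new AtomicReference<>(new ConcurrentHashMap<>());

    public Map<String, Long> read() {
      return live.get();
    }

    // Called at sync-switch time: swap in the freshly built map, return the old one.
    public Map<String, Long> flip(Map<String, Long> freshlyBuilt) {
      return live.getAndSet(freshlyBuilt);
    }
  }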

pjeli added the "enhancement" and "help wanted" labels on May 21, 2019
pjeli commented May 21, 2019

There are some further memory optimizations available as well:
(3) - Making use of the NameNode's SerialNumberManager instead of producing our own String<->id mappings.
(4) - Reducing the creation of intermediate Java Collections (illustrated in the sketch below).
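
To illustrate what (4) is getting at, here is a contrived before/after sketch (the method and variable names are made up for the example and are not NNA code):

  import java.util.List;
  import java.util.stream.Collectors;

  public class IntermediateCollections {
    // Before: the filter step materializes a full intermediate collection,
    // which drives up garbage-collection pressure on very large file sets.
    static long countLargeFilesWithCopies(List<Long> fileSizes) {
      List<Long> large = fileSizes.stream()
          .filter(size -> size > 1_000_000L)
          .collect(Collectors.toList()); // intermediate List we never needed
      return large.size();
    }

    // After: one stream pipeline, no intermediate collections created.
    static long countLargeFiles(List<Long> fileSizes) {
      return fileSizes.stream().filter(size -> size > 1_000_000L).count();
    }
  }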

pjeli added a commit to pjeli/NNAnalytics that referenced this issue May 22, 2019
pjeli commented May 23, 2019

After doing some careful testing, I have discovered that (4) is more on the right track, producing faster results with less memory usage.

By performing the histogram logic via Streams, with no intermediate objects in between, we can greatly speed up filtering and binning, to nearly 2-3x the prior performance.

  /**
   * Bins the INode stream by the key produced by namingFunction and sums the
   * values produced by longFunction for each bin, in a single pass with no
   * intermediate collections.
   */
  public Map<String, Long> genericSummingHistogram(Stream<INode> inodes,
      Function<INode, String> namingFunction,
      Function<INode, Long> longFunction) {
    return inodes.collect(Collectors.groupingBy(namingFunction,
            Collectors.mapping(longFunction, Collectors.summingLong(i -> i))));
  }
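
As a usage illustration, the same collector pattern can be exercised with a stand-in type in place of INode (the FakeInode record and its accessors below are hypothetical, only there to make the snippet self-contained and runnable):

  import java.util.Map;
  import java.util.function.Function;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class HistogramSketch {
    // Stand-in for INode, purely for illustration.
    record FakeInode(String user, long fileSize) {}

    static Map<String, Long> summingHistogram(Stream<FakeInode> inodes,
        Function<FakeInode, String> namingFunction,
        Function<FakeInode, Long> longFunction) {
      return inodes.collect(Collectors.groupingBy(namingFunction,
          Collectors.mapping(longFunction, Collectors.summingLong(i -> i))));
    }

    public static void main(String[] args) {
      Stream<FakeInode> inodes = Stream.of(
          new FakeInode("alice", 100L),
          new FakeInode("alice", 250L),
          new FakeInode("bob", 40L));
      // Prints the per-user byte totals, e.g. {bob=40, alice=350} (map order is unspecified).
      System.out.println(summingHistogram(inodes, FakeInode::user, FakeInode::fileSize));
    }
  }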

Testing this with about 1M files on my own local machine led to some remarkable speed-ups.
At this point this seems like the right way to go.

pjeli changed the title from "Finer-grained analysis" to "Speed-up JavaStreamQueryEngine with better stream utilization" on May 23, 2019
pjeli added this to the 1.6.3 Release milestone on May 23, 2019
pjeli added a commit to pjeli/NNAnalytics that referenced this issue May 23, 2019
pjeli added a commit to pjeli/NNAnalytics that referenced this issue May 23, 2019
pjeli commented May 24, 2019

Wow, I am already seeing vast improvements on one of our largest NNA instances. This change seems to live up to its 2-3x speed-up. Comparing against the /directories timings from #191, I clocked a computation of SuggestionsEngine.directories at:

Performing SuggestionsEngine.directories took: 155922 ms

That is about 2.5x faster, even with more files to scan than in the prior timing.

pjeli commented May 24, 2019

The performance improvements to most histograms are in the realm of 6-10x. I clocked a FileType histogram for 400M files at about 7-10 seconds. This is much, much faster than before.

More testing needs to be done but this will be a substantial improvement to NNA.

pjeli commented May 27, 2019

I believe the latest PR #228 is primed for major gains in memory savings and performance. The biggest time cost, the directories and cached directories queries in SuggestionEngine, have been made much, much faster and no longer need to retain their own INode sets. They will now produce ContentSummary objects instead, which detail exactly what we want per directory.
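
As a rough sketch of the per-directory summary idea (FakeFile below is a stand-in for a file INode, and java.util.LongSummaryStatistics plays the role of the ContentSummary object; neither is the actual NNA/Hadoop code):

  import java.util.List;
  import java.util.LongSummaryStatistics;
  import java.util.Map;
  import java.util.stream.Collectors;

  public class DirectorySummaries {
    // Stand-in for a file INode, purely illustrative.
    record FakeFile(String parentDir, long length) {}

    // One pass over the files: group by parent directory and reduce each group
    // straight into a small summary (count + total length), so no per-directory
    // set of INodes is ever retained in memory.
    static Map<String, LongSummaryStatistics> summarizePerDirectory(List<FakeFile> files) {
      return files.stream().collect(Collectors.groupingBy(
          FakeFile::parentDir,
          Collectors.summarizingLong(FakeFile::length)));
    }
  }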

I just need to test a bit more, and then we can move on to committing and cutting a release for this.

pjeli commented May 27, 2019

This is the snippet from the latest run on a giant cluster: 400M files analyzed for various attributes in 25 minutes. I suspect we can do even better and possibly achieve 15-20 minutes.

Performing SuggestionsEngine.capacity took: 9019 ms.
Performing SuggestionsEngine.fileAges took: 52482 ms.
Performing SuggestionsEngine.users took: 3004 ms.
Performing SuggestionsEngine.diskspace took: 10945 ms.
Performing SuggestionsEngine.files24hr took: 3848 ms.
Performing SuggestionsEngine.files1yr2yr took: 16727 ms.
Performing SuggestionsEngine.perUserCount took: 6459 ms.
Performing SuggestionsEngine.directories took: 755828 ms.
Performing SuggestionsEngine.systemFilter took: 38126 ms.
Performing SuggestionsEngine.system24hr took: 1639 ms.
Performing SuggestionsEngine.system1yr took: 2317 ms.
Performing SuggestionsEngine.systemCount took: 4720 ms.
Performing SuggestionsEngine.perUserFilter took: 4957 ms.
Performing SuggestionsEngine.perUser24h took: 108 ms.
Performing SuggestionsEngine.perUser1yr took: 310 ms.
Performing SuggestionsEngine.perUserMem took: 5805 ms.
Performing SuggestionsEngine.perUserDs took: 2956 ms.
Performing SuggestionsEngine.directories24h took: 9128 ms.
Performing SuggestionsEngine.cachedQuotas took: 536554 ms.
Performing SuggestionsEngine.cachedLogins took: 76842 ms.
Performing SuggestionsEngine.cachedQueries took: 2 ms.

Memory usage was also greatly reduced. This particular instance is only clocking 220G of heap at max usage right now, despite being allocated 250G. It was previously crashing with OOMEs, as mentioned earlier.

pjeli commented May 27, 2019

These are the new numbers as of the latest PR.

Performing SuggestionsEngine.capacity took: 14964 ms.
Performing SuggestionsEngine.fileAges took: 66038 ms.
Performing SuggestionsEngine.users took: 3694 ms.
Performing SuggestionsEngine.diskspace took: 10446 ms.
Performing SuggestionsEngine.files24hr took: 3398 ms.
Performing SuggestionsEngine.files1yr2yr took: 8296 ms.
Performing SuggestionsEngine.perUserCount took: 3899 ms.
Performing SuggestionsEngine.directories took: 296328 ms.
Performing SuggestionsEngine.systemFilter took: 37615 ms.
Performing SuggestionsEngine.system24hr took: 5634 ms.
Performing SuggestionsEngine.system1yr took: 4921 ms.
Performing SuggestionsEngine.systemCount took: 3457 ms.
Performing SuggestionsEngine.perUserFilter took: 1843 ms.
Performing SuggestionsEngine.perUser24h took: 100 ms.
Performing SuggestionsEngine.perUser1yr took: 318 ms.
Performing SuggestionsEngine.perUserMem took: 5376 ms.
Performing SuggestionsEngine.perUserDs took: 2913 ms.
Performing SuggestionsEngine.directories24h took: 82775 ms.
Performing SuggestionsEngine.cachedQuotas took: 532306 ms.
Performing SuggestionsEngine.cachedLogins took: 27460 ms.
Performing SuggestionsEngine.cachedQueries took: 3 ms.

Shaved off another 4 minutes. 400M files done in 18 minutes now.

I believe the directories scan is as optimized as it can be. The larger targets now are fileAges and cachedQuotas, which I believe we can do more on.
