Speed-up JavaStreamQueryEngine with better stream utilization #224

Closed
pjeli opened this issue May 21, 2019 · 7 comments · Fixed by #228

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone: 1.6.3 Release

pjeli commented May 21, 2019

On a particularly large cluster (over 400M files), we reached a point where an NNA installation was unable to keep up with the paired NameNode(s).

2019-05-20T20:37:34.787-0700: 16666.457: [JNI Weak Reference, 0.0015219 secs] (promotion failed): 23321242K->20987827K(23592960K), 33.2502769 secs]2019-05-20T20:38:04.924-0700: 16696.594: [CMS2019-05-20T20:38:43.761-0700: 16735.433: [CMS-concurrent-sweep: 287.298/354.271 secs] [Times: user=3363.42 sys=378.42, real=354.28 secs] (concurrent mode failure)
2019-05-20T20:58:25.248-0700: 17916.919: [JNI Weak Reference, 0.0023017 secs]: 214848412K->180770026K(232783872K), 1804.9779061 secs] 238169655K->180770026K(256376832K), [Metaspace: 46020K->46020K(49152K)], 1838.2525178 secs] [Times: user=1211.75 sys=331.76, real=1838.25 secs]

This NNA installation was able to digest the FsImage and stay up to date, but upon starting its background analysis threads it would run into massive full GCs that would take the NNA instance out of service for nearly 30 minutes.

The main culprit here is that the analysis sequence performed by the SuggestionEngine class keeps large result HashMaps in memory for the entire duration of the analysis. I initially calculated these maps to have a small memory footprint, but with the analyses that have been added over time, and the rather expensive directory maps in particular, they have finally overwhelmed the NNA instance.

There are several things we can do to mitigate this:
(1) - Put the resulting maps from queries directly into a temporary file cache (via the CacheManager class). During the sync-switch we can simply flip one file cache for another; a sketch of this idea follows below.
(2) - Put the resulting maps from queries directly into the main suggestion cache, thereby overwriting the current values immediately. This could cause numbers to not line up neatly until the entire processing run is done.

Option 1 would likely provide the best assurance that the numbers add up correctly. Option 2 would free the most memory and disk space. I am open to other suggestions as well.
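
A minimal, self-contained sketch of the cache-flip idea behind option (1). The FlippableCache class below is only an illustration and not the real CacheManager API (which would back the temporary map with a file):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.atomic.AtomicReference;

  // Stand-in for the idea behind option (1): queries build their results into a
  // fresh map off to the side, and the sync-switch simply flips the reference,
  // so readers never observe a half-built result and the old map can be freed.
  public class FlippableCache {
    private final AtomicReference<Map<String, Long>> live =
        new AtomicReference<>(new ConcurrentHashMap<>());

    public Map<String, Long> read() {
      return live.get();
    }

    // Called at sync-switch time: swap in the freshly built map, return the old one.
    public Map<String, Long> flip(Map<String, Long> freshlyBuilt) {
      return live.getAndSet(freshlyBuilt);
    }
  }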

pjeli added the "enhancement" and "help wanted" labels on May 21, 2019
pjeli commented May 21, 2019

There are some further memory optimizations available as well:
(3) - Making use of the NameNode's SerialNumberManager instead of producing our own String<->id mappings.
(4) - Reducing the creation of intermediate Java Collections (illustrated in the sketch below).
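
To illustrate what (4) is getting at, here is a contrived before/after sketch (the method and variable names are made up for the example and are not NNA code):

  import java.util.List;
  import java.util.stream.Collectors;

  public class IntermediateCollections {
    // Before: the filter step materializes a full intermediate collection,
    // which drives up garbage-collection pressure on very large file sets.
    static long countLargeFilesWithCopies(List<Long> fileSizes) {
      List<Long> large = fileSizes.stream()
          .filter(size -> size > 1_000_000L)
          .collect(Collectors.toList()); // intermediate List we never needed
      return large.size();
    }

    // After: one stream pipeline, no intermediate collections created.
    static long countLargeFiles(List<Long> fileSizes) {
      return fileSizes.stream().filter(size -> size > 1_000_000L).count();
    }
  }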

pjeli added a commit to pjeli/NNAnalytics that referenced this issue May 22, 2019
pjeli commented May 23, 2019

After doing some careful testing, I have discovered that (4) is more on the right track, producing faster results with less memory usage.

By performing the histogram logic via Streams, with no intermediate objects in between, we can greatly speed up filtering and binning, to nearly 2-3x the prior performance.

  /**
   * Bins the INode stream by the key produced by namingFunction and sums the
   * values produced by longFunction for each bin, in a single pass with no
   * intermediate collections.
   */
  public Map<String, Long> genericSummingHistogram(Stream<INode> inodes,
      Function<INode, String> namingFunction,
      Function<INode, Long> longFunction) {
    return inodes.collect(Collectors.groupingBy(namingFunction,
            Collectors.mapping(longFunction, Collectors.summingLong(i -> i))));
  }
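
As a usage illustration, the same collector pattern can be exercised with a stand-in type in place of INode (the FakeInode record and its accessors below are hypothetical, only there to make the snippet self-contained and runnable):

  import java.util.Map;
  import java.util.function.Function;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class HistogramSketch {
    // Stand-in for INode, purely for illustration.
    record FakeInode(String user, long fileSize) {}

    static Map<String, Long> summingHistogram(Stream<FakeInode> inodes,
        Function<FakeInode, String> namingFunction,
        Function<FakeInode, Long> longFunction) {
      return inodes.collect(Collectors.groupingBy(namingFunction,
          Collectors.mapping(longFunction, Collectors.summingLong(i -> i))));
    }

    public static void main(String[] args) {
      Stream<FakeInode> inodes = Stream.of(
          new FakeInode("alice", 100L),
          new FakeInode("alice", 250L),
          new FakeInode("bob", 40L));
      // Prints the per-user byte totals, e.g. {bob=40, alice=350} (map order is unspecified).
      System.out.println(summingHistogram(inodes, FakeInode::user, FakeInode::fileSize));
    }
  }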

Testing this with about 1M files on my own local machine led to some remarkable speed-ups.
At this point this seems like the right way to go.

pjeli changed the title from "Finer-grained analysis" to "Speed-up JavaStreamQueryEngine with better stream utilization" on May 23, 2019
pjeli added this to the 1.6.3 Release milestone on May 23, 2019
pjeli added a commit to pjeli/NNAnalytics that referenced this issue May 23, 2019
pjeli added a commit to pjeli/NNAnalytics that referenced this issue May 23, 2019
pjeli commented May 24, 2019

Wow, I am already seeing vast improvements on one of our largest NNA instances. This change seems to live up to its 2-3x speed-up. Comparing against the /directories timings from #191, I clocked a computation of SuggestionsEngine.directories at:

Performing SuggestionsEngine.directories took: 155922 ms

That is about 2.5x faster, even with more files to scan than in the prior timing.

pjeli commented May 24, 2019

The performance improvements to most histograms are in the realm of 6-10x. I clocked a FileType histogram for 400M files at about 7-10 seconds. This is much, much faster than before.

More testing needs to be done but this will be a substantial improvement to NNA.

pjeli commented May 27, 2019

I believe the latest PR #228 is primed for major gains in memory savings and performance. The biggest time cost, the directories and cached directories queries in SuggestionEngine, have been made much, much faster and no longer need to retain their own INode sets. They will now produce ContentSummary objects instead, which detail exactly what we want per directory.
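
As a rough sketch of the per-directory summary idea (FakeFile below is a stand-in for a file INode, and java.util.LongSummaryStatistics plays the role of the ContentSummary object; neither is the actual NNA/Hadoop code):

  import java.util.List;
  import java.util.LongSummaryStatistics;
  import java.util.Map;
  import java.util.stream.Collectors;

  public class DirectorySummaries {
    // Stand-in for a file INode, purely illustrative.
    record FakeFile(String parentDir, long length) {}

    // One pass over the files: group by parent directory and reduce each group
    // straight into a small summary (count + total length), so no per-directory
    // set of INodes is ever retained in memory.
    static Map<String, LongSummaryStatistics> summarizePerDirectory(List<FakeFile> files) {
      return files.stream().collect(Collectors.groupingBy(
          FakeFile::parentDir,
          Collectors.summarizingLong(FakeFile::length)));
    }
  }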

I just need to test a bit more, and then we can move on to committing and cutting a release for this.

pjeli commented May 27, 2019

This is the snippet from the latest run on a giant cluster: 400M files analyzed for various attributes in 25 minutes. I suspect we can do even better and possibly achieve 15-20 minutes.

Performing SuggestionsEngine.capacity took: 9019 ms.
Performing SuggestionsEngine.fileAges took: 52482 ms.
Performing SuggestionsEngine.users took: 3004 ms.
Performing SuggestionsEngine.diskspace took: 10945 ms.
Performing SuggestionsEngine.files24hr took: 3848 ms.
Performing SuggestionsEngine.files1yr2yr took: 16727 ms.
Performing SuggestionsEngine.perUserCount took: 6459 ms.
Performing SuggestionsEngine.directories took: 755828 ms.
Performing SuggestionsEngine.systemFilter took: 38126 ms.
Performing SuggestionsEngine.system24hr took: 1639 ms.
Performing SuggestionsEngine.system1yr took: 2317 ms.
Performing SuggestionsEngine.systemCount took: 4720 ms.
Performing SuggestionsEngine.perUserFilter took: 4957 ms.
Performing SuggestionsEngine.perUser24h took: 108 ms.
Performing SuggestionsEngine.perUser1yr took: 310 ms.
Performing SuggestionsEngine.perUserMem took: 5805 ms.
Performing SuggestionsEngine.perUserDs took: 2956 ms.
Performing SuggestionsEngine.directories24h took: 9128 ms.
Performing SuggestionsEngine.cachedQuotas took: 536554 ms.
Performing SuggestionsEngine.cachedLogins took: 76842 ms.
Performing SuggestionsEngine.cachedQueries took: 2 ms.

Memory usage was also greatly reduced. This particular instance is only clocking 220G of heap at max usage right now, despite being allocated 250G. It was previously crashing with OOMEs, as mentioned earlier.

pjeli commented May 27, 2019

These are the new numbers as of the latest PR.

Performing SuggestionsEngine.capacity took: 14964 ms.
Performing SuggestionsEngine.fileAges took: 66038 ms.
Performing SuggestionsEngine.users took: 3694 ms.
Performing SuggestionsEngine.diskspace took: 10446 ms.
Performing SuggestionsEngine.files24hr took: 3398 ms.
Performing SuggestionsEngine.files1yr2yr took: 8296 ms.
Performing SuggestionsEngine.perUserCount took: 3899 ms.
Performing SuggestionsEngine.directories took: 296328 ms.
Performing SuggestionsEngine.systemFilter took: 37615 ms.
Performing SuggestionsEngine.system24hr took: 5634 ms.
Performing SuggestionsEngine.system1yr took: 4921 ms.
Performing SuggestionsEngine.systemCount took: 3457 ms.
Performing SuggestionsEngine.perUserFilter took: 1843 ms.
Performing SuggestionsEngine.perUser24h took: 100 ms.
Performing SuggestionsEngine.perUser1yr took: 318 ms.
Performing SuggestionsEngine.perUserMem took: 5376 ms.
Performing SuggestionsEngine.perUserDs took: 2913 ms.
Performing SuggestionsEngine.directories24h took: 82775 ms.
Performing SuggestionsEngine.cachedQuotas took: 532306 ms.
Performing SuggestionsEngine.cachedLogins took: 27460 ms.
Performing SuggestionsEngine.cachedQueries took: 3 ms.

Shaved off another 4 minutes. 400M files done in 18 minutes now.

I believe the directories scan is as optimized as it can be. The larger targets now are fileAges and cachedQuotas, which I believe we can do more on.
