Speed-up JavaStreamQueryEngine with better stream utilization #224
There are some further memory optimizations available as well:
After doing some careful testing, I have discovered that (4) is the more promising route to faster results with lower memory usage. By performing the histogram logic via Streams, with no intermediate objects in between, we can greatly speed up filtering and binning, to nearly 2-3x the previous performance.
Testing this with about 1M files on my own local machine led to some remarkable speed-ups.
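To illustrate the idea of Stream-based histograms with no intermediate objects, here is a minimal, hypothetical sketch: the filter and the binning happen in a single pass, with the counts accumulated directly by the terminal `collect`. The names (`FileRecord`, `binBySize`, the bucket boundaries) are illustrative assumptions, not NNA's actual API.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamHistogram {

  // Hypothetical stand-in for a file's metadata.
  record FileRecord(String path, long length) {}

  // Single pass: filter + bin + count in one terminal operation,
  // with no intermediate collections materialized in between.
  static Map<String, Long> binBySize(Stream<FileRecord> files) {
    return files
        .filter(f -> f.length() > 0)                 // e.g. skip empty files
        .collect(Collectors.groupingBy(
            f -> f.length() < 1024 ? "tiny"
               : f.length() < 1024 * 1024 ? "small"
               : "large",
            Collectors.counting()));
  }

  public static void main(String[] args) {
    Map<String, Long> histogram = binBySize(Stream.of(
        new FileRecord("/a", 100),
        new FileRecord("/b", 2048),
        new FileRecord("/c", 5_000_000)));
    System.out.println(histogram); // one file lands in each bucket
  }
}
```

The key point is that nothing between `filter` and `collect` allocates a container; the only retained object is the result map itself.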
Wow I am already seeing vast improvements on one of our largest NNA instances. This change seems to live up to its 2-3x speed-up. Just comparing the times of
Which is about 2.5x faster, and with even more files to scan than in the prior timing.
The performance improvements to most histograms are in the realm of 6-10x. I clocked a FileType histogram over 400M files at about 7-10 seconds. This is much, much faster than before. More testing needs to be done, but this will be a substantial improvement to NNA.
I believe the latest PR #228 is primed to deliver major memory savings and performance gains. The biggest time cost, the directories and cached directories queries in SuggestionEngine, has been made much faster, and those queries no longer need to retain their own INode sets. They will now produce ContentSummary objects instead, which detail exactly what we want per directory. I just need to test a bit more, and then we can move on to committing and forging a release for this.
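The shift away from retaining per-directory INode sets can be sketched roughly as follows. Instead of keeping every node, each directory's files are reduced down to a small summary object (file count plus total length), in the spirit of Hadoop's ContentSummary. `FileEntry`, `DirSummary`, and `summarize` are hypothetical names for illustration, not NNA's actual classes.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectorySummaries {

  // Hypothetical stand-in for a file scanned under some directory.
  record FileEntry(String dir, long length) {}

  // Small per-directory aggregate, analogous in spirit to ContentSummary.
  record DirSummary(long fileCount, long totalLength) {
    DirSummary merge(DirSummary other) {
      return new DirSummary(fileCount + other.fileCount,
                            totalLength + other.totalLength);
    }
  }

  // One compact summary per directory; no per-directory INode sets are kept.
  static Map<String, DirSummary> summarize(Stream<FileEntry> files) {
    return files.collect(Collectors.toMap(
        FileEntry::dir,                       // key: the directory
        f -> new DirSummary(1, f.length()),   // seed summary for one file
        DirSummary::merge));                  // fold files into the summary
  }

  public static void main(String[] args) {
    Map<String, DirSummary> byDir = summarize(Stream.of(
        new FileEntry("/data", 10),
        new FileEntry("/data", 30),
        new FileEntry("/logs", 5)));
    System.out.println(byDir.get("/data")); // DirSummary[fileCount=2, totalLength=40]
  }
}
```

Memory held at the end is proportional to the number of directories, not the number of files, which is the point of the change.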
This is the snippet from the latest run on a giant cluster: 400M files analyzed for various attributes in 25 minutes. I suspect we can do even better and possibly achieve 15-20 minutes.
Memory usage was also greatly reduced. This particular instance is only clocking a 220G heap at max usage despite being allocated 250G. Previously it was crashing with OOMEs, as mentioned earlier.
These are the new numbers as of the latest PR.
Shaved off another 4 minutes: 400M files are now done in 18 minutes. I believe the directories scan is as optimized as it can be. The larger remaining targets are fileAges and cachedQuotas, which I believe we can still improve.
On a particularly large cluster (over 400M files), we reached a point where an NNA installation was unable to keep up with the paired NameNode(s).
This NNA installation was able to digest the FsImage and stay up to date, but upon starting its background analysis threads it would run into massive full GCs that took the NNA instance out of service for nearly 30 minutes.
The main culprit is that the analysis sequence performed by the SuggestionEngine class keeps large result HashMaps in memory for the entire duration of the analysis. I initially calculated these maps to have a small memory footprint, but with the growing amount of analysis added, and the rather expensive directory maps as well, they have finally overwhelmed the NNA instance.
There are several things we can do to mitigate this:
(1) - Put the resulting maps from queries directly into a temporary file cache (via CacheManager class). During the sync-switch we can simply flip one file cache for another.
(2) - Put the resulting maps from queries directly into the main suggestion cache, thereby overwriting the current values immediately. This could cause numbers not to line up neatly until the entire processing pass is done.
Option 1 would likely provide the best assurance that the numbers add up correctly. Option 2 optimizes for freeing the most memory and disk space. I am open to other suggestions as well.
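The "flip one file cache for another" mechanics of option (1) could look roughly like this minimal sketch: stage each query's result map into a temporary file, then rename it into place atomically so readers only ever see a complete snapshot. NNA's actual CacheManager is not shown here; `writeAndFlip` and the file layout are assumptions made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CacheFlip {

  // Stage the serialized result map in a temp file, then "sync-switch"
  // it into place with an atomic rename. A reader opening the live file
  // sees either the old snapshot or the new one, never a partial write.
  static void writeAndFlip(Path cacheDir, String name, byte[] serializedMap)
      throws IOException {
    Path tmp = cacheDir.resolve(name + ".tmp");
    Path live = cacheDir.resolve(name);
    Files.write(tmp, serializedMap);          // stage the new snapshot
    Files.move(tmp, live,                     // flip one cache for the other
        StandardCopyOption.ATOMIC_MOVE,
        StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("nna-cache");
    writeAndFlip(dir, "suggestions", "{\"emptyFiles\":42}".getBytes());
    System.out.println(new String(Files.readAllBytes(dir.resolve("suggestions"))));
  }
}
```

This also means the analysis thread can drop its in-memory result map as soon as the file is flipped, rather than holding it until the whole analysis pass completes.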