[NativeAOT] Faster lookup for method infos on Windows. #96283
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
Issue Details: Binary search in an array of tens of thousands of records jumps all over the place and is not cache friendly. It is a noticeable expense when we stack walk and need to query for function infos using locations in the code. A B-tree-like 16-ary index, where each node fits in a cache line, could reduce the number of cache misses by about 3X.
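For illustration only (not the code from this PR), here is a minimal sketch of the idea, assuming the method infos are keyed by sorted start RVAs. It shows a single extra index level over the sorted array, where a lookup touches one small, densely packed index plus one 16-entry group of the full array; the actual change layers several such levels so that each step of the search stays within one cache line. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: one auxiliary level that keeps every 16th method start RVA.
// A lookup binary-searches the small level, then scans a single 16-entry group.
struct MethodIndex
{
    const uint32_t* rvas = nullptr;  // sorted method start RVAs (one per RUNTIME_FUNCTION)
    size_t count = 0;
    std::vector<uint32_t> level;     // rvas[0], rvas[16], rvas[32], ...

    void Build(const uint32_t* sortedRvas, size_t n)
    {
        rvas = sortedRvas;
        count = n;
        level.clear();
        for (size_t i = 0; i < n; i += 16)
            level.push_back(sortedRvas[i]);
    }

    // Returns the index of the last RVA <= ip; assumes count > 0 and ip >= rvas[0].
    size_t Lookup(uint32_t ip) const
    {
        // 1. search the small, cache-friendly index level
        size_t group = std::upper_bound(level.begin(), level.end(), ip) - level.begin() - 1;

        // 2. linear scan within one 16-entry group of the full array
        size_t i = group * 16;
        size_t end = std::min(i + 16, count);
        while (i + 1 < end && rvas[i + 1] <= ip)
            i++;
        return i;
    }
};
```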
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
To measure the impact I used the following microbenchmark. The numbers are averaged GC Gen0 pauses in milliseconds; lower is better. I see a ~10% improvement.
=== Before the change:
=== After the change:
What does this do to startup time and startup working set?
It would be best to create the secondary index at build time somehow.
I think it will be fast, but it is definitely worth measuring. It is also pay-for-play, in the sense that a program with only a few managed methods would not see much impact. The index is roughly ~2 bytes per method body, so a program would need to be pretty large for it to matter. The referenced benchmark has ~4 thousand methods and the size of the lowest index level is just ~1.5Kb; the other levels are much smaller (log16 drops fast). I will check what happens with large(ish) test cases like Concurrent.Collections. I think that has tens of thousands of methods.
Other questions to consider:
=== Some data for System.Collections.Concurrent.Tests.exe (17.5 Mb binary)
The last-level index size is 8108 elements, roughly 32 Kb. By the time we get to allocating the index we have already allocated 1.2 Mb of native memory, and managed code has not run yet at that point.
=== Stats for TodosApi.exe (19 Mb binary)
m_nRuntimeFunctionTable: 65239
The outer index size is 8155 elements, roughly the same 32Kb as with the ConcurrentCollections test. By the time we get to allocating the index we have already allocated 8 Mb (this is SVR).
Does this mean that we will page in 65239 * 8 = ~510kB to build this index? That is a lot.
IIRC, Linux ELF has some sort of secondary index already. Native AOT has a strong bias towards great startup time. It is the key characteristic that makes it different from regular CoreCLR with a JIT. I do not think we want to be regressing native AOT startup to improve GC microbenchmarks.
I completely understand the concerns. I did not measure the time though. It may tell a different story.
Also - it is possible to defer creating the index until we need it. In start-up terms it is an eternity until the first GC, so the first suspending thread could pay for building the index. We have several options for moving costs out of the startup path, but let's see if we need that.
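A minimal sketch of the deferral option, assuming hypothetical names (BuildIndex here is just a stand-in for whatever actually constructs the index): the first thread that needs the lookup builds it once, so nothing is paid at startup.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Sketch: build the lookup index lazily on first use (first GC suspension or
// first exception) instead of at module initialization.
class LazyMethodIndex
{
    std::once_flag m_built;
    std::vector<uint32_t> m_index;

    static std::vector<uint32_t> BuildIndex(const uint32_t* rvas, size_t count)
    {
        std::vector<uint32_t> index;
        for (size_t i = 0; i < count; i += 16)   // e.g. keep every 16th start RVA
            index.push_back(rvas[i]);
        return index;
    }

public:
    // Called by the first thread that needs the index (e.g. the suspending thread).
    const std::vector<uint32_t>& Get(const uint32_t* rvas, size_t count)
    {
        std::call_once(m_built, [&] { m_index = BuildIndex(rvas, count); });
        return m_index;
    }
};
```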
A preliminary report on timings (very rough estimates)
=== Based on System.Collections.Concurrent.Tests.exe (17.5 Mb binary)
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
171 microseconds and 2% of startup time seems relatively cheap overall. It is measurable though, and we do not need the index until the first exception is thrown or the first GC happens.
I was thinking about this and it looks like it could get complicated. We would need to teach the IL compiler about this optimization, find a way to get the offsets of all managed methods at compile time, build the index, have some place in the file format to store the index, and then fetch it at module initialization. Another potential problem is that a table constructed this way would cover only managed methods (which by itself is sufficient), but the actual runtime function table will contain infos for unmanaged methods too, so we would need to re-bias the indices to account for the index of the first managed method. To do such re-biasing we would need to figure out the index of the first managed method, possibly via binary search, which raises the same question - would we want to pay the cost of searching and then re-biasing at startup?
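To make the re-biasing concern concrete, here is a hedged sketch (all names are made up, not code from this PR): a compile-time index would record positions relative to the first managed RUNTIME_FUNCTION, so at runtime the module would first have to locate that entry, e.g. with a lower-bound binary search over the full table, and then shift every stored index by that amount.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the OS runtime function record.
struct RuntimeFunctionEntry { uint32_t BeginAddress; uint32_t EndAddress; uint32_t UnwindData; };

// Returns the index of the first entry whose start is at or after the managed
// code region; this value would be added to every compile-time index.
size_t FindFirstManaged(const RuntimeFunctionEntry* table, size_t count, uint32_t managedRegionStartRva)
{
    size_t lo = 0, hi = count;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].BeginAddress < managedRegionStartRva)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```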
Instead of generating the index in ILC.exe, could it be done after link.exe creates the program executable? A post-processing step could create a new section in the executable and put the index there. The obvious downside I see with this approach is that it would complicate the MSBuild target files for native AOT.
It is very rare to see a deep stack composed of tiny methods. Have you done any perf measurements for typical stacks found in real-world workloads? If we were to assume more typical methods (typical method code size is 100s of bytes), would the algorithm design be different? It may be better to allocate a single
Yes, this would be the way to build this lookup map at build time. On Windows, the new section can be a resource - it is easy to attach a new resource to a Windows executable. Having said this, creating the index as lazily as possible at runtime would work too.
What makes that benchmark somewhat unnatural is that in regular programs the stack walks are often repetitive, since often only a portion of the program's code is active, while in this benchmark nearly all method infos will be needed on every GC. I think the point was to defeat any attempts to profit from temporal locality (i.e. by caching recent lookups) by demonstrating that a program with poor method locality can easily be constructed. I do not think the average size of methods in the benchmark makes a difference though. Ultimately we are simply doing a search in a sorted array of ints. We can actually know the average distance between the ints (the method size) from the first/last int in the array and their count, if knowing it could help.
If I understand the idea correctly, it would be something like interpolation search - given an IP, we would guess the approximate entry via linear interpolation and then search in the vicinity of that entry. I think, just like with interpolation search, this would require that the distribution of sizes is very uniform - otherwise we end up with a best case that is slightly better, while the average/worst case is much worse than divide-and-conquer. I.e., roughly speaking, with 16^4 == 65536 methods we will see 4 cache misses in both the average and the worst case. I think this is close to the minimum possible and will be hard to improve by a lot.
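A minimal sketch of the interpolation-guided lookup being discussed (illustrative only, names are hypothetical): estimate the position of the IP from the first/last RVA and the entry count, then fix up with a local scan. As noted above, the fix-up can become long when method sizes are far from uniform.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the last RVA <= ip.
// Assumes count >= 2, rvas is sorted, rvas[0] <= ip <= rvas[count - 1].
size_t InterpolationLookup(const uint32_t* rvas, size_t count, uint32_t ip)
{
    uint32_t first = rvas[0], last = rvas[count - 1];

    // linear interpolation guess of the entry position
    size_t guess = (size_t)(((uint64_t)(ip - first) * (count - 1)) / (last - first));

    // walk to the last entry <= ip starting from the guess
    while (guess > 0 && rvas[guess] > ip)
        guess--;
    while (guess + 1 < count && rvas[guess + 1] <= ip)
        guess++;
    return guess;
}
```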
Right, the microbenchmark is somewhat unnatural and it demonstrates a ~8% improvement. An 8% improvement on a microbenchmark is not that much. I would like to understand the average improvement from this change in real-world workloads so that we can reason about whether the change is worth the added complexity and memory, and also whether the chosen algorithm produces the best average improvement in real-world workloads.
This change stacks very well with another improvement in stackwalking (#96150). When measured with both changes we get close to a 2X GC pause ratio compared to conservative stackwalking (on this benchmark). We started at about 3X on the ratio of precise to conservative stackwalks. I see it as an improvement, one way or another. I am not sure if anything that we consider "real world" will notice 8 percent in a specific area, but as I remember some ASP.NET benchmarks had noticeably faster throughput when run with conservative GC. There could be many explanations why, but foremost it indicates an opportunity.
My thought is that improving the performance of stackwalks incrementally should not make things worse, and with enough micro improvements we may see macro results. I am looking at the constituent parts of a typical stackwalk that can be improved. This PR is about searching for the method info, as it is a significant part of the overall cost according to the profiler. The others are unwinding and GC decoding. Touching all of that at once could be an unwieldy change. This benchmark was chosen because it is sensitive to the area being changed and useful as a repeatable measure of progress.
Can we measure how much faster these ASP.NET benchmarks are with this change, then?
I like #96150. It was a pure code optimization with no downsides. This change is a tradeoff: it allocates some extra memory to get some extra throughput in an artificial benchmark. My experience is that changes like that have a high probability of being ineffective in practice, so I am looking for proof that this is not the case.
I'm curious if this BTree approach is the best approach. Remember that RUNTIME_FUNCTION lookup is a fairly simple sorted mapping. Another approach that I've thought of is to build a table of runtime function indices where each entry is the lowest possible index for a given region of memory. Then we could layer that on the existing RUNTIME_FUNCTION table, and get O(1) lookup to some level of granularity, and then a very short binary walk to find the exact target. For instance, size the index table so that it spans the .TEXT region at a granularity such that the average space between indices is 8 or 16 RUNTIME_FUNCTIONs. Then the index table becomes a single lookup, followed by a normal scan in a much smaller region of the RUNTIME_FUNCTION table.
I agree with @jkotas that we need a real world scenario before doing too much allocation and such for all of this, though.
Ahh, I see that @jkotas already suggested my idea, although he proposed a larger amount of memory per entry in the interpolation table. What I've seen is that while managed methods can be highly variable in size, they are uniform enough in practice for an interpolation-style approach to be quite effective. In fact, my guess is that it's likely more efficient than a BTree. If I were benchmarking this, I would try the interpolation approach + a linear scan of RUNTIME_FUNCTION entries in the possible region. My guess is that you will find through experimentation that it isn't cache misses you need to avoid, but branch mispredicts, and an interpolation table + a single linear scan will minimize those.
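A hedged sketch of the bucketed-index idea described above (illustrative only, names made up, not code from this PR): split the code region into fixed-size chunks, record for each chunk the last RUNTIME_FUNCTION index that begins at or before the chunk start, then do one O(1) table read followed by a short linear scan of the sorted start RVAs.

```cpp
#include <cstdint>
#include <vector>

struct RegionIndex
{
    uint32_t regionStart = 0;
    uint32_t shift = 0;                  // log2 of chunk size; pick it so a chunk spans ~8-16 methods
    std::vector<uint32_t> chunkToIndex;  // chunk -> last entry starting at or before the chunk start

    void Build(const uint32_t* rvas, size_t count, uint32_t codeStart, uint32_t codeSize, uint32_t chunkShift)
    {
        regionStart = codeStart;
        shift = chunkShift;
        size_t chunks = ((size_t)codeSize >> chunkShift) + 1;
        chunkToIndex.resize(chunks);
        size_t i = 0;
        for (size_t c = 0; c < chunks; c++)
        {
            uint32_t chunkStart = codeStart + (uint32_t)(c << chunkShift);
            while (i + 1 < count && rvas[i + 1] <= chunkStart)
                i++;
            chunkToIndex[c] = (uint32_t)i;
        }
    }

    // Returns the index of the last RVA <= ip; assumes count > 0 and ip is within
    // the code region at or after the first method start.
    size_t Lookup(const uint32_t* rvas, size_t count, uint32_t ip) const
    {
        size_t i = chunkToIndex[(ip - regionStart) >> shift];
        while (i + 1 < count && rvas[i + 1] <= ip)
            i++;
        return i;
    }
};
```

With the chunk size chosen so a chunk covers a handful of methods, the forward scan stays short and mostly branch-predictable, which is the point being made above.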
This is getting old. I do not have any new ideas here.
Binary search in an array of tens of thousands of records jumps all over the place and is not cache friendly. It is a noticeable expense when we stack walk and need to query for lots of function infos.
A B+ tree-like 16-ary index, where each node fits in a cache line, could reduce the number of cache misses by about 3X.