-
-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collision of hashed IDs #4612
Comments
Dear ImGui has used the same hashing algorithm (CRC32) since it's initial release in 2014. If nobody has ever had a major unexpected hash collision in the past seven years, you probably will not be the first.
Keep in mind that IDs are stored in a stack. You can use Out of curiosity I wrote a dirty little test to check how long it would take to create a collision using the pattern set out in the FAQ for submitting many widgets programmatically: #include "imgui_internal.h"
#include <unordered_set>
static void GH4612()
{
ImGui::Begin("GH-4612");
std::unordered_set<ImGuiID> hashes;
ImGuiWindow* window = ImGui::GetCurrentWindow();
for (int i = 0; i >= 0; i++)
{
if (i % 1000 == 0)
{
printf("%d...\n", i);
}
ImGuiID id = window->GetIDNoKeepAlive(i);
if (!hashes.insert(id).second)
{
printf("Hash collision happened for %d (hash = %d)\n", i, id);
IM_ASSERT(false);
}
}
printf("No hash collisions for all positive integers.\n");
ImGui::End();
} This test went through around 536,870,000 tests in ~10 minutes before it consumed so much memory that my computer locked up trying to swap things in and out of the page file and I had to force reboot it and rewrite this entire comment. 😅 I'm no statistics expert, but I think that means you're more likely to win the lottery than you are to stumble upon a hash collision in Dear ImGui. Finally, if it's still keeping you up at night, you could pretty easily replace the implementation of |
I think eventually we'll switch to using a lightweight 64-bit hash function to reduce that risk further. I honestly expected people to see CRC32 in profiler and that would be a reason to change, but even though CRC32 is quite unoptimal (and essentially hogs 1 KB of data in cache) that hasn't happened. Several people have suggested to replace CRC32 but seemingly everyone did it from the theoretical point of view KNOWING that CRC32 is unoptimal, not from the practical point of view of seeing CRC32 in profiler measurements. |
@PathogenDavid: Thanks for your response. And mentioning the enormous usage since the very first beginning of Dear ImGui without big complains concerning the ID collision is indeed a strong hint. Your test program is even better. I enhanced your idea, using the word-list from https://github.com/dwyl/english-words and calling ImHashStr for all words with the same stack-base-id (tested 1 til 15). For every used stack-base-id i got between twenty and thirty collisions (the word-list contains ca. 465.000 words). Mostly of them absolutely unrelated and extremly seldom used expressions. But some of them could be (even theoretically) a folder/file-name in the same namespace and therefore same stack-base-id. Collision-Examples (in the order: Seed, Word A, Word B):
@ocornut : Thanks for your response, Maestro - impressive work done with this fine peace of software. Your idea on more bits is interesting but it does not help the general question. I have no problem with the implementation of the hash method itself, but with the possibility of collisions. And even another hash method with more bits and with a more complex algorithm could (AND WOULD) have collisions. That is the nature of hash-functions. No, the current hash method is fine (it needs still fast enough to calculate thousands of widgets in every frame). But to put it in a game/product and/or to use it for example on a explosion-hazardous area and/or to get a specific certificate the software needs to be absolutely reliable. That is not possible with the current hash-method. But that is the only thing i have seen in the code, which does not meet the high standard of all other components of Dear ImGui (I don't mean that in a bad way, I just don't have the right words here. For most users, everything seems to work fine). And the simplest reliable solution would be to let the developer provide the IDs. I think, next days I will make a try. I will implement functions to provide additionally(!) a custom ID and try to convert the demo app to this style. If there is some interests on the implementation, I can provide it, when I am ready. Perhaps I come to another conclusion during that work. |
@ocornut I am not an expert at cache stuff, but in theory wouldn't the issue potentially hide in other parts of the program because the processor is smart enough to keep the CRC table warm in the cache at the expense of other things?
@Triseltan Speaking as someone who has worked on safety-critical software before (avionics specifically) and was responsible for a lot of the safety certification aspect of our software: Dear ImGui should never be used in a safety-critical context. |
@PathogenDavid : I don't think, this is a place to discuss this, but I can understand your point. I looked into the using software list and it is just astonishing. I can remember long ago (I follow the project since the beginning), when @ocornut first time mentioned that he heard, Blizzard has used Dear ImGui ... or when someone found a mention in some nintendo tools. And with Sponsors like Blizzard and Ubisoft ... who am I to overthink such successful people (Blizzard itself uses a string hash algorithm inside their mpq packages ... and rename a resource during development if there are a collision). |
The experiment with user-generated IDs. Basically I can say: It works. However, there are three major limitations. First: There is the well-known problem of how to generate the IDs. Since the beginning of the idea of Immediate GUIs the necessary unique assignment of a widget created 'on-the-fly' over several frames was a problem. Original ideas like the Secondly, it is extremely time-consuming to change all the necessary functions. Dear ImGui is simply not designed to pass IDs (except as a new seed for the stack, or encapsulated as a string in the ### possibility). Moreover, quite a few IDs are also generated hidden in the middle of various functions (#MOVE, #CLOSE, #COLLAPSE, ##Menu_..., ##Popup_... and so on). There one would have to intervene also, since a mixing of hash values and user-generated IDs could lead again to collisions, which we want to avoid straight. Last but not least, the IDStack would become obsolete, but it is used for other operations besides ID assignment (comparisons, evaluations, etc.). Thirdly: The particularly simple work with Dear ImGui would become a tedious, time-consuming work, because you have to generate IDs at every point, remember them (static or in objects with an additional ID field or as an extra maintained enumeration) and still have to pass them. But it is this simple, fast work with Dear ImGui that brought it to this immense success. It is thanks to @ocornut that he has consistently relied on automatic generation of the IDs right from the start of the project. And I am convinced that this was the decisive factor to prefer Dear ImGui to all other immediate mode GUIs. In conclusion, my idea would be "somehow" implementable ... but except for the preventing collision it would create a lot of new problems and a lot of opaque code. Lastly: I have another idea in the background, which I will present shortly. |
Trying with a symbol tableSince I'm using Dear ImGui for some side projects, I'm thinking about how to eliminate collisions. At the very beginning I thought about a symbol table, like they are used in in compiler construction (but there for storing identifiers and detecting duplicate identifiers ... exactly these collisions). But then I discarded it, because the effort seemed too huge. But now I would like to use Dear ImGui professionally (and failed with the alternative user-generated IDs - see comment above), so I looked into it again. What makes them so fascinating to me, and just right for tracking UI elements, are the following reasons:They have been around for a very long time and have been studied since the first compilers, support scopes, are designed to find entries with specific string as fast as possible and insert new entries without wasting time, and store entries of different length while not wasting too much memory. To make a long story short, it worked better than expected.I had to rework all GetID and PushID calls (especially tables and loading/saving the settings data was a bit tricky). It's not perfect by any means (see open issues below), but it's already working quite reliably. How does it work: My original plan was to continue using the IDs generated so far, just including a sort into the symbol table. On the one hand this would have had the charm that all IDs would have remained the same, but on the other hand it would have had a big drawback: All comparisons of IDs (just generated ID, ActiveID, MoveID, etc.) would have had to compare the ID everywhere first and - if they were the same - would have had to search the symbol table. Making changes everywhere would have been extremely time-consuming and also bad for further maintenance. Fortunately, I came up with a better idea. Instead of using the generated ID, the offset in the symbol table is used (all "symbols" with the same ID are linked together in a simple concatenated list, but stored in different places in the symbol table memory). A symbol entry is (currently) composed as follows:
Here is a picture of the current development (left is an excerpt of the symbol table - with the info in each case to offset, stack offset aka scope, type and data ): Open points are still
Interesting: While testing with the symbol table I already found a first collision in the current code (Synced Tables: First Column Header of the first table and First Column Header of the second table). Here is the GIF for it. First I hover over one, then the other element, then hover + click on the first, then the second widget. (btw the symbol table version prevented exactly this collision in the test) p.s. I always like to give the thoughts and inner workings of my coding. If that is not desired, or there is generally not much interest in the topic of avoiding collisions, feel free to close the issue. |
(Edited a paragraph out, had previously misunderstood your last example).
Everything in computing and security and cryptography uses hashes. A good enough one-way hash function is assumed non-colliding and if we can base our whole modern digital infrastructure on that idea I am sure dear imgui could as well. It’s just a matter of using a slightly fancier function than crc32.
In spite of your first message stating “I don’t want to reinvent the wheel”, Right now I feel you are manufacturing your own problem.
|
I'm a bit late to the party and not sure if the "not-the-best-performant-hash" issue is still valid, but it can be optimized for strings known at compile time. For comparison: // Note HashedString and Hash are made-up names
HashedString str = Hash("str"); // "str" is a string known at compile-time, calculate hash at compile time
std::string str = "str";
HashedString str = Hash(str); // str is NOT a string known at compile-time, fall back to runtime calculations The idea is explained on this article and EnTT offers an implementation. The TL;DR is that |
No it cannot. You are misunderstanding how the ID stack system of dear imgui works. Every hash is seeded by a value depending on the context. They are not "absolute" hashes. https://github.com/ocornut/imgui/blob/master/docs/FAQ.md#q-about-the-id-stack-system |
My apologies, you are right. I wasn't totally aware about how the id stack works and misunderstood it. |
I'll close this topic as IHMO at this point is an undesirable hogging of attention. I do not disagree that we could eventually switch to a 64-bit hash (I think we will eventually, both to reduce collision and to use code which doesn't hog a kilobyte of data in cache) but specifically the author of this topic stated it didn't matter to them and would rather engage in reinventing a wheel they decided didn't work. This kind of research is great and feel free to post here if you want to share it, but we can't provide support for a non-issue. |
I think it would be nice to have some numbers for the actual probability of random collision for an ideal 32-bit hash function in dependence on the number of samples:
from https://en.wikipedia.org/wiki/Birthday_attack The problem is quite real. |
Also #74, #96, #480, #501, #647, #654, #719, #843, #894, #1057, #1173, #1390, #1414, #1556, #1768, #2041, #2116, #2330, #2475, #2562, #2667, #2807, #2885, #3102, #3375, #3526, #3964, #4008, #4070, #4158, #4172, #4199, #4375, #4395, #4471, #4548, #4612, #4631, #4657, #4796, #5210, #5303, #5360, #5393, #5533, #5692, #5707, #5729, #5773, #5787, #5884, #6046, #6093, #6186, #6223, #6364, #6387, #6567, #6692, #6724, #6939, #6984, #7246, #7270, #7375, #7421, #7434, #7472, #7581, #7724, #7926, #7937 and probably more.. Tagging to increase visibility!
I want to use ImGui in my product, but i fear the collision of the hashed IDs.
This is not a point of - for example - buttons, which i create during development. These collisions are detected right there and make no problems. The problem can arise, if there are any user generated content. Like an hierarchical scene-object treeview or a file explorer. If the user can not select a file, because another one is always selected first, that would be just annoying. But when the wrong functionality is executed because of such collision, that would be already a NoGo.
One possibility would be to change the code to accept at all places also customer IDs (not to confuse with the stringification at end with ##). There are only some places, where GetID() is called, it would be easy to adapt the functions to this. Are there any other possibilities or perhaps hints, how other users work with this 'fear'?
Greetings, and sorry, for my bad english ... not my native one.
p.s. I have alreday worked with ImGui (small app for live-testing of GLSL and some debugging ui) and it works like a charm. Thats why I do not want to reinvent the wheel again.
The text was updated successfully, but these errors were encountered: