
json.nim: smaller init size #15048

Merged 1 commit into nim-lang:devel on Jul 23, 2020
Conversation

narimiran
Member

There was a recent rightSize change in tables.nim, so the existing
value (4) was creating too large tables.


See #14995 (comment)

@Araq merged commit 14d16c2 into nim-lang:devel on Jul 23, 2020
@hashbackup

The real bug is in `slotsNeeded` in hashcommon.nim: `nextPowerOfTwo(count * 3 div 2 + 4)`

`slotsNeeded` should add 1, not 4 (no idea where 4 came from). Then requesting a 1-item table gives 2 slots, 2 items -> 4 slots, 3 items -> 8 (not ideal; it could be 4), 4 items -> 8, etc. A 4-item table needs an empty slot, so 8 is the smallest possible. As written, `slotsNeeded` yields 8 slots for a 1-3 item table, 16 for a 4-item table, and so on. It doesn't matter for large tables, of course, but for tiny tables it does.

Then in `mustRehash`, instead of `(t.dataLen - t.counter < 4)`, it should be `(t.dataLen - t.counter < 2)`. The reason is that `mustRehash` is called just before an insert, so there needs to be 1 free slot for the new item plus 1 free slot for a vacant entry to terminate missing-key searches.
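The arithmetic above can be sketched in Python (a model of the Nim formula, not the stdlib code; the `pad` parameter is mine, standing in for the constant under discussion):

```python
def next_power_of_two(x: int) -> int:
    # Smallest power of two >= x (models Nim's math.nextPowerOfTwo for x >= 1).
    p = 1
    while p < x:
        p *= 2
    return p

def slots_needed(count: int, pad: int) -> int:
    # Nim: nextPowerOfTwo(count * 3 div 2 + pad); pad is 4 today, 1 proposed.
    return next_power_of_two(count * 3 // 2 + pad)

# With pad = 4 (current): tiny tables are twice as large as needed.
print([slots_needed(n, 4) for n in (1, 2, 3, 4)])  # [8, 8, 8, 16]
# With pad = 1 (proposed): 1 item -> 2 slots, 2 -> 4, 3 -> 8, 4 -> 8.
print([slots_needed(n, 1) for n in (1, 2, 3, 4)])  # [2, 4, 8, 8]
```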

@narimiran
Member Author

@c-blake your opinion on the above? (You're the author of mustRehash, IIRC)

@c-blake
Contributor

c-blake commented Jul 23, 2020

I believe I left mustRehash completely as @Araq wrote it so many years ago, the 2/3 and the 4 included, though re-arrangement of which code is in which files might obscure that. :-) I did add the old `rightSize`, now `slotsNeeded`, though.

I agree that "minimal tables" are sometimes useful when you have zillions of small tables. I could see json being one such case. So, I am +1 on the idea of lowering the minimum size.

I would only say two things. A) Make sure that mustRehash and slotsNeeded are inverses in the proper sense, in that mustRehash(slotsNeeded(n)) == false, and that the boundary is tight; mustRehash(slotsNeeded(n) + 1) == true expresses that, I think. Maybe not. What I mean by "tight" is clear, though: never waste memory. I had some tests about that in Nim at some point, but I could not find them when I looked for them recently.
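That invariant can be checked without building tables at all. Here is a Python model of the proposed formulas (a sketch, with tightness phrased as "the next smaller power of two would already need a rehash"; the 0- and 1-item cases are excluded because they turn out to be special):

```python
def next_power_of_two(x: int) -> int:
    # Smallest power of two >= x (models Nim's math.nextPowerOfTwo for x >= 1).
    p = 1
    while p < x:
        p *= 2
    return p

def slots_needed(count: int) -> int:
    # Proposed: nextPowerOfTwo(count * 3 div 2 + 1)
    return next_power_of_two(count * 3 // 2 + 1)

def must_rehash(data_len: int, counter: int) -> bool:
    # Proposed: (t.dataLen * 2 < t.counter * 3) or (t.dataLen - t.counter < 2)
    assert data_len > counter
    return data_len * 2 < counter * 3 or data_len - counter < 2

for n in range(2, 1000):
    s = slots_needed(n)
    # slotsNeeded never produces a table that immediately needs a rehash...
    assert not must_rehash(s, n)
    # ...and is tight: the next smaller power of two would need one.
    if s // 2 > n:
        assert must_rehash(s // 2, n)
```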

And B) it might be possible to do even better than the idea here and have an empty table use no space at all (like seq[T]); maybe its first non-zero .data.len needs to be 2 in order to terminate searches with an empty slot. I don't know if the json stuff has any purely empty tables or not. Even if json does not, it is easy to imagine it, especially with zillions of instances, when the per-instance cost matters most. (We might be able to save HCell.sizeof * 2 * zillions.)

Beyond these two things, while I have your attention, I feel I should say that in my adix I was experimenting, successfully I think, with an entirely alternate resize protocol that better leverages great hash functions relative to key sets. The idea there is to measure search depth on every insert and only grow the table when that depth is greater than some rational number numer/denom multiplied by log2(data.len) (the natural scale of the worst case search in open addressed hash tables, Robin Hood or not).

For example, if you happen to have a great hash, like -d:nimHashInt1 and a dense sequence of int keys, one of @Araq's favorite examples, then you get to use "100% - 1" slots of data[] with only length 1 collision clusters! Pretty great, right?

Besides that search depth measurement and growth rule, you do need a "memory attack" block for cases where, for example, someone set hash(x)=1234 by accident. So, in addition to searches being "too deep/long" the table also has to be > 50% occupied. So, then a runaway expansion from a bad hash is impossible. Rust used something like this for several years (like 2016-2019) of high profile people banging on their table. I think at first they forgot the memory attack block, but you know...Buncha FOSS people learning their way as they go. :-)
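A rough sketch of that policy (the names, the numer/denom default, and the exact guard are illustrative, not adix's actual API):

```python
import math

def too_deep(depth: int, data_len: int, numer: int = 4, denom: int = 1) -> bool:
    # Grow when an insert probes deeper than (numer/denom) * log2(data.len),
    # the natural scale of the worst-case search in open addressing.
    return depth > (numer / denom) * math.log2(data_len)

def should_grow(depth: int, data_len: int, counter: int) -> bool:
    # The "memory attack" block: even a too-deep search only triggers growth
    # when the table is more than 50% occupied, so a degenerate hash
    # (e.g. hash(x) == 1234 for every x) cannot cause runaway expansion.
    return too_deep(depth, data_len) and counter * 2 > data_len

# 1024 slots: tolerate probes up to 4 * log2(1024) = 40 deep...
print(should_grow(41, 1024, 600))  # True
# ...but never grow a mostly empty table, however bad the hash is.
print(should_grow(41, 1024, 100))  # False
```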

Anyway, a hash over-tuned for expected keys that winds up wrong about realized keys can cause other problems and adix has other mitigations for that, but that is all a little independent of the growth policy which is what you asked about.

@hashbackup

As a double-check, I changed slotsNeeded and mustRehash in hashcommon.nim to:

proc mustRehash[T](t: T): bool {.inline.} =
  # If this is changed, make sure to synchronize it with `slotsNeeded` below
  assert(t.dataLen > t.counter)
  result = (t.dataLen * 2 < t.counter * 3) or (t.dataLen - t.counter < 2)

proc slotsNeeded(count: Natural): int {.inline.} =
  # Make sure to synchronize with `mustRehash` above
  result = nextPowerOfTwo(count * 3 div 2 + 1)

Then added this tiny table test to tables.nim:

  block: # test tiny table sizes
    for i in 0 ..< 10:
      var t = initTable[int, int](i)
      for j in 0 ..< i + 10:
        t[j] = j
        doAssert t.len == j+1
        doAssert not (1000 in t)
        doAssert j in t

It works. But if t.counter < 2 is changed to t.counter < 1, it hangs, because there is not a vacant slot for missing keys so the search for key 1000 never terminates.

What is harder to check is that it is resizing to the minimum possible size, because there is no t.slots to get the number of slots. I made t.data public so t.data.len is visible, added an echo to the test, and got this output (this is echoed after the insert occurs):

i=0 items= 1 slots=2 j=0
i=0 items= 2 slots=4 j=1
i=0 items= 3 slots=4 j=2
i=0 items= 4 slots=8 j=3
i=0 items= 5 slots=8 j=4
i=0 items= 6 slots=8 j=5
i=0 items= 7 slots=16 j=6
i=0 items= 8 slots=16 j=7
i=0 items= 9 slots=16 j=8
i=0 items= 10 slots=16 j=9
i=1 items= 1 slots=2 j=0
i=1 items= 2 slots=4 j=1
i=1 items= 3 slots=4 j=2
i=1 items= 4 slots=8 j=3
i=1 items= 5 slots=8 j=4
i=1 items= 6 slots=8 j=5
i=1 items= 7 slots=16 j=6
i=1 items= 8 slots=16 j=7
i=1 items= 9 slots=16 j=8
i=1 items= 10 slots=16 j=9
i=1 items= 11 slots=16 j=10
i=2 items= 1 slots=4 j=0
i=2 items= 2 slots=4 j=1
i=2 items= 3 slots=4 j=2
i=2 items= 4 slots=8 j=3
i=2 items= 5 slots=8 j=4
i=2 items= 6 slots=8 j=5
i=2 items= 7 slots=16 j=6
i=2 items= 8 slots=16 j=7
i=2 items= 9 slots=16 j=8
i=2 items= 10 slots=16 j=9
i=2 items= 11 slots=16 j=10
i=2 items= 12 slots=32 j=11
i=3 items= 1 slots=8 j=0   <--- this could be 4 slots but the math doesn't work
i=3 items= 2 slots=8 j=1   <--- and would have to be special-cased for 3 items
i=3 items= 3 slots=8 j=2
i=3 items= 4 slots=8 j=3   <--- here's where resize from 4 to 8 should occur
i=3 items= 5 slots=8 j=4
i=3 items= 6 slots=8 j=5
i=3 items= 7 slots=16 j=6
i=3 items= 8 slots=16 j=7
i=3 items= 9 slots=16 j=8
i=3 items= 10 slots=16 j=9
i=3 items= 11 slots=16 j=10
i=3 items= 12 slots=32 j=11
i=3 items= 13 slots=32 j=12
i=4 items= 1 slots=8 j=0
i=4 items= 2 slots=8 j=1
i=4 items= 3 slots=8 j=2
i=4 items= 4 slots=8 j=3
i=4 items= 5 slots=8 j=4
i=4 items= 6 slots=8 j=5
i=4 items= 7 slots=16 j=6
i=4 items= 8 slots=16 j=7
i=4 items= 9 slots=16 j=8
i=4 items= 10 slots=16 j=9
i=4 items= 11 slots=16 j=10
i=4 items= 12 slots=32 j=11
i=4 items= 13 slots=32 j=12
i=4 items= 14 slots=32 j=13
i=5 items= 1 slots=8 j=0
i=5 items= 2 slots=8 j=1
i=5 items= 3 slots=8 j=2
i=5 items= 4 slots=8 j=3
i=5 items= 5 slots=8 j=4
i=5 items= 6 slots=8 j=5
i=5 items= 7 slots=16 j=6
i=5 items= 8 slots=16 j=7
i=5 items= 9 slots=16 j=8
i=5 items= 10 slots=16 j=9
i=5 items= 11 slots=16 j=10
i=5 items= 12 slots=32 j=11
i=5 items= 13 slots=32 j=12
i=5 items= 14 slots=32 j=13
i=5 items= 15 slots=32 j=14
i=6 items= 1 slots=16 j=0
i=6 items= 2 slots=16 j=1
i=6 items= 3 slots=16 j=2
i=6 items= 4 slots=16 j=3
i=6 items= 5 slots=16 j=4
i=6 items= 6 slots=16 j=5
i=6 items= 7 slots=16 j=6
i=6 items= 8 slots=16 j=7
i=6 items= 9 slots=16 j=8
i=6 items= 10 slots=16 j=9
i=6 items= 11 slots=16 j=10
i=6 items= 12 slots=32 j=11
i=6 items= 13 slots=32 j=12
i=6 items= 14 slots=32 j=13
i=6 items= 15 slots=32 j=14
i=6 items= 16 slots=32 j=15
i=7 items= 1 slots=16 j=0
i=7 items= 2 slots=16 j=1
i=7 items= 3 slots=16 j=2
i=7 items= 4 slots=16 j=3
i=7 items= 5 slots=16 j=4
i=7 items= 6 slots=16 j=5
i=7 items= 7 slots=16 j=6
i=7 items= 8 slots=16 j=7
i=7 items= 9 slots=16 j=8
i=7 items= 10 slots=16 j=9
i=7 items= 11 slots=16 j=10
i=7 items= 12 slots=32 j=11
i=7 items= 13 slots=32 j=12
i=7 items= 14 slots=32 j=13
i=7 items= 15 slots=32 j=14
i=7 items= 16 slots=32 j=15
i=7 items= 17 slots=32 j=16
i=8 items= 1 slots=16 j=0

Sorry for the length, didn't know a good way to verify the resizing behavior.

@hashbackup

hashbackup commented Jul 23, 2020

I added this test to tables.nim (this was preliminary; see actual test 8 posts down):

  block: # test that mustRehash and slotsNeeded are matched
    for i in 0 ..< 10:
      var t = initTable[int, int](i)
      for j in 0 ..< 100:
        if (slotsNeeded(t.len) > t.slots) == t.mustRehash:
          discard
        else:
          echo "fail 1: ", i, ' ', j, ' ', slotsNeeded(t.len), ' ', t.slots, ' ', t.mustRehash
        if (slotsNeeded(t.len) == t.slots) == not t.mustRehash:
          discard
        else:
          echo "fail 2: ", i, ' ', j, ' ', slotsNeeded(t.len), ' ', t.slots, ' ', t.mustRehash
        t[j] = j

Those two ifs would become doAssert. There are 2000 tests and 42 failures that I'm going to look at.

@c-blake does this test look right? I had to add a slots proc, and since mustRehash uses the current table slots and counts, it's a little weird.

@c-blake
Contributor

c-blake commented Jul 23, 2020

Well, I don't know why you are even calling initTable and t.len and all that. I would think you can just call slotsNeeded (formerly rightSize) and mustRehash. That's what my old test did. It's probably in git history...

Another thing I should say (maybe implicit in my prior comment) is that there is absolutely NO WAY for Table or HashSet to magically figure out what space-time trade-off client code will want. I know that @Araq totally knows/gets this to his bones, but it bears mention here. 2/3 is just a guess; 3/4, 7/8, or 1/2 would be too. Growth by 2x is also a guess; Python used to grow 4x each round. I have seen all of the above and more. This is all guesswork, and the current situation is not parameterized at all. No expert knobs, just some not-too-bad heuristic setting. A bunch of knobs surely creates a bunch of questions and explanations, of course, and we're in a tizzy over dup keys lately. So, beats me what's best. I included a bunch of expert knobs in adix, but I also have like 0 users.

Since we are talking about optimizing the way growth works, someone should mention these expert knobs again. Many language runtimes don't even let you presize (Python, for example), which can make a pretty big perf delta. And the Nim stdlib currently doesn't "autoshrink" a table, or even let client code interrogate its load factor to rebuild a denser one. So, if you delete a bunch of stuff and then iterate over the table many times, you will not have great performance. Right now deleters have to "shadow" the table population as they add/set/del with a wrapper object, since data.len is not available. Not great.

@hashbackup

Well, I don't know why you are even calling initTable and t.len and all that. I would think you can just call slotsNeeded (formerly rightSize) and mustRehash.

mustRehash doesn't take args; it uses the current table size and counts.

The 2nd if should have used <=, not ==. Now there are 6 failures and they are for cases where there are 0 or 1 items in the table - probably special cases because of the + 2 in mustRehash:

fail 1: req=0 len=0 slots=1 needed=1 rehash=true
fail 2: req=0 len=0 slots=1 needed=1 rehash=true
fail 1: req=0 len=1 slots=2 needed=2 rehash=true
fail 2: req=0 len=1 slots=2 needed=2 rehash=true
fail 1: req=1 len=1 slots=2 needed=2 rehash=true
fail 2: req=1 len=1 slots=2 needed=2 rehash=true

@c-blake
Contributor

c-blake commented Jul 23, 2020

I feel like mustRehash used to take the cur size + 1 or something. Maybe that's why they deleted my tests. ;-)

@hashbackup

hashbackup commented Jul 23, 2020

See new version of tiny table test below.

It doesn't detect changing the +2 to +1 in mustRehash, probably because of the >= 2 guards, but it does assert if +2 is changed to +3, or if the *2/*3 factors are changed.

@c-blake
Contributor

c-blake commented Jul 23, 2020

That's looking better. Maybe mustRehash should use + 1? Or is there otherwise some off-by-one between the two directions, from discreteness? Just brainstorming. I haven't thought much about the edge cases since I first wrote rightSize. As I mentioned, we may want to leap from 0 to 2 slots, then grow from 2 on up.

@hashbackup

hashbackup commented Jul 24, 2020

+1 does not work if size requested is 0 or 1.

For the 0 case, the initial size will be 1. When mustRehash is called to add a new item, 1-0 is not less than 1 so no resize. But you cannot add an item to a 1-slot table because search will hang on missing keys.

For the 1 case, the initial size will be 2. Before first insert, in mustRehash 2-0 is not less than 1 so no resize - correct. On the 2nd insert, 2-1 < 1 is still false, so no resize again. But this is incorrect, because there would be no vacant slot and you'd get a hang again.
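Both failure cases can be traced numerically with a model of the proposed mustRehash (Python here just for the arithmetic; `min_free` is my name for the free-slot threshold under discussion):

```python
def must_rehash(data_len: int, counter: int, min_free: int) -> bool:
    # Model of: (t.dataLen * 2 < t.counter * 3) or (t.dataLen - t.counter < min_free)
    return data_len * 2 < counter * 3 or data_len - counter < min_free

# Size-1 table (request 0), inserting the first item with min_free = 1:
# 1 - 0 = 1 is not < 1, so no resize, but a full 1-slot table has no
# vacant slot, and missing-key searches would never terminate.
assert not must_rehash(1, 0, 1)   # no grow: wrong, search hangs
assert must_rehash(1, 0, 2)       # min_free = 2 forces the grow

# Size-2 table (request 1), inserting the second item:
assert not must_rehash(2, 1, 1)   # no grow: table becomes full, hang again
assert must_rehash(2, 1, 2)       # min_free = 2 grows to 4 slots first
```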

See the test added below for tiny tables:

But if t.counter < 2 is changed to t.counter < 1, it hangs

Also:

Then in mustRehash, instead of (t.dataLen - t.counter < 4), it should be (t.dataLen - t.counter < 2). The reason is that mustRehash is called just before an insert, so there needs to be 1 free slot for the new item plus 1 free slot for a vacant entry to terminate missing key searches.

@hashbackup

After thinking about it, it may be a good idea to change the items requested in json.nim all the way down to 1. Then if someone uses lots of single-item objects, there will only be two slots instead of 4 (when 2 items are requested on init).

I tested this by modifying json.nim to use (2), as committed, and the old hashcommon:

ms:json jim$ /usr/bin/time -l ./test
(x: 0.5004211581104818, y: 0.4996713815680409, z: 0.500270481798719)
        2.97 real         2.54 user         0.41 sys
1055465472  maximum resident set size

With json(1) and new hashcommon:

ms:json jim$ /usr/bin/time -l ./test
(x: 0.5004211581104818, y: 0.4996713815680409, z: 0.500270481798719)
        2.94 real         2.58 user         0.35 sys
 896163840  maximum resident set size
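For scale, the two peak-RSS figures above work out to roughly a 15% saving:

```python
# Peak RSS from the two /usr/bin/time -l runs above, in bytes.
before = 1_055_465_472   # json(2), old hashcommon
after = 896_163_840      # json(1), new hashcommon
print(f"{(before - after) / before:.1%} less peak memory")  # 15.1% less peak memory
```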

@hashbackup

hashbackup commented Jul 24, 2020

The test for slotsNeeded matching mustRehash should be this (gotta still do the insert):

  block: # test that mustRehash and slotsNeeded are matched
    for i in 0 ..< 100:
      var t = initTable[int, int](i)
      for j in 0 ..< 1000:
        if i >= 2 or j >= 2:
          doAssert (slotsNeeded(t.len) > t.data.len) == t.mustRehash
          doAssert (slotsNeeded(t.len) <= t.data.len) == not t.mustRehash
        t[j] = j

@c-blake
Contributor

c-blake commented Jul 24, 2020

I found my old test. I guess TC removed it, and I neglected to restore it in my reversal of all his pseudorandom probing churn back in February/March: 8c22518.

Seems that was more one-sided anyway, not checking "tightness"/non-wasting of space. So maybe there should be two loops, one checking one side and the other checking the other, or something like your test above.

Honestly, I think moving mustRehash back to taking parameters, and then having a simple formal relationship between mustRehash() and slotsNeeded() as inverses (or maybe "tight" semi-inverses is the right term), would be the clearest way to re-org this; or maybe with some off-by-one aspect in there, if you want mustRehash to take the pre-incremented counter. Besides being clearer, you would not need to create tables to test the relationship.

@c-blake
Contributor

c-blake commented Jul 24, 2020

Also, since we do not know how big data slots are, the idea of a data-allocated-on-first insert should not be forgotten. But it can just jump to 2 slots.

This change has broader implications, like checks on lookups for an empty table. (It might have a tiny impact on larger tables, since they always have to check to support the possibility that nothing is there, and people do seem to care more about even tiny perf deltas on large tables than about memory-optimizing tiny ones.)

@hashbackup

Honestly, I think moving mustRehash back to taking parameters

One thing I like about passing the table to functions like mustRehash is that, if the decision becomes more complicated, mustRehash will have access to all the data needed. You do this in adix. So if loadFactor, maxProbeDepth, etc. are added as initTable options, mustRehash will have access to them without adding new parameters.

This change has broader implications like checks on lookups for an empty table

What implications? These checks are already present in the table code. For lookups, it checks to see if t.data.len is 0 and returns false (lookup failed). For inserts, it calls checkIfInitialized before calling mustRehash, and if t.data.len is 0, it calls initTable with the defaultInitialSize.

the idea of a data-allocated-on-first insert should not be forgotten. But it can just jump to 2 slots.

This already happens, as explained, though instead of calling initTable with 1 (which would create a 2-slot table), it uses defaultInitialSize. That's probably right for most use cases, though for the JSON case, creating a 64-slot table may not be ideal. So I had to test that :-) by changing checkIfInitialized to create a 1-item table instead and running the JSON benchmark. It didn't make any difference.

@c-blake
Contributor

c-blake commented Jul 24, 2020

These checks are already present in the table code.

I stand corrected. :-)

It's true that having a handle on the table gives you more options. But adix also resizes based on search depth, not on a load-factor guess as a proxy for search depth. So there, this slotsNeeded-mustRehash inverse relation is not important. My re-org idea was really contingent on staying with the current "control bad perf with a load-factor guess" approach instead of "bound it directly, but with harder-to-predict space usage". No one really responded to that idea. :-)

@c-blake
Contributor

c-blake commented Jul 24, 2020

Anyway, a TLDR for @narimiran - I am 100% behind a 0/2-slot minimum table, and we seem on the way to having a good, smooth, minimal growth policy in the old control-the-load-factor domain, but there are many possibilities related to how smart/expert we want stdlib tables to be.

@narimiran narimiran deleted the json-smaller-init-size branch December 6, 2020 20:35
mildred pushed a commit to mildred/Nim that referenced this pull request Jan 11, 2021