
json.nim: smaller init size #15048

Merged 1 commit into nim-lang:devel on Jul 23, 2020
Conversation

narimiran
Member

There was a recent rightSize change in tables.nim, so the existing
value (4) was creating too large tables.


See #14995 (comment)

@Araq merged commit 14d16c2 into nim-lang:devel on Jul 23, 2020
@hashbackup

The real bug is in `slotsNeeded` in hashcommon.nim: `nextPowerOfTwo(count * 3 div 2 + 4)`

`slotsNeeded` should add 1, not 4 (no idea where 4 came from). Then requesting a 1-item table gives 2 slots, 2 items -> 4 slots, 3 items -> 8 (not ideal; it could be 4), 4 items -> 8, etc. A 4-item table needs an empty slot, so 8 is the smallest possible. As written, `slotsNeeded` yields 8 slots for a 1-3 item table, 16 for a 4-item table, and so on. It doesn't matter for large tables, of course, but for tiny tables it does.

Then in `mustRehash`, instead of `(t.dataLen - t.counter < 4)`, it should be `(t.dataLen - t.counter < 2)`. The reason is that `mustRehash` is called just before an insert, so there needs to be 1 free slot for the new item plus 1 free slot for a vacant entry to terminate missing-key searches.
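The arithmetic above can be sketched in Python (a model of the Nim formula, not the stdlib code; the `pad` parameter is mine, standing in for the constant under discussion):

```python
def next_power_of_two(x: int) -> int:
    # Smallest power of two >= x (models Nim's math.nextPowerOfTwo for x >= 1).
    p = 1
    while p < x:
        p *= 2
    return p

def slots_needed(count: int, pad: int) -> int:
    # Nim: nextPowerOfTwo(count * 3 div 2 + pad); pad is 4 today, 1 proposed.
    return next_power_of_two(count * 3 // 2 + pad)

# With pad = 4 (current): tiny tables are twice as large as needed.
print([slots_needed(n, 4) for n in (1, 2, 3, 4)])  # [8, 8, 8, 16]
# With pad = 1 (proposed): 1 item -> 2 slots, 2 -> 4, 3 -> 8, 4 -> 8.
print([slots_needed(n, 1) for n in (1, 2, 3, 4)])  # [2, 4, 8, 8]
```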

@narimiran
Member Author

@c-blake your opinion on the above? (You're the author of mustRehash, IIRC)

@c-blake
Contributor

c-blake commented Jul 23, 2020

I believe I left mustRehash completely as @Araq wrote it so many years ago, the 2/3 and the 4 included, though re-arrangement of which code is in which files might obscure that. :-) I did add the old `rightSize`, now `slotsNeeded`, though.

I agree that "minimal tables" are sometimes useful when you have zillions of small tables. I could see json being one such case. So, I am +1 on the idea of lowering the minimum size.

I would only say two things. A) Make sure that mustRehash and slotsNeeded are inverses in the proper sense, in that mustRehash(slotsNeeded(n)) == false, and that the boundary is tight; mustRehash(slotsNeeded(n) + 1) == true expresses that, I think. Maybe not. What I mean by "tight" is clear, though: never waste memory. I had some tests about that in Nim at some point, but I could not find them when I looked for them recently.
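That invariant can be checked without building tables at all. Here is a Python model of the proposed formulas (a sketch, with tightness phrased as "the next smaller power of two would already need a rehash"; the 0- and 1-item cases are excluded because they turn out to be special):

```python
def next_power_of_two(x: int) -> int:
    # Smallest power of two >= x (models Nim's math.nextPowerOfTwo for x >= 1).
    p = 1
    while p < x:
        p *= 2
    return p

def slots_needed(count: int) -> int:
    # Proposed: nextPowerOfTwo(count * 3 div 2 + 1)
    return next_power_of_two(count * 3 // 2 + 1)

def must_rehash(data_len: int, counter: int) -> bool:
    # Proposed: (t.dataLen * 2 < t.counter * 3) or (t.dataLen - t.counter < 2)
    assert data_len > counter
    return data_len * 2 < counter * 3 or data_len - counter < 2

for n in range(2, 1000):
    s = slots_needed(n)
    # slotsNeeded never produces a table that immediately needs a rehash...
    assert not must_rehash(s, n)
    # ...and is tight: the next smaller power of two would need one.
    if s // 2 > n:
        assert must_rehash(s // 2, n)
```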

And B) it might be possible to do even better than the idea here and have an empty table use no space at all (like seq[T]); maybe its first non-zero .data.len needs to be 2 in order to terminate searches with an empty slot. I don't know if the json stuff has any purely empty tables or not. Even if json does not, it is easy to imagine it, especially with zillions of instances, when the per-instance cost matters most. (We might be able to save HCell.sizeof * 2 * zillions.)

Beyond these two things, while I have your attention, I feel I should say that in my adix I was experimenting, successfully I think, with an entirely alternate resize protocol that better leverages great hash functions relative to key sets. The idea there is to measure search depth on every insert and only grow the table when that depth is greater than some rational number numer/denom multiplied by log2(data.len) (the natural scale of the worst case search in open addressed hash tables, Robin Hood or not).

For example, if you happen to have a great hash, like -d:nimHashInt1 and a dense sequence of int keys, one of @Araq's favorite examples, then you get to use "100% - 1" slots of data[] with only length 1 collision clusters! Pretty great, right?

Besides that search depth measurement and growth rule, you do need a "memory attack" block for cases where, for example, someone set hash(x)=1234 by accident. So, in addition to searches being "too deep/long" the table also has to be > 50% occupied. So, then a runaway expansion from a bad hash is impossible. Rust used something like this for several years (like 2016-2019) of high profile people banging on their table. I think at first they forgot the memory attack block, but you know...Buncha FOSS people learning their way as they go. :-)
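A rough sketch of that policy (the names, the numer/denom default, and the exact guard are illustrative, not adix's actual API):

```python
import math

def too_deep(depth: int, data_len: int, numer: int = 4, denom: int = 1) -> bool:
    # Grow when an insert probes deeper than (numer/denom) * log2(data.len),
    # the natural scale of the worst-case search in open addressing.
    return depth > (numer / denom) * math.log2(data_len)

def should_grow(depth: int, data_len: int, counter: int) -> bool:
    # The "memory attack" block: even a too-deep search only triggers growth
    # when the table is more than 50% occupied, so a degenerate hash
    # (e.g. hash(x) == 1234 for every x) cannot cause runaway expansion.
    return too_deep(depth, data_len) and counter * 2 > data_len

# 1024 slots: tolerate probes up to 4 * log2(1024) = 40 deep...
print(should_grow(41, 1024, 600))  # True
# ...but never grow a mostly empty table, however bad the hash is.
print(should_grow(41, 1024, 100))  # False
```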

Anyway, a hash over-tuned for expected keys that winds up wrong about realized keys can cause other problems and adix has other mitigations for that, but that is all a little independent of the growth policy which is what you asked about.

@hashbackup

As a double-check, I changed slotsNeeded and mustRehash in hashcommon.nim to:

proc mustRehash[T](t: T): bool {.inline.} =
  # If this is changed, make sure to synchronize it with `slotsNeeded` below
  assert(t.dataLen > t.counter)
  result = (t.dataLen * 2 < t.counter * 3) or (t.dataLen - t.counter < 2)

proc slotsNeeded(count: Natural): int {.inline.} =
  # Make sure to synchronize with `mustRehash` above
  result = nextPowerOfTwo(count * 3 div 2 + 1)

Then added this tiny table test to tables.nim:

  block: # test tiny table sizes
    for i in 0 ..< 10:
      var t = initTable[int, int](i)
      for j in 0 ..< i + 10:
        t[j] = j
        doAssert t.len == j+1
        doAssert not (1000 in t)
        doAssert j in t

It works. But if t.counter < 2 is changed to t.counter < 1, it hangs, because there is not a vacant slot for missing keys so the search for key 1000 never terminates.

What is harder to check is that it is resizing to the minimum possible size, because there is no t.slots to get the number of slots. I made t.data public so t.data.len is visible, added an echo to the test, and got this output (this is echoed after the insert occurs):

i=0 items= 1 slots=2 j=0
i=0 items= 2 slots=4 j=1
i=0 items= 3 slots=4 j=2
i=0 items= 4 slots=8 j=3
i=0 items= 5 slots=8 j=4
i=0 items= 6 slots=8 j=5
i=0 items= 7 slots=16 j=6
i=0 items= 8 slots=16 j=7
i=0 items= 9 slots=16 j=8
i=0 items= 10 slots=16 j=9
i=1 items= 1 slots=2 j=0
i=1 items= 2 slots=4 j=1
i=1 items= 3 slots=4 j=2
i=1 items= 4 slots=8 j=3
i=1 items= 5 slots=8 j=4
i=1 items= 6 slots=8 j=5
i=1 items= 7 slots=16 j=6
i=1 items= 8 slots=16 j=7
i=1 items= 9 slots=16 j=8
i=1 items= 10 slots=16 j=9
i=1 items= 11 slots=16 j=10
i=2 items= 1 slots=4 j=0
i=2 items= 2 slots=4 j=1
i=2 items= 3 slots=4 j=2
i=2 items= 4 slots=8 j=3
i=2 items= 5 slots=8 j=4
i=2 items= 6 slots=8 j=5
i=2 items= 7 slots=16 j=6
i=2 items= 8 slots=16 j=7
i=2 items= 9 slots=16 j=8
i=2 items= 10 slots=16 j=9
i=2 items= 11 slots=16 j=10
i=2 items= 12 slots=32 j=11
i=3 items= 1 slots=8 j=0   <--- this could be 4 slots but the math doesn't work
i=3 items= 2 slots=8 j=1   <--- and would have to be special-cased for 3 items
i=3 items= 3 slots=8 j=2
i=3 items= 4 slots=8 j=3   <--- here's where resize from 4 to 8 should occur
i=3 items= 5 slots=8 j=4
i=3 items= 6 slots=8 j=5
i=3 items= 7 slots=16 j=6
i=3 items= 8 slots=16 j=7
i=3 items= 9 slots=16 j=8
i=3 items= 10 slots=16 j=9
i=3 items= 11 slots=16 j=10
i=3 items= 12 slots=32 j=11
i=3 items= 13 slots=32 j=12
i=4 items= 1 slots=8 j=0
i=4 items= 2 slots=8 j=1
i=4 items= 3 slots=8 j=2
i=4 items= 4 slots=8 j=3
i=4 items= 5 slots=8 j=4
i=4 items= 6 slots=8 j=5
i=4 items= 7 slots=16 j=6
i=4 items= 8 slots=16 j=7
i=4 items= 9 slots=16 j=8
i=4 items= 10 slots=16 j=9
i=4 items= 11 slots=16 j=10
i=4 items= 12 slots=32 j=11
i=4 items= 13 slots=32 j=12
i=4 items= 14 slots=32 j=13
i=5 items= 1 slots=8 j=0
i=5 items= 2 slots=8 j=1
i=5 items= 3 slots=8 j=2
i=5 items= 4 slots=8 j=3
i=5 items= 5 slots=8 j=4
i=5 items= 6 slots=8 j=5
i=5 items= 7 slots=16 j=6
i=5 items= 8 slots=16 j=7
i=5 items= 9 slots=16 j=8
i=5 items= 10 slots=16 j=9
i=5 items= 11 slots=16 j=10
i=5 items= 12 slots=32 j=11
i=5 items= 13 slots=32 j=12
i=5 items= 14 slots=32 j=13
i=5 items= 15 slots=32 j=14
i=6 items= 1 slots=16 j=0
i=6 items= 2 slots=16 j=1
i=6 items= 3 slots=16 j=2
i=6 items= 4 slots=16 j=3
i=6 items= 5 slots=16 j=4
i=6 items= 6 slots=16 j=5
i=6 items= 7 slots=16 j=6
i=6 items= 8 slots=16 j=7
i=6 items= 9 slots=16 j=8
i=6 items= 10 slots=16 j=9
i=6 items= 11 slots=16 j=10
i=6 items= 12 slots=32 j=11
i=6 items= 13 slots=32 j=12
i=6 items= 14 slots=32 j=13
i=6 items= 15 slots=32 j=14
i=6 items= 16 slots=32 j=15
i=7 items= 1 slots=16 j=0
i=7 items= 2 slots=16 j=1
i=7 items= 3 slots=16 j=2
i=7 items= 4 slots=16 j=3
i=7 items= 5 slots=16 j=4
i=7 items= 6 slots=16 j=5
i=7 items= 7 slots=16 j=6
i=7 items= 8 slots=16 j=7
i=7 items= 9 slots=16 j=8
i=7 items= 10 slots=16 j=9
i=7 items= 11 slots=16 j=10
i=7 items= 12 slots=32 j=11
i=7 items= 13 slots=32 j=12
i=7 items= 14 slots=32 j=13
i=7 items= 15 slots=32 j=14
i=7 items= 16 slots=32 j=15
i=7 items= 17 slots=32 j=16
i=8 items= 1 slots=16 j=0

Sorry for the length, didn't know a good way to verify the resizing behavior.

@hashbackup

hashbackup commented Jul 23, 2020

I added this test to tables.nim (this was preliminary; see actual test 8 posts down):

  block: # test that mustRehash and slotsNeeded are matched
    for i in 0 ..< 10:
      var t = initTable[int, int](i)
      for j in 0 ..< 100:
        if (slotsNeeded(t.len) > t.slots) == t.mustRehash:
          discard
        else:
          echo "fail 1: ", i, ' ', j, ' ', slotsNeeded(t.len), ' ', t.slots, ' ', t.mustRehash
        if (slotsNeeded(t.len) == t.slots) == not t.mustRehash:
          discard
        else:
          echo "fail 2: ", i, ' ', j, ' ', slotsNeeded(t.len), ' ', t.slots, ' ', t.mustRehash
        t[j] = j

Those two ifs would become doAssert. There are 2000 tests and 42 failures that I'm going to look at.

@c-blake does this test look right? I had to add a slots proc, and since mustRehash uses the current table slots and counts, it's a little weird.

@c-blake
Contributor

c-blake commented Jul 23, 2020

Well, I don't know why you are even calling initTable and t.len and all that. I would think you can just call slotsNeeded (formerly rightSize) and mustRehash. That's what my old test did. It's probably in git history...

Another thing I should say (maybe implicit in my prior comment) is that there is absolutely NO WAY for Table or HashSet to magically figure out what space-time trade-off client code will want. I know that @Araq totally knows/gets this to his bones, but it bears mention here. 2/3 is just a guess; 3/4, 7/8, or 1/2 would be too. Growth by 2x is also a guess; Python used to grow 4x each round. I have seen all of the above and more. This is all guesswork, and the current situation is not parameterized at all. No expert knobs, just some not-too-bad heuristic setting. A bunch of knobs surely creates a bunch of questions and explanations, of course, and we're in a tizzy over dup keys lately. So, beats me what's best. I included a bunch of expert knobs in adix, but I also have like 0 users.

Since we are talking about optimizing the way growth works, someone should mention these expert knobs again. Many language runtimes don't even let you presize (Python, for example), which can make a pretty big perf delta. And the Nim stdlib currently doesn't "autoshrink" a table, or even let client code interrogate its load factor to rebuild a denser one. So, if you delete a bunch of stuff and then iterate over the table many times, you will not have great performance. Right now deleters have to "shadow" the table population as they add/set/del with a wrapper object, since data.len is not available. Not great.

@hashbackup

Well, I don't know why you are even calling initTable and t.len and all that. I would think you can just call slotsNeeded (formerly rightSize) and mustRehash.

mustRehash doesn't take args; it uses the current table size and counts.

The 2nd if should have used <=, not ==. Now there are 6 failures and they are for cases where there are 0 or 1 items in the table - probably special cases because of the + 2 in mustRehash:

fail 1: req=0 len=0 slots=1 needed=1 rehash=true
fail 2: req=0 len=0 slots=1 needed=1 rehash=true
fail 1: req=0 len=1 slots=2 needed=2 rehash=true
fail 2: req=0 len=1 slots=2 needed=2 rehash=true
fail 1: req=1 len=1 slots=2 needed=2 rehash=true
fail 2: req=1 len=1 slots=2 needed=2 rehash=true

@c-blake
Contributor

c-blake commented Jul 23, 2020

I feel like mustRehash used to take the cur size + 1 or something. Maybe that's why they deleted my tests. ;-)

@hashbackup

hashbackup commented Jul 23, 2020

See new version of tiny table test below.

It doesn't detect changing the +2 to +1 in mustRehash, probably because of the >= 2 guards, but it does assert if +2 is changed to +3, or if the *2/*3 factors are changed.

@c-blake
Contributor

c-blake commented Jul 23, 2020

That's looking better. Maybe mustRehash should use + 1? Or is there otherwise some off-by-one between the two directions, from discreteness? Just brainstorming. I haven't thought much about the edge cases since I first wrote rightSize. As I mentioned, we may want to leap from 0 to 2 slots, then grow from 2 on up.

@hashbackup

hashbackup commented Jul 24, 2020

+1 does not work if size requested is 0 or 1.

For the 0 case, the initial size will be 1. When mustRehash is called to add a new item, 1-0 is not less than 1 so no resize. But you cannot add an item to a 1-slot table because search will hang on missing keys.

For the 1 case, the initial size will be 2. Before first insert, in mustRehash 2-0 is not less than 1 so no resize - correct. On the 2nd insert, 2-1 < 1 is still false, so no resize again. But this is incorrect, because there would be no vacant slot and you'd get a hang again.
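Both failure cases can be traced numerically with a model of the proposed mustRehash (Python here just for the arithmetic; `min_free` is my name for the free-slot threshold under discussion):

```python
def must_rehash(data_len: int, counter: int, min_free: int) -> bool:
    # Model of: (t.dataLen * 2 < t.counter * 3) or (t.dataLen - t.counter < min_free)
    return data_len * 2 < counter * 3 or data_len - counter < min_free

# Size-1 table (request 0), inserting the first item with min_free = 1:
# 1 - 0 = 1 is not < 1, so no resize, but a full 1-slot table has no
# vacant slot, and missing-key searches would never terminate.
assert not must_rehash(1, 0, 1)   # no grow: wrong, search hangs
assert must_rehash(1, 0, 2)       # min_free = 2 forces the grow

# Size-2 table (request 1), inserting the second item:
assert not must_rehash(2, 1, 1)   # no grow: table becomes full, hang again
assert must_rehash(2, 1, 2)       # min_free = 2 grows to 4 slots first
```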

See the test added below for tiny tables:

But if t.counter < 2 is changed to t.counter < 1, it hangs

Also:

Then in mustRehash, instead of (t.dataLen - t.counter < 4), it should be (t.dataLen - t.counter < 2). The reason is that mustRehash is called just before an insert, so there needs to be 1 free slot for the new item plus 1 free slot for a vacant entry to terminate missing key searches.

@hashbackup

After thinking about it, it may be a good idea to change the items requested in json.nim all the way down to 1. Then if someone uses lots of single-item objects, there will only be two slots instead of 4 (when 2 items are requested on init).

I tested this by modifying json.nim to use (2), as committed, and the old hashcommon:

ms:json jim$ /usr/bin/time -l ./test
(x: 0.5004211581104818, y: 0.4996713815680409, z: 0.500270481798719)
        2.97 real         2.54 user         0.41 sys
1055465472  maximum resident set size

With json(1) and new hashcommon:

ms:json jim$ /usr/bin/time -l ./test
(x: 0.5004211581104818, y: 0.4996713815680409, z: 0.500270481798719)
        2.94 real         2.58 user         0.35 sys
 896163840  maximum resident set size
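For scale, the two peak-RSS figures above work out to roughly a 15% saving:

```python
# Peak RSS from the two /usr/bin/time -l runs above, in bytes.
before = 1_055_465_472   # json(2), old hashcommon
after = 896_163_840      # json(1), new hashcommon
print(f"{(before - after) / before:.1%} less peak memory")  # 15.1% less peak memory
```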

@hashbackup

hashbackup commented Jul 24, 2020

The test for slotsNeeded matching mustRehash should be this (gotta still do the insert):

  block: # test that mustRehash and slotsNeeded are matched
    for i in 0 ..< 100:
      var t = initTable[int, int](i)
      for j in 0 ..< 1000:
        if i >= 2 or j >= 2:
          doAssert (slotsNeeded(t.len) > t.data.len) == t.mustRehash
          doAssert (slotsNeeded(t.len) <= t.data.len) == not t.mustRehash
        t[j] = j

@c-blake
Contributor

c-blake commented Jul 24, 2020

I found my old test. I guess TC removed it, and I neglected to restore it in my reversal of all his pseudorandom probing churn back in February/March: 8c22518.

Seems that was more one-sided anyway, not checking "tightness"/non-wasting of space. So maybe there should be two loops, one checking one side and the other checking the other, or something like your test above.

Honestly, I think moving mustRehash back to taking parameters, and then having a simple formal relationship between mustRehash() and slotsNeeded() as inverses (or maybe "tight" semi-inverses is the right term), would be the clearest way to re-org this; or maybe with some off-by-one aspect in there, if you want mustRehash to take the pre-incremented counter. Besides being clearer, you would not need to create tables to test the relationship.

@c-blake
Contributor

c-blake commented Jul 24, 2020

Also, since we do not know how big data slots are, the idea of a data-allocated-on-first insert should not be forgotten. But it can just jump to 2 slots.

This change has broader implications, like checks on lookups for an empty table. (It might have a tiny impact on larger tables, since they always have to check to support the possibility that nothing is there, and people do seem to care more about even tiny perf deltas on large tables than about memory-optimizing tiny ones.)

@hashbackup

Honestly, I think moving mustRehash back to taking parameters

One thing I like about passing the table to functions like mustRehash is that, if the decision becomes more complicated, mustRehash will have access to all the data needed. You do this in adix. So if loadFactor, maxProbeDepth, etc. are added as initTable options, mustRehash will have access to them without adding new parameters.

This change has broader implications like checks on lookups for an empty table

What implications? These checks are already present in the table code. For lookups, it checks to see if t.data.len is 0 and returns false (lookup failed). For inserts, it calls checkIfInitialized before calling mustRehash, and if t.data.len is 0, it calls initTable with the defaultInitialSize.

the idea of a data-allocated-on-first insert should not be forgotten. But it can just jump to 2 slots.

This already happens, as explained, though instead of calling initTable with 1 (which would create a 2-slot table), it uses defaultInitialSize. That's probably right for most use cases, though for the JSON case, creating a 64-slot table may not be ideal. So I had to test that :-) by changing checkIfInitialized to create a 1-item table instead and running the JSON benchmark. It didn't make any difference.

@c-blake
Contributor

c-blake commented Jul 24, 2020

These checks are already present in the table code.

I stand corrected. :-)

It's true that having a handle on the table gives you more options. But adix also resizes based on search depth, not on a load-factor guess as a proxy for search depth. So there, this slotsNeeded-mustRehash inverse relation is not important. My re-org idea was really contingent on staying with the current "control bad perf with a load-factor guess" approach instead of "bound it directly, but with harder-to-predict space usage". No one really responded to that idea. :-)

@c-blake
Contributor

c-blake commented Jul 24, 2020

Anyway, a TLDR for @narimiran - I am 100% behind a 0/2-slot minimum table, and we seem on the way to having a good, smooth, minimal growth policy in the old control-the-load-factor domain, but there are many possibilities related to how smart/expert we want stdlib tables to be.

@narimiran narimiran deleted the json-smaller-init-size branch December 6, 2020 20:35
mildred pushed a commit to mildred/Nim that referenced this pull request Jan 11, 2021