Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TMBitset checklist #245

Closed
42 of 44 tasks
yakra opened this issue Apr 10, 2023 · 12 comments · Fixed by TravelMapping/DataProcessing#592
Closed
42 of 44 tasks

TMBitset checklist #245

yakra opened this issue Apr 10, 2023 · 12 comments · Fixed by TravelMapping/DataProcessing#592

Comments

@yakra
Copy link
Owner

yakra commented Apr 10, 2023

Benchmark:

  • Route::clinched_by_traveler: branching vs indirection
    Branching is faster.
  • Graph generation with different units
    Bigger is faster.
  • Explicitly inline matching_vertices_and_edges (Change name? It finds travelers too)
    Slightly faster on all machines except lab2. (Noise?)
  • Try branchless variant of eaa6:
    for (size_t index = 0; index < traveler_lists.size(); ++index)
    	code[index/4] += clinched_by(traveler_lists[index]) << index%4;
    Branching is faster.
  • Revert eaa6; apply 4e05 directly to prev commit
    For Computing Stats, 96e9...
    • is THE top performer so far for lab1, lab3, lab4.
    • outperforms 4e05 on BiggaTomato.
    • on lab2 lags behind top performer 4e05 by only 3.8 ms. Just noise? A very tight race here.
  • 5c76 inconclusive, but does appear slightly slower for userlogs, very consistently. Try:
    • comparing against 6922 for a more apples-to-apples comparison.
    • comparing against the final selected commit once all the dust settles (see 4a64).
    • Out of curiosity, is there a diff if I build with GCC instead of clang? Yes.
  • What if traveler_lists is nixed altogether?
    As we iterate thru traveler_set, just keep a count instead of creating a vector.
    The trade-off is doing one more iteration thru traveler_set at the end of the traveled graph for traveler names.
    Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations?
    Preliminary results: helps on BiggaTomato; hurts on lab1. Be interesting to see results on bsdlab, at higher thread counts, and after doing more work on the RAM bandwidth bottleneck.
    Final: No speed advantage. Leaving as-is. Maybe re-examine after more RAM bandwidth improvements are implemented.

Try out:

  • What happens to traveled graphs if a devel system comes before concurrent a/p in systems.csv?
  • How much does initial HGEdge construction slow down if I force an active/preview canonical HighwaySegment?
    ~ 0.01 - 0.02 s on BiggaTomato -- 0.9 - 1.9 % more time.
  • Simple iteration optimization (with different units)
    The trade-off: Larger units = skip back farther but less often.
    • 64-bit is clearly suboptimal.
    • 8, 16 & 32-bit all very close to margin of error. Maybe a slight edge for 32-bit; makes sense to use it anyway because good |= performance.
  • Complex iteration optimization
    • 8-bit f749 underperforms 8-32-bit simple iteration on lab{1..4}; slight lead on BiggaTomato.
    • 16-bit ec4e is 1st place for BiggaTomato & lab1; underperforms f749 & even 64-bit f8cf on lab2.
      Other machines TBD.
      • [bits >> 1] solution
      • ternaries
        Both underperform ec4e on BiggaTomato & lab1. Successively better on lab2 though, with ternaries 1st place overall & [bits >> 1] falling between 16 & 32-bit SIO2. Other machines TBD.
    • How much time does the 15-bit lookup table take to construct? Does this outweigh the advantage of using 16-bit ComItOpt?
      • About 0.3 ms on BiggaTomato.
      • N/A -- there isn't an advantage to using 16-bit ComItOpt.
    • LOL what if I constexpr the damn thing by brute force? Never mind. Nothing to be gained by doing this.
    • 32k is 1/16 of 512k. Will ec4e perform poorly on Epoch? No. Performs well; outperforms SIO2. Similar to BiggaTomato.
  • Pointer punning |=
    Try simplified versions for larger units:
    • 64-32-16-bit
    • 64-16-bit
    • 64-32-bit
  • Replace TravelerList::traveler_num with one
    unsigned int* traveler_num = new unsigned int[TravelerList::allusers.size()]; per thread.
    Index via for (TravelerList *t : traveler_lists) traveler_nums[t-TravelerList::allusers.data()] = travnum++;
  • A variant of eaa6 with TMBitset [] operator? LOLNOPE. Different vectors (everything vs subset); indices don't match up.

Kinda both:

  • Init TravelerLists at beginning; init segments with size not capacity
  • Or better yet, init segments with TravelerList::ids.size()
    Segments set via TMArray::size, set via TravelerList::ids.size()
  • Timestamp for first task, with & without. Python Too?
  • TMArray<TravelerList> means threaded construction in place (via placement new) without separate read_list function.
  • TMBitset<HGVertex*> #251

Clean up:

  • fix comment
  • Constexpr the static edge format constants
  • Template specialization
  • As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp
  • Extraneous else after continue; parens can be clarified
    else if ((w->vertex->incident_c_edges.front() == w->vertex->incident_t_edges.front()
    && w->vertex->incident_c_edges.back() == w->vertex->incident_t_edges.back())
    || (w->vertex->incident_c_edges.front() == w->vertex->incident_t_edges.back()
    && w->vertex->incident_c_edges.back() == w->vertex->incident_t_edges.front()))
    new HGEdge(w->vertex, HGEdge::collapsed | HGEdge::traveled);
  • extraneous visibility == 1 check just before that
  • (4db4) maxbits less useful in SimItOpt2. Delete it; replace with 8*sizeof(unit); let -1 and +1 cancel out.
  • bits should be unsigned char in pun branch. Dumb luck that it ran without errors. Fixed for ComItOpt.
    Never mind. Using a branch that retains unit instead.
  • uint8_tetc.
  • Explicit instantiation?
  • Inlining MV&E (2d1b) means #include "../../templates/TMBitset.cpp" not needed in HighwayGraph.h. Can lose the include guard too LOL. 🤠
  • segments SQL table: only iterate clinched_by for active/preview systems
  • (c8b9) Check for diffs due to constant folding: 8/sizeof(unit)
  • Comment for != and |= operators
  • Review TMBitset variable names
  • Cast (unit)1 before <<, lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch.
  • (4db4) Switch: are additions in for loop reordered? Make it look pretty!
  • nix () and add_value until needed
@yakra
Copy link
Owner Author

yakra commented Apr 12, 2023

find crashes via

for file in sulogs/2023-04-06/siteupdate_{a423,4448,8fba,8d55,7828,96e9,2d1b,5c76}*; do
  echo -e "$file\t" `tail -n 1 $file \
  | sed 's~\[[0-9.]*\]~~' \
  | tr -d .`; done \
| grep -v 'Total run time:' \
| sed -r 's~.*siteupdate_(....)-.*~\1~' \
| uniq

@yakra
Copy link
Owner Author

yakra commented Apr 16, 2023

dead branches

  • SimItOpt2 with sign bit bug before reset:
    • dd1b6e0 long
    • a8a3db9 int
    • 9280ec4 short
      OK, not dead, but deprecated binary & logs. Hard reset here; bugfix commit added
  • tlist_canon @ d37dfe0979bc5a470dad5851147850503d096bdc

@yakra
Copy link
Owner Author

yakra commented Apr 23, 2023

commits

c incl? descr comments
a5a7 yes contig RAM Prerequisite for everything else
ecff yes TMB 1st draft The main point of all this :)
d704 yes t_l via TMB Small speedup. Saves RAM. Potential for improvements via iteration and |= optimizations.
a009 yes t_l list->vec Not slower. The Right Thing To Do. Saves RAM. Locality. Is index used in future commits? Only in eaa6, which won't be used as it hurts performance.
What if traveler_lists is nixed altogether, and we just iterate traveler_set one more time at the end? Or, what if, instead of doing reallocations/copies, we just reserve(TravelerList::allusers.size()) after declaring?
Re-examine nixing in future.
91a0 yes !extr include Housekeeping. Rebase earlier?
7ce7 yes userlog warn Unrelated, and doesn't appear to affect performance any, fnord. Rebase earlier?
f977 no t_l as ref Pass a bool instead; see 4504.
Nix extraneous traveler_lists creation for master graph.
eaa6 no nix tl::tn Hurts performance. See 9fdc et al instead.
4e05 no avoid iterate Child of eaa6. See 96e9 even though it's the same diff. :P
6922 no ctor / init Any changes to performance? See next commit. No: See 7415.
5c76 no integral type Userlogs slightly slower? Test again once dust settles? No: See 4a64.
b4ab no names & order Arg order just makes sense. Name changes are more related to prev commit. No: See c896.
e61a no ++u No change to binary. Done on its own in some future commits, including 47a7.
a423 yes branching CBT Faster userlogs. Rebase earlier?
Maybe try again once HighwaySegments are stored sequentially?
4448 no short unit branch; affects |= only. Prefer flavors with iterator optimization.
8fba no int unit branch; affects |= only. Prefer flavors with iterator optimization.
8d55 no long unit branch; affects |= only. Prefer flavors with iterator optimization.
7828 no no-branch CBC Hurts performance
96e9 yes avoid iterate 4e05 cherrypicked. Reorder additions in for loop; make it look pretty.
2d1b yes inline mv&e Slight speedup, simpler prototypes & includes
f28c no long bugfix unit branch; affects |= only. Prefer flavors with iterator optimization.
953a no ! unit param Not strictly necessary for pointer-punning |=, but cleans things up.
No: Different units may be useful for scenarios other than clinched_by. E.g., uint64_t may work well for HGVertex subgraph membership.
03ec no punning |= Evaluate WRT straight-up unit |=. Prefer flavors with iterator optimization; 3 are simplified for 16/32-bit units; see below.
7958 no SimItOpt1 SimItOpt2 outperforms this.
3910 no SimItOpt2 32 > 16 > 8-bit for iterating clinched_by. Additionally, allows for better |= optimization.
1be1 no SIO1 short SimItOpt2 outperforms this.
87e4 no SIO1 int "
2cbb no SIO1 long "
9280 no SIO2 short 32 > 16 > 8-bit for iterating clinched_by. Additionally, allows for better |= optimization.
9347 no sign bit fix "
acc9 yes unsigned int "
8269 no unsigned long Performs worse than 8/16/32-bit everywhere except epoch.
f8cf no ulong bugfix "
f749 no ComItOpt Evaluate against SimItOpt. All tasks. Try with per-unit |=. <--Never mind.
No: Underperforms SIO2 on all lab machines (but not BiggaTomato & epoch though those aren't the target). KISS principle.
9fdc no trav_nums May not do much on its own. Try again after other speedups are in place. See 8be1.
bc9d no TMB it idx Pro: No arithmetic for clinchedby_code. Con: Still adding b.array+index everywhere else (deref).
0b69 no iterator ptr Pro: No arithmetic for dereferencing. Con: CBC subtracts out.
Use 4db4 instead.
^ ^^ ^ ^^ Both are superior to 9fdc, which adds for everything, then subtracts out for CBC.
Test both sans 9fdc. Which gets called more?
• 69% clinched_by in clinchedby_code @ 36,193,165, which wants indices.
• 31% in everything else @ 16,024,183, which want pointers.
(8,000,360 clinched_by in stats & SQL + 23,463 traveler_set in MV&E)
So on paper, bc9d wins. But in practice we're within margin of error, with 0b69 faster on the 4 non-Ubuntu (coincidence?) machines.
If foregoing or at least deferring 9fdc because no clear benefit, it makes sense to use 0b69.
ec4e no 16-bit it opt Wins on epoch, BT & lab1; lags behind SimItOpt2 everywhere else, even before accounting for lookup array setup. Constexpr would be expensive for program & code size. Ugly changes to code, a namespace I wouldn't otherwise need.
16bb no [bits >> 1] Only needed for ComItOpt, which looks like it's not going to pan out. Only performs better on lab2, so within margin of error at best.
2d94 no ternaries 1st place on lab2 by 0.5 ms & lab3 by 0.7; lags behind SimItOpt2 everywhere else except BT, where it's 2nd place behind ec4e, and more arguably epoch, where it's slower than 8 & 16-bit, but faster than 32-bit (which looks like it'll be the selected alternative). But again, BT & epoch aren't target machines. Keep it simple.
fc17 no punning SIO2 Bigger is better.
72e2 no 64-32-16 pun "
c310 no 64-16-bit pun "
47a7 yes 64-32-bit pun Bigger is better. Fewest branches & loops. 32-bit is fastest for iteration too.
4db4 yes iterator ptr Faster.
4504 yes "0" Rebase earlier
7f87 yes trav Better on paper; != op replaced with just sending an existing variable. Rename variable to is_traveled for readability.
05d7 no reserve No definitive diff; no-build may be faster.
8be1 ... TL::nums No apparent diffs. Revisit when more RAM bandwidth optimizations are in place?
c896 yes arg order Measures faster, but probably just noise.
7415 yes ctor / init Keep the initialization part; no diffs to binary from that. Array start & data pointer go from 8 to 24 B apart, increasing probability of being on different cache lines. That's got to explain the drop in UserLog performance. Therefore, reorder to get them consecutive again; see 01d6.
4a64 yes integral type This time, no change to binary from 7415.
6da9 no alignas(32) Increases sizeof(HighwaySegment) from 112 to 128. Does help UserLog performance on all machines, but hurts most other tasks on most machines. Combined, still a slight advantage overall but future commits do better...
01d6 yes reorder membr start and data adjacent again, restoring old same/different cache line probability. Data occasionally on different cache lines appears to be outweighed by not increasing sizeof(HighwaySegment), except for bandwidth-constrained machines BiggaTomato, lab1 & bsdlab. 01e7 makes this a non-issue...
01e7 yes cbt_index Not only avoids reading start (potentially from another cache line), avoids redundant recalculation of the index. A clear advantage over all prev commits on all Linux boxes.
48c8 yes part add_idx add_clinched_by takes index, calculated once per chopped route, in store_traveled_segments.
28af no local_index No clear benefit; more diffs. Keep it simple.
c8b9 yes const fold readability
adcf no 1 index/trav No clear benefit; more diffs. Keep it simple.
6634 yes whitespace
9536 yes constexpr no diff to binary on BiggaTomato
ecef yes ST aug index

@yakra
Copy link
Owner Author

yakra commented May 1, 2023

Functions, operators, etc.

  • insert
    No real changes, because constant folding
    all via HighwaySegment::add_clinched_by:
    • ConcAug
    • ReadList via store_traveled_segments
  • operator ()
    No real changes, because constant folding
    • UserLog: Route::clinched_by_traveler
  • operator !=
    Only changes in 953a (nix unit parameter) & children (ComItOpt et al): multiply by sizeof(unit) before memcmp
    • Graphs: edge compression
  • operator |=
    Type-punning variants
    • Graphs: compiling traveler lists
  • iteration
    Simple, 8 & 16-bit complex (and variants), unit choice, store index or pointer
    • CompStats
    • Graphs:
      • creating traveler_lists vector, or getting item count & writing traveler names to last .tmg line
      • HighwaySegment::clinchedby_code
    • SQL: segments table

@yakra

This comment was marked as resolved.

@yakra
Copy link
Owner Author

yakra commented May 14, 2023

HTML

  • TMB_CBC.html
    4db4 store pointer, 2effinline, 4504 "0", 7f87 trav.
    Yes.
  • TMB_TravNums.html
    No-build, contig TLs, t_l as ref, 9fdc trav_nums, bc9d TMB it idx, 0b69 iterator ptr.
    0b69 cherrypicked -> 4db4.
    TBD: Unsure what to do WRT trav_nums yet.
  • TMB_inline.html
    f977 t_l as ref, 2d1b inline mv&e, 4db4 store pointer, 2eff merge, d0fanixtl branch, 13b9 merge.
    Yes.
  • TMB_nixtl.html
    4db4 store pointer, edb4 nix t_l, aecc "0", d0fa notrav, 13b9 inline
    Yes? Contradicts nixtl2.
  • TMB_nixtl2.html
    4504 "0", 13b9 nixtl, 7f87 trav (PreAlpha), 7f7d trav(nixtl)
    No (Contradicts nixtl). May re-examine in future.

@yakra
Copy link
Owner Author

yakra commented May 15, 2023

hidden/deleted branches

comm branch DoWhat comment
ecef ConstFold delete TMBPreAlpha FFWDed
adcf ElStacko hidden
28af add_index hidden
89bd cbt_index delete TMBPreAlpha FFWDed
01d6 reorder delete merged
6da9 alignas hidden
8be1 thread_local hidden re-examine in future
05d7 reserve hidden
660c [stash] hidden meta-iteration (CompStats)
6a83 [stash] hidden meta-iteration (traveler names)
7f7d nixtl hidden re-examine in future
47a7 64-32 delete parent of TMBPreAlpha & nixtl
c310 64-16 hidden
72e2 64-32-16 delete parent of 64-16
fc17 SIO2pun hidden
2d94 ternaries hidden
16bb bit1 delete parent of ternaries
ec4e ComItOpt16 delete parent of ternaries
0b69 return_p hidden
bc9d idx hidden
9fdc nixon delete parent of return_p
f749 ComItOpt delete parent of ternaries
f8cf SimItOpt2 hidden
2cbb SimItOpt hidden
03ec pun delete parent of ternaries
54fb issue588 hidden use this
f28c unit hidden
2d1b inline delete merged
96e9 switch delete merged
7828 cbycode hidden
a423 cbt_br delete merged
e61a TMBitset hidden Nix TL::tn dropped; the rest redone manually downstream

@yakra yakra mentioned this issue May 27, 2023
17 tasks
@yakra
Copy link
Owner Author

yakra commented May 29, 2023

Alpha ToDo list

Solo items

C Descr Comments
54fb issue 588
91a0 !extr include
7ce7 userlog warn
96e9 avoid iterate
ToDo Timestamp for first task (Get list of travelers in the system). Python Too.
ToDo Constexpr the static edge format constants.
v Extraneous else after continue; parens can be clarified
^ extraneous visibility == 1 check just before that

traveled_tmg_line & traveler_lists cleanup

C Descr Comments
4504 "0" HGEdge, HighwaySegment. HG.cpp: 2 tmg_line, 3 size->travnum
7f87 trav

contig RAM group

C Descr Comments
ToDo Get TMArray sorted. TMArray means threaded construction in place (via placement new) without separate read_list function.
a5a7 contig RAM
ToDo fix comment
ToDo TravelerList dtor, to delete traveler_num

TMB group

C Descr Comments
ecff TMB 1st draft Does ctor init vars in order? Don't include siteupdate.cpp changes; see below.
Nope init segments with TravelerList::ids.size() Never mind. TMArray<TravelerList>::size.
a423 branching CBT
acc9 unsigned int
47a7 64-32-bit pun
4db4 iterator ptr
c896 arg order
7415 ctor / init
4a64 integral type
01d6 reorder membr
01e7 cbt_index
48c8 part add_idx
c8b9 const fold
6634 whitespace
9536 constexpr
ecef ST aug index
done nix () and add_value until needed Done as part of c8b9
ToDo Cast (unit)1 before << ...lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch.
ToDo Review TMBitset variable names Also reduce arithmetic in iterator construction.
ToDo Comment for != and |= operators
ToDo segments SQL table: only iterate clinched_by for active/preview systems
ToDo uint8_tetc.
ToDo As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp
ToDo Template
specialization
ToDo Explicit instantiation?

t_l via TMB group

C Descr Comments
d704 t_l via TMB
2d1b inline mv&e Delete that extra line of whitespace from HighwayGraph.h as in 4504
ToDo Don't include TMBitset
4504 "0" GetSubData. HG.cpp: assign traveler numbers
a009 t_l list->vec

@yakra
Copy link
Owner Author

yakra commented Jun 29, 2023

partial specialization experiments
deleted stash @ 484be5074e68f31db790cc53b34cf2494f6ff114

@yakra
Copy link
Owner Author

yakra commented Jul 9, 2023

Benchmark commits:

  • WIP:
    HD @ 59afa75
    UD @ 2126e76
  • Final:
    HD @ 40c2356
    UD @ 5b071e5

@yakra
Copy link
Owner Author

yakra commented Jul 9, 2023

List

Conc
Stat
User

Comp
Graph
Graph_Linux
Total
ErCh
ErCh_zoom

@yakra
Copy link
Owner Author

yakra commented May 20, 2024

deleted branches (BiggaTomato)

g_db

  • g_db comments out DB here and here.
  • g_trav comments out everything specific to traveled graphs in mv&e, write_master and write_sub, except for assigning traveler numbers. HighwayGraph.cpp only.
  • g_cb comments out traveler_lists population
  • 990a288 adds {}// at the beginning of this line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant