
subgraph speed/scaling alternatives #147

Closed
yakra opened this issue Jan 12, 2021 · 3 comments · Fixed by TravelMapping/DataProcessing#404

yakra commented Jan 12, 2021

Memory bandwidth alternatives

| short name | commit | parent | diff | description / comments |
|------------|--------|--------|------|-------------------------|
| TMGi | bfe86f0 | fc400d9 | fixed | reduce string copies when writing graphs (inline)<br>Components written directly to file by `HighwayGraph::write_*graphs_tmg` instead of concatenating into a long `std::string`<br>`*_tmg_line` functions eliminated completely |
| TMGf | 950400c | fc400d9 | OK | reduce string copies when writing graphs (function)<br>Components written directly to file by `HighwayGraph::write_*graphs_tmg` instead of concatenating into a long `std::string`<br>`*_tmg_line` functions retained |
| TMGf2 | 54e5a7d | TMGf | OK | allocate `fstr` once and pass, rather than reallocate thousands of times |
| TMGf3 | f93ed18 | TMGf2 | fixed | edge labels: instead of `std::string` construction & copying, just insert its components into the ofstream |
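
In rough terms, the change these rows describe (a minimal sketch; the names and signatures are illustrative, not the actual siteupdate code):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Before (no-build): each *_tmg_line function built a std::string, and the
// writer concatenated thousands of them into one long string before writing.
void write_vertex_old(std::string& graph, const std::string& label, double lat, double lng)
{	graph += label + ' ' + std::to_string(lat) + ' ' + std::to_string(lng) + '\n';
}

// TMGi/TMGf: insert the components straight into the ofstream, so no long
// intermediate std::string is constructed or copied.
void write_vertex_new(std::ofstream& tmg, const std::string& label, double lat, double lng)
{	tmg << label << ' ' << lat << ' ' << lng << '\n';
}

// TMGf2: the caller allocates one scratch buffer and passes it down, rather
// than every call reallocating its own.
void write_vertex_f2(std::ofstream& tmg, char* fstr, const std::string& label, double lat, double lng)
{	sprintf(fstr, " %.15f %.15f", lat, lng);
	tmg << label << fstr << '\n';
}
```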

CPU time alternatives

| short name | commit | parent | diff | description / comments |
|------------|--------|--------|------|-------------------------|
| v | ToDo | f181bf5 | | sets -> lists & bools: vertices |
| e | bb4e450 | f181bf5 | OK | sets -> lists & bools: edges<br>(shortname formerly hge) |
| t | ToDo | f181bf5 | | sets -> lists & bools: travelers |
| ve | 22cf13b | 1c204a8<br>rebase? | OK | sets -> lists & bools: vertices, edges<br>(shortname formerly mv) |
| vt | ToDo | f181bf5 | | sets -> lists & bools: vertices, travelers |
| et | ToDo | f181bf5 | | sets -> lists & bools: edges, travelers |
| vet | 77af90e | 22cf13b | OK | sets -> lists & bools: vertices, edges, travelers<br>(shortname formerly mt) |

Rubbish bin (epoch unless otherwise noted)

old TMGi was 9cdc37d
old TMGf3 was 9bcee48
old TMGr was 48797e1: old TMGf3 (9bcee48) rebased onto vet (77af90e)
TMGi bugfix (0993a1b on epoch) squashed
TMGf3 bugfix (e73a1bf on epoch) squashed
TMGr bugfix (c1d7eaf on epoch) squashed
etF: cherry-pick e before rebasing onto TMGf; squashed: df4567a
etF before interactive rebase (2nd cherry-picked copy of unused vertex vars removal): 6038cf3
Python attempt at 90edd7a on BiggaTomato. Very little difference (~2.3 s?), probably just noise. Going no-build.


Mix & Match

|          | no-build | v | e | t | ve | vt | et | vet |
|----------|----------|---|---|---|----|----|----|-----|
| no-build | fc4009d | c200686 | bb4e450 | 7a67ce4 | 22cf13b | e3e708e | a08dd6c | 77af90e |
| i | bfe86f0 | x | x | x | x | x | x | x |
| i3 | cd244c5 | | | | | | | |
| f1 | 950400c | 8c10540 | 661a59b | a8b269a | 584c37b | 27f909f | 3e41a76 | 824e2b8 |
| f2 | 54e5a7d | 1afe9a8 | 758f082 | dcfb20d | aedc633 | d2b66ce | c87a8e2 | ab9c7a5 |
| f3 | f93ed18 | 3b94f38 | 34aecf6 | f5fb4c1 | f38426d | 94130de | 9f33c1a | 11a9f42 |
```sh
cd ytm/DataProcessing/siteupdate/cplusplus/
git fetch
git checkout c200686; make siteupdate; mv siteupdate siteupdate_v
git checkout 7a67ce4; make siteupdate; mv siteupdate siteupdate_t
git checkout e3e708e; make siteupdate; mv siteupdate siteupdate_vt
git checkout a08dd6c; make siteupdate; mv siteupdate siteupdate_et
git checkout 3b94f38; make siteupdate; mv siteupdate siteupdate_vF3
git checkout f5fb4c1; make siteupdate; mv siteupdate siteupdate_tF3
git checkout 94130de; make siteupdate; mv siteupdate siteupdate_vtF3
git checkout 9f33c1a; make siteupdate; mv siteupdate siteupdate_etF3

# _v _t _vt _et _vF3 _tF3 _vtF3 _etF3
```

yakra commented Jan 16, 2021

Alternative selection

RAM

  • TMGi is out. On lab1, efficiency drops off sharply above 1 thread; at 3+ threads, it performs worse than no-build. A little surprising considering the memory bandwidth improvement, but I've seen this kind of behavior before when inlining code vs. keeping it in a separate function, strange as it may be.
  • TMGi3 is out. On lab1, makes up for the losses of TMGi and performs comparably to TMGf* until efficiency drops off >5 threads. At 7+ threads, performs worse than no-build. Makes sense to not include anything based off TMGi.
  • No-build is out. TMGf* perform better without exception.
  • TMGf3 preferred. TMGf* perform comparably across the board, with no clear winner. With the reduction in string construction, concatenation & memory copies, it makes sense that TMGf3 would make better use of memory bandwidth, even if the effect is small & often lost in the noise. TMGf3's performance was often, if not always, at or near the top of the pack. Call it an unclear winner.
  • I may revisit this after selecting a CPU alternative (or narrowing the field at least), and see if differences become more apparent when memory bandwidth is at more of a premium.

CPU

  • No-build is out. It's only faster in 2 instances: BiggaTomato, vet vs. vetF3, @ 2 (23.13 vs. 23.15 s) & 3 (17.03 vs. 17.53 s) threads. This appears to be just noise; we probably don't have enough threads for the bandwidth savings to really kick in. Otherwise, on all machines but lab3, all build alternatives are faster. On lab3, t performs better even with RAM @ no-build; beyond that, all build alternatives perform better across the board except for vetF3 and veF3.
  • vF3 is out. eF3 is always faster on all machines except lab3, where...
    • eF3 is faster @ 1-14 threads and again @ 24 threads
    • tF3 is faster @ 5-23 threads, except @ 15.
    • vF3 is indeed fastest @ 15 threads, but only beats tF3 by 0.11 s. (Noise?) But, why even run @ 15 threads? Sure, the fastest recorded single total run time was @ 15 threads, but with veF3, and it was a bit of an outlier. We're better off looking at the average times across all 10 passes. Wanna optimize for fastest average subgraph speed? vtF3 @ 8 threads; 11.92 s. Wanna optimize for average total run time? eF3 @ 24 threads; 51.35 s. Don't wanna use all 24 threads? 2nd best is eF3 @ 12 threads; 51.52 s. Let's call eF3 superior to vF3.
  • tF3 is out, even if it takes a clear lead on lab3 from 11-22 threads (except @ 15). It looks good, but there's no reason to bother.
    • As noted above, whether optimizing for subgraph or overall speed, we'd want a different alternative @ a different # threads.
    • Yes, tF3 did measure faster than eF3 at 9-12 threads, but this is noise: eF3 performed better than tF3 when we look at total run time, not just subgraphs.
    • Up to 8 threads, the advantages of other alternatives are too big to ignore.
    • On lab2 @ 11 threads, tF3 leads eF3 by a whole 0.06 s. Again, doesn't matter; there's more efficiency to be had at a different # threads; eF3, veF3, vtF3 & etF3 are all faster whether optimizing for subgraph or total run time.
    • On the other machines, only on lab1 @ 8 threads is tF3 not dead last; it leads vetF3 by 0.16 s. Aside from my suspicions that the OS, more so than the chipset, affects scaling to more threads (how would FreeBSD do?), vetF3 may actually be less of a priority anyway. More on this below...
  • veF3 is out. Peak efficiency (whether subgraph or total) never happens here. If considering using this, may as well use vetF3.
    • Low # threads: inferior to vetF3.
    • Mid-high # threads on lab3: comparable to vetF3, inferior to all remaining partial-build alternatives.
    • lab2: mid # threads = comparable / inferior; high = comparable / comparable
    • lab1.5: everything is low # threads. :)
  • vetF3 & vtF3 deferred. Changing vertex sets to lists would require changing set intersections for full custom graphs. Build sets as needed, iterate through lists & check membership, whatever... Kicking this can down the road lets me more easily make a decision now.
    • OK, so vtF3 yields lab3's peak efficiency by 0.61 s, and outperforms eF3 and even etF3 on BiggaTomato. But neither of these is a production environment; lab2 (and almost certainly noreaster) doesn't suffer from these bandwidth limitations. On lab2, etF3 looks to have a pretty clear lead in the "sweet spot", though by the time we take all tasks into account, it gets lost in the noise.
    • vetF3 takes 1st place on lab3 from 1-6 threads, but lags eF3, vtF3 & etF3 @ 7+. On lab2, vetF3 ramps down the fastest thru the transition zone, trailing all partial-build alternatives except vF3 & tF3 @ 8 threads. More comparable to the rest of the pack @ 10-11 threads+. On every other machine, vetF3 is the subgraph efficiency champ. Makes me a little sad to not select this alternative, but I may revisit it again later.
  • etF3 selected. With only eF3 otherwise remaining, etF3 keeps a clear lead at lower # of cores, and on lab2, edges out eF3 for the overall efficiency sweet spot (whether subgraph or total).
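
For the record, one shape the deferred vertex work could take: if vertex sets become lists, a full custom graph intersection could build a set on demand from one list and scan the other. A minimal sketch, with hypothetical types and names:

```cpp
#include <list>
#include <unordered_set>
#include <vector>

struct HGVertex {};	// illustrative placeholder, as in the sketches above

// Build a set on demand from one region's vertex list, then scan the other
// list for members: one possible replacement for a set intersection once
// vertex sets are gone.
std::vector<HGVertex*> vertex_intersection(const std::list<HGVertex*>& a, const std::list<HGVertex*>& b)
{	std::unordered_set<HGVertex*> in_a(a.begin(), a.end());
	std::vector<HGVertex*> result;
	for (HGVertex* v : b)
		if (in_a.count(v))
			result.push_back(v);
	return result;
}
```

Whether something like this beats just keeping real sets around depends on how often full custom graphs get requested; hence the deferral.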


yakra commented Jan 24, 2021

lab4

exists now. Same hardware as lab3, running Ubuntu instead of CentOS.
Overall, everything scales to more threads much more gracefully.

Does this change anything?

RAM

  • Same conclusion.

CPU

  • No-build is still out. It performs better than it did on CentOS and closes the gap to the build options, but is still pretty clearly inferior. This seems to be in line with Ubuntu making better use of memory bandwidth overall, and scaling to higher thread counts.
  • vF3 is still out. Aside from a few tiny blips @ 20+ threads, this is the worst-performing build alternative.
  • tF3 is still out. In contrast to lab3, this is the worst-performing remaining build alternative, except for vetF3 @ 16 & 21 threads.
  • eF3 is out now. While it was the last alternative eliminated on lab3, here it's clearly slower than etF3 except @ 19 (by the tiniest of margins) & 23 threads.
  • vtF3 is out. Looking at the Relative Efficiency chart in LibreOffice, it's essentially interchangeable with etF3. That one takes the lead overall by 0.21 s; this one takes the lead for subgraphs by 0.04 s. Between the two, etF3 won't disrupt full custom graphs, and performs markedly better with a low number of threads in CentOS.
  • veF3? There's more of an argument to keep it now. It's the peak efficiency winner for both subgraph (10.66 s at 9 threads) and overall time (42.86 s at 20 threads), though the latter may be just noise. Other than @ 20 threads, it trails etF3 @ 10+ threads.
  • vetF3 remains competitive thru the sweet spot and on to the end. It takes 2nd place for both subgraph (10.84 s at 9 threads) and overall time (43.61 s at 19 threads). This may be the best choice for noreaster, as I expect overall efficiency to top out at or below 8 threads, due to the number of other processes competing for RAM bandwidth and L3 cache.
  • etF3 remains a strong contender for the "sweet spot" and overall crown on CentOS. Going no-build on vertices avoids breaking full custom graph compatibility, and I won't have to lock myself into or out of any particular approach. SELECTED. I can always revisit this in the future, after implementing full custom graphs.


yakra commented Jan 31, 2021

Reopen and comment test
