
C++ subgraph generation speed #404

Merged · 5 commits · Jan 31, 2021
Conversation

@yakra (Contributor) commented Jan 24, 2021

Closes yakra#127.
Closes yakra#147.

Memory bandwidth

We can avoid a lot of expensive string construction, reconstruction, and memory copies by writing the individual components of a .tmg edge line directly to the file, rather than concatenating them into one big long std::string first. The same principle extends to edge labels, though the effect there is less visible.
This yields a modest performance boost at any number of threads, in the 5-10% range. The benefit is more pronounced when combined with...
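As a rough sketch of the direct-write idea (the simplified Edge struct and field names here are hypothetical, not TM's actual types):

```cpp
#include <sstream>
#include <string>

// Hypothetical simplified edge; the real code writes .tmg edge lines
// with more fields, but the pattern is the same.
struct Edge {
    int v1, v2;
    std::string label;
};

// Slower pattern: build (and copy) one big temporary string per edge line.
std::string edge_line_concat(const Edge& e) {
    return std::to_string(e.v1) + ' ' + std::to_string(e.v2) + ' '
         + e.label + '\n';
}

// Faster pattern: stream each component straight into the output stream;
// no intermediate std::string is constructed or copied.
void write_edge_line(std::ostream& os, const Edge& e) {
    os << e.v1 << ' ' << e.v2 << ' ' << e.label << '\n';
}
```

Both produce identical output; the second just skips the temporary allocations, which is where the 5-10% comes from when multiplied across every edge in every graph.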

Raw speed solutions

This is where things get interesting.

  • Edges: Having HighwayGraph::matching_vertices_and_edges compute sets of matching edges is overkill. Instead, store edges as a list, and avoid adding an edge more than once by using the *_written bools already used in master graphs; this works for subgraphs too.
  • Travelers: Same idea when finding travelers for each graph.
  • Vertices: We can also do this when combining vertex lists in multiregion & multisystem graphs. However, it breaks the implemented-but-unused code for full custom graphs, forcing us to find a new solution.
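The sets → lists & bools swap might look roughly like this (hypothetical simplified types; the real code reuses the existing *_written flags on TM's edge objects):

```cpp
#include <vector>

// Hypothetical minimal edge: only the "written" flag matters here.
// In the real code this flag is reset between graphs.
struct Edge {
    bool written = false;
};

// Instead of inserting into a std::set (tree rebalancing + an allocation
// per insert, just to deduplicate), append to a plain std::vector and use
// the per-edge flag as an O(1) duplicate check.
void add_matching_edge(Edge& e, std::vector<Edge*>& matching) {
    if (!e.written) {
        e.written = true;
        matching.push_back(&e);
    }
}
```

The vector preserves insertion order and costs one amortized append per edge, versus O(log n) plus an allocation for each std::set insert; the flag does the deduplication the set used to provide.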

Put all of these together, and whoa mama. 70% improvement @ 1 thread on most lab machines. 98% on BiggaTomato. Improvements of ~40-60% are common even up to ~4-6 threads.

  • However: This all requires more memory bandwidth. (Without the memory bandwidth solution above, all these changes together perform worse than no-build @ 9+ threads on lab3.) We start hitting a RAM bottleneck, with more muted performance improvements beyond ~7 threads.
  • Compromise: We can trade off some raw speed for better RAM efficiency, allowing the algorithm to scale more gracefully to a larger number of threads. Thus we get higher overall efficiency (less wall time) at a higher thread count.
  • Convert over any one or two of vertices/edges/travelers, whichever produces the best results.

Subg_RE-2
The selected alternative is etF3, the dark purple line. It scales well to a large number of threads, hitting lab2's sweet spot at 7-8, and doesn't break full custom graph support, leaving our options open in the future.


Subg_RE-3
Lab3 has the least memory bandwidth divided by the most cores. Compare how the different alternatives stack up at 2, 3 or 5 threads vs. how they stack up at 15 or 18.


Subg_RE-4
Newcomer lab4 has the same hardware as lab3, running Ubuntu instead of CentOS. Ubuntu works more efficiently at a higher # of threads.


Subg
Finally, a traditional wall time chart of the selected alternative vs. the old version.

Commits

  • reduce string copies when writing graphs (function)
  • allocate fstr once and pass, rather than reallocate thousands of times
  • edge labels: instead of std::string construction & copying, just insert its components into the ofstream
  • TMGf3 bugfix
  • C++ traveler matching: sets -> lists & bools
  • C++ edge matching: sets -> lists & bools
Successfully merging this pull request may close these issues:

  • subgraph speed/scaling alternatives
  • .tmg edge lines