Closes yakra#127.
Closes yakra#147.
Memory bandwidth
We can avoid a lot of expensive string construction, reconstruction, and memory copies by just writing the individual components of a .tmg edge line directly to the file, rather than concatenating them together into one big long std::string. This principle extends to edge labels as well, though the effects aren't as visible.
This provides a modest boost in performance at any number of threads, in the 5-10% range. The benefit is more pronounced when combined with...
Raw speed solutions
This is where things get interesting.
Having HighwayGraph::matching_vertices_and_edges compute sets of matching edges is overkill. Instead, store edges as a list, and avoid adding edges more than once by using the *_written bools already used in master graphs; this works for subgraphs too.
Put all of these together, and whoa mama. 70% improvement @ 1 thread on most lab machines. 98% on BiggaTomato. Improvements of ~40-60% are common even up to ~4-6 threads.
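The list-plus-flag idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: HGEdge, HGVertex, the single written member, and collect_edges are all hypothetical simplifications. In particular, one bool per edge stands in for the *_written bools mentioned above, whose real layout is not shown here.

```cpp
#include <list>
#include <vector>

// Hypothetical, simplified edge: one flag instead of the real *_written bools.
struct HGEdge {
    int id;
    bool written = false;  // set once the edge has been added to the list
};

// Hypothetical, simplified vertex holding pointers to its incident edges.
struct HGVertex {
    std::vector<HGEdge*> incident_edges;
};

// Collect each edge touching the given vertices exactly once.
// An edge shared by two vertices is seen twice, but the flag check
// replaces the lookup-and-insert a set would need for deduplication.
std::list<HGEdge*> collect_edges(const std::vector<HGVertex*>& vertices) {
    std::list<HGEdge*> edges;
    for (const HGVertex* v : vertices) {
        for (HGEdge* e : v->incident_edges) {
            if (!e->written) {
                e->written = true;
                edges.push_back(e);
            }
        }
    }
    return edges;
}
```

In this sketch the flags have to be cleared again before the same edges are used for another graph; how and when that happens in the real code is outside what this example covers.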
The selected alternative is etF3, the dark purple line. It scales well to a large number of threads, hitting lab2's sweet spot at 7-8, and doesn't break full custom graph support, leaving our options open in the future.
Lab3 has the least memory bandwidth divided by the most cores. Compare how the different alternatives stack up at 2, 3 or 5 threads vs. how they stack up at 15 or 18.
Newcomer lab4 has the same hardware as lab3, running Ubuntu instead of CentOS. Ubuntu works more efficiently at a higher # of threads.
Finally, a traditional wall time chart of the selected alternative vs. the old version.