Skip to content

TMBitsets for subgraph membership, etc #622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 16, 2024

Conversation

yakra
Copy link
Contributor

@yakra yakra commented May 15, 2024

@jteresco, I'd recommend to be absolutely safe, downloading & doing a MacOS/XCode compile (warnings notwithstanding) before merging this, due to using a few clang builtins.

What's this about?

Taking care of a couple errorcheck items (#614, #621) along the way, the main focus here is implementing some long-planned speed improvements for graph generation.
A couple extra steps in the HighwayGraph ctor populate sets of vertex & edge pointers for each region & system in a subgraph.
When writing the subgraphs, we quickly populate vertex & edge sets via the magic of |= and (for fullcustom graphs) &=.
No more of this clunky iterating through vectors, chasing pointers around, checking membership with ad-hoc bitmasks, and populating more vectors.
Edges in particular benefit. There used to be 3 lists of incident edges for each vertex and 3 vectors of matching edges for each subgraph. Simple, collapsed, traveled. Now, there's 1 vector and 1 TMBitset.
No more redundantly checking an edge for membership 2, 4 or 6 times. It's just there in the TMBitset.
(The redundancy is limited to 2 checks, one for each vertex, back in the ctor when the region/system sets are populated.)
From here, we iterate 1 container, not 3, and write each edge's info to the simple, collapsed, or traveled files per its format bits.
HGVertex & HGedge objects are more lightweight, and they mey get smaller in the future too.

What does it not do?
This doesn't reach the goal of getting FreeBSD to scale gracefully to more threads (next pull request) but still yields a significant additional improvement.

Nota Bene:

In d218c4d, Region::ve_thread introduces some diffs to graphs, omitting 5 points in 4 graphs in 3 regions.
TLDR: When a location at a border contains only devel points in Region A and active/preview points only in Region B:
Before: The vertex is included in A's region graph, and any 2+region graphs that include Region A but not B.
After: The vertex is omitted from these graphs.
More details, including a list of the affected points & graphs, at yakra#265 (comment).
I believe this is The Right Thing To Do.

Misc. facts:

  • I switched to the C++17 standard for a bit when some changes caused some weird behavior by the compiler & linking errors, and added in some C++17 syntactic sugar including structured bindings.
    Later, some last-minute optimizations removed the need to do this, so I reverted those commits, allowing us to continue to build siteupdate using older toolchains.
  • 71 commits (69 if we don't count 2 that only reverted C++17 changes) got squashed down to 9.
  • Lambdas! Lambdas are one of my new favorite things. #617 had the 1st one; the 2nd was in #619.
    This pull request introduces 7 more:
  • TMBitset.cpp has tripled in size (that'd make a good John Lennon parody), 87 -> 261 lines.
  • Getting all the HGEdge objects into contiguous storage ended up a lot simpler than originally conceptualized. No need to make new objects on the heap & come up with clever ways to track & delete them if I can just emplace them in a temporary container like a std::list which goes out of scope at the end of the HighwayGraph ctor. I ultimately decided to do away with the temporary container and moves entirely, and just store all the edges, including the "dead" ones, in a TMArray. The cost of copying all the edges & then destroying the temporary container outweighed the benefit of keeping the HGEdge TMBitsets absolutely minimized and free of extraneous 0s. TMBitset iteration is pretty efficient now, and the extra 0s don't hurt much.
  • I considered sorting edges after construction in order to defragment their TMBitsets, but that took up more time (0.7 s on BiggaTomato) than it saved.
    Skipping the sorting optimizes for speed though not RAM use, and leaves open the possibility of finding clever ways to reorder edge construction in the future.

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Setting up for graphs of highway data

Starting at the start of graph generation....

Setup
Very small changes here; you can see we're near the margin of error, especially looking at lab2.

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Creating vertices and region & system vertex (and edge) sets

Vtx
Creating unique names and vertices, on its, own, gets a little faster.
No calls to add_vertex, none of the vector appends & mutexes involved.
(Minor changes to HighwayGraph::simplify could in theory make it a little faster though in practice it's near margin of error.)

However, this missing functionality has just been moved elsewhere.
Creating edge and vector sets now has separate multithreaded tasks for regions & systems.

rtmb
htmb

  • The dotted lines represent a more apples-to-apples comparison, an earlier stage of development since squashed into d218c4d. Here, like no-build (fd96), only sets of vertices get populated.
  • Edges take more time to do. We iterate through the incident edges of each vertex (several per), which means checking & adding each vertex twice, once for each vertex. Then we have to check each edge to ensure it's still in the right region/system, iterating through its concurrent segments if applicable. ...Now that I think of it, there may be a way to speed this up.

VtxAll_6928

  • Add in 6928's time for building only vertex sets and we can see that while there's more noise at more threads, with fewer threads we're still the tiniest bit ahead of the game.
    Except on bsdlab, where there's a more visible improvement.
  • Nota bene: This doesn't include calling TMBitset::shrink_to_fit(), which may or may not push us a bit above no-build.
    I don't expect it has very much impact.

VtxAll
For comparison, here's the current combined time of all 3 tasks; creating vertices, region vertex & edge sets, and system vertex & edge sets.
You can see that with populating the edge sets, we're taking up more time now, but this is the upfront cost we pay for more speed when generating subgraphs.
Not the most relevant chart, though. We're measuring the sum of unrelated tasks. There were no per-region/system edge sets or vectors in fd96ac6.
Another way to see the impact of the edge sets is in the height difference between lines in per-region & per-system charts above.

Now let's rewind a bit for a fun little curiosity of vertex-only sets.
Recall the no-build creates HGVertex* vectors as opposed to the TMBitsets of 6928.
Let's look at Phases 1 & 2 of threaded set population.

  • 2f63 = threaded population of vectors, vertices only.
  • 6928 = threaded population of TMBitsets, vertices only.

rtmb_2f63
Regions:

  • On the low end, vectors are more efficient.
    This makes sense, as adding a value to a TMBitset involves more arithmetic, bit-shifting & such.
  • On the high end, TMBitsets are more efficient.
    This makes sense, as resizing a vector means memory copies & reallocations, which can add up.

htmb_2f63
Whereas for systems, vectors are faster across the board.
Hm. Far fewer systems (only those listed in a .csv as being in a subgraph) than regions get sets made, but that shouldn't matter; there will still be just as many threads doing copies & reallocations.
I can only speculate here, but maybe the time spent iterating through concurrent segments eases up the load on the memory bus just enough? *Shrug.*

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Edge construction

Edge1
The simple constructor is, well, simpler.

  • No more allocating & initializing the written bools.
  • With each vertex now only having one container of incident edges
    rather than 3, we only push this edge to 2 containers rather than 6.
  • Said container is now a std::vector rather than a std::list; that's worth something.

Edge2
The collapse ctor has even more improvements:

Back at the HighwayGraph ctor, the collapse condition (whether to do 1 dual-format collapse or 2 independent single-format collapses) is much simpler (yakra#269, yakra#95).
We get rid of some extraneous pointer-chasing too.

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Writing graph files

And now, the good stuff!

Graphs
OK, so FreeBSD is a little bit faster much but not all of the time.
Maybe we're starting to get somewhere, but overall not much improvement. ...Yet.
The good news is, I've found the magic bullet; the next pull request will have bsdlab right down there hanging out with the Linux machines.

Graphs_Linux
Zoomed in to just the Linux machines.

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Reading subgraph descriptions

Saving this for last (well, out of the graph-related charts anyway) because it's the least. :)

SubCsv
Yes, the Y-axis is in milliseconds. The bars on toma & lab3 suggest we're near margin of error.

A few oddities & observations about the machines' relative performance:

  • CentOS performs way worse than Ubuntu; lab3 & lab4 have 100% the same hardware.
    Lack of small string optimization, maybe?
  • lab1 takes more time than lab2, despite its higher clock speed. I don't think its slower HD is a factor, as the benchmarks are done after a "dry run" of siteupdate that reads all the data it needs to into disk cache. Subsequent runs are much faster, in particular reading .wpt files. And I don't hear the HDs chugging away.
    More likely, its 2 RAM channels, as opposed to 3, are a factor.
  • Contrast BiggaTomato (BT), which has an ancient HD and 2 channels of DDR2-800 (vice DDR3-1600) and performs way better because Ubuntu.
  • I don't think bsdlab's SSD is a factor either.
    The only other hardware difference from lab3 & 4 is its RAM chips; they may have different timings.
    I wanna say the difference here is FreeBSD is just that much better at what it's doing.

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Computing stats

One more last-minute pair of charts.

Stat
Computing stats also benefits from using __builtin_ctzl for TMBitset iteration.
...Ever so slightly.

Stat_log
More visible if I use a logarithmic scale? Maybe?

@yakra
Copy link
Contributor Author

yakra commented May 15, 2024

Ready to rock!

@jteresco
Copy link
Contributor

Wow, this all looks great as usual. Looks like you want me to try it out on my Mac. Just see if it compiles OK, or do you want me to run some tests?

@jteresco
Copy link
Contributor

It compiles, it runs datacheck. 7.2 seconds before, 5.5 seconds after.

I'll merge and use in the next site update.

@jteresco jteresco merged commit 52ef723 into TravelMapping:master May 16, 2024
@jteresco
Copy link
Contributor

This is on the production siteupdate instance (in my account) and on the production datacheck instance (in /fast/tm).

Thanks again!

@yakra
Copy link
Contributor Author

yakra commented May 17, 2024

Thanks for merge & testing! Just a compile was all I was worried about; good to see that works.

It compiles, it runs datacheck. 7.2 seconds before, 5.5 seconds after.

That's got to all be in reading .wpt files, with the data all cached on the 2nd run. ;)

Next pull request will affect --errorcheck, speeding up userlogs.
At some point after that, stats CSVs.

@yakra yakra deleted the gtmb_rebase_5 branch May 17, 2024 00:47
@jteresco
Copy link
Contributor

Sure enough - both are about 5.5 second on second and third runs once things are cached.

@michihdeu
Copy link
Contributor

It compiles, it runs datacheck. 7.2 seconds before, 5.5 seconds after.

I run data check (/fast/tm) yesterday just after you wrote that. It took 12 seconds. I run it again now, and it was 14 seconds on first run and 12s on 2nd to 4th run. I wonder if it was already in place on /fast/tm when I run it yesterday?
btw, it's under 1s for railways

@yakra yakra mentioned this pull request May 19, 2024
@yakra
Copy link
Contributor Author

yakra commented May 19, 2024

^ I think the 5.5s figure is for Jim's M1 Mac rather than noreaster.

No surprise for railways; much smaller dataset! 😀

@yakra yakra mentioned this pull request May 23, 2024
@yakra yakra mentioned this pull request Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment