TMBitsets for subgraph membership, etc #622

yakra · 2024-05-15T04:06:30Z

Closes __builtin_ctz (count trailing zeros) #616.
Closes System abbreviation error output #621.
Closes devel systems in subgraph CSVs #614.
Closes Edge compression: extraneous stuff yakra/DataProcessing#269.
Closes collapsed=traveled edge check, compressed edge ctor yakra/DataProcessing#95.
Closes Region & System vertex population yakra/DataProcessing#265.
Closes PlaceRadius: use do-while yakra/DataProcessing#254.
Closes TMBitset<HGVertex*> yakra/DataProcessing#251.

@jteresco, I'd recommend to be absolutely safe, downloading & doing a MacOS/XCode compile (warnings notwithstanding) before merging this, due to using a few clang builtins.

What's this about?

Taking care of a couple errorcheck items (#614, #621) along the way, the main focus here is implementing some long-planned speed improvements for graph generation.
A couple extra steps in the HighwayGraph ctor populate sets of vertex & edge pointers for each region & system in a subgraph.
When writing the subgraphs, we quickly populate vertex & edge sets via the magic of |= and (for fullcustom graphs) &=.
No more of this clunky iterating through vectors, chasing pointers around, checking membership with ad-hoc bitmasks, and populating more vectors.
Edges in particular benefit. There used to be 3 lists of incident edges for each vertex and 3 vectors of matching edges for each subgraph. Simple, collapsed, traveled. Now, there's 1 vector and 1 TMBitset.
No more redundantly checking an edge for membership 2, 4 or 6 times. It's just there in the TMBitset.
(The redundancy is limited to 2 checks, one for each vertex, back in the ctor when the region/system sets are populated.)
From here, we iterate 1 container, not 3, and write each edge's info to the simple, collapsed, or traveled files per its format bits.
HGVertex & HGedge objects are more lightweight, and they mey get smaller in the future too.

What does it not do?
This doesn't reach the goal of getting FreeBSD to scale gracefully to more threads (next pull request) but still yields a significant additional improvement.

Nota Bene:

In d218c4d, Region::ve_thread introduces some diffs to graphs, omitting 5 points in 4 graphs in 3 regions.
TLDR: When a location at a border contains only devel points in Region A and active/preview points only in Region B:
• Before: The vertex is included in A's region graph, and any 2+region graphs that include Region A but not B.
• After: The vertex is omitted from these graphs.
More details, including a list of the affected points & graphs, at yakra#265 (comment).
I believe this is The Right Thing To Do.

Misc. facts:

I switched to the C++17 standard for a bit when some changes caused some weird behavior by the compiler & linking errors, and added in some C++17 syntactic sugar including structured bindings.
Later, some last-minute optimizations removed the need to do this, so I reverted those commits, allowing us to continue to build siteupdate using older toolchains.
71 commits (69 if we don't count 2 that only reverted C++17 changes) got squashed down to 9.
Lambdas! Lambdas are one of my new favorite things. #617 had the 1st one; the 2nd was in #619.
This pull request introduces 7 more:
- one to populate vertex & edge TMBitsets from a PlaceRadius
- one to detach temporary partially-collapsed edges from vertices during edge compression
- ...Then there's the new optional TMBitset logs. 5 new lambdas here, including 2 lambdas within a lambda because niram has only one edge in the entire system.
TMBitset.cpp has tripled in size (that'd make a good John Lennon parody), 87 -> 261 lines.
Getting all the HGEdge objects into contiguous storage ended up a lot simpler than originally conceptualized. No need to make new objects on the heap & come up with clever ways to track & delete them if I can just emplace them in a temporary container like a std::list which goes out of scope at the end of the HighwayGraph ctor. I ultimately decided to do away with the temporary container and moves entirely, and just store all the edges, including the "dead" ones, in a TMArray. The cost of copying all the edges & then destroying the temporary container outweighed the benefit of keeping the HGEdge TMBitsets absolutely minimized and free of extraneous 0s. TMBitset iteration is pretty efficient now, and the extra 0s don't hurt much.
I considered sorting edges after construction in order to defragment their TMBitsets, but that took up more time (0.7 s on BiggaTomato) than it saved.
Skipping the sorting optimizes for speed though not RAM use, and leaves open the possibility of finding clever ways to reorder edge construction in the future.

...to be implemented 4 commits in the future.

to defragment TMBitsets & reduce heap footprint by a factor of: ~10.4 for regions ~3.52 for systems

should have happened as part of a1d59f8

yakra · 2024-05-15T04:08:46Z

Setting up for graphs of highway data

Starting at the start of graph generation....

Very small changes here; you can see we're near the margin of error, especially looking at lab2.

897c was earlier in development and got rebased; it's equivalent to d218c4d.
We're no longer setting up the ad hoc bitfield for vertex membership.
By the 2nd-most-recent commit, we're passing around & incrementing another variable, and placing larger structures into the hi_ & lo_priority_points vectors.

yakra · 2024-05-15T04:26:14Z

Creating vertices and region & system vertex (and edge) sets

Creating unique names and vertices, on its, own, gets a little faster.
No calls to add_vertex, none of the vector appends & mutexes involved.
(Minor changes to HighwayGraph::simplify could in theory make it a little faster though in practice it's near margin of error.)

However, this missing functionality has just been moved elsewhere.
Creating edge and vector sets now has separate multithreaded tasks for regions & systems.

The dotted lines represent a more apples-to-apples comparison, an earlier stage of development since squashed into d218c4d. Here, like no-build (fd96), only sets of vertices get populated.
Edges take more time to do. We iterate through the incident edges of each vertex (several per), which means checking & adding each vertex twice, once for each vertex. Then we have to check each edge to ensure it's still in the right region/system, iterating through its concurrent segments if applicable. ...Now that I think of it, there may be a way to speed this up.

Add in 6928's time for building only vertex sets and we can see that while there's more noise at more threads, with fewer threads we're still the tiniest bit ahead of the game.
Except on bsdlab, where there's a more visible improvement.
Nota bene: This doesn't include calling TMBitset::shrink_to_fit(), which may or may not push us a bit above no-build.
I don't expect it has very much impact.

For comparison, here's the current combined time of all 3 tasks; creating vertices, region vertex & edge sets, and system vertex & edge sets.
You can see that with populating the edge sets, we're taking up more time now, but this is the upfront cost we pay for more speed when generating subgraphs.
Not the most relevant chart, though. We're measuring the sum of unrelated tasks. There were no per-region/system edge sets or vectors in fd96ac6.
Another way to see the impact of the edge sets is in the height difference between lines in per-region & per-system charts above.

Now let's rewind a bit for a fun little curiosity of vertex-only sets.
Recall the no-build creates HGVertex* vectors as opposed to the TMBitsets of 6928.
Let's look at Phases 1 & 2 of threaded set population.

2f63 = threaded population of vectors, vertices only.
6928 = threaded population of TMBitsets, vertices only.

Regions:

On the low end, vectors are more efficient.
This makes sense, as adding a value to a TMBitset involves more arithmetic, bit-shifting & such.
On the high end, TMBitsets are more efficient.
This makes sense, as resizing a vector means memory copies & reallocations, which can add up.

Whereas for systems, vectors are faster across the board.
Hm. Far fewer systems (only those listed in a .csv as being in a subgraph) than regions get sets made, but that shouldn't matter; there will still be just as many threads doing copies & reallocations.
I can only speculate here, but maybe the time spent iterating through concurrent segments eases up the load on the memory bus just enough? *Shrug.*

yakra · 2024-05-15T04:35:22Z

Edge construction

The simple constructor is, well, simpler.

No more allocating & initializing the written bools.
With each vertex now only having one container of incident edges
rather than 3, we only push this edge to 2 containers rather than 6.
Said container is now a std::vector rather than a std::list; that's worth something.

The collapse ctor has even more improvements:

Again, no setting up written.
No more conditional branching to find edge1 and edge2; these are passed in from the HighwayGraph ctor now.
With 1 incident edge vector per vertex now, we push this edge to exactly 2 containers rather than potentially 4, and lose some conditional branching.
HGEdge::detach is greatly simplified.
- We erase the edge from 2 containers instead of up to 6.
- written doesn't exist anymore. No resources to free.
- It gets called less often. We only need to worry about erasing an edge from the one HGVertex::incident_edges when it "dies", when all its format bits are cleared.

Back at the HighwayGraph ctor, the collapse condition (whether to do 1 dual-format collapse or 2 independent single-format collapses) is much simpler (yakra#269, yakra#95).
We get rid of some extraneous pointer-chasing too.

yakra · 2024-05-15T04:37:03Z

Writing graph files

And now, the good stuff!

OK, so FreeBSD is a little bit faster much but not all of the time.
Maybe we're starting to get somewhere, but overall not much improvement. ...Yet.
The good news is, I've found the magic bullet; the next pull request will have bsdlab right down there hanging out with the Linux machines.

Zoomed in to just the Linux machines.

yakra · 2024-05-15T04:41:08Z

Reading subgraph descriptions

Saving this for last (well, out of the graph-related charts anyway) because it's the least. :)

Yes, the Y-axis is in milliseconds. The bars on toma & lab3 suggest we're near margin of error.

e7b1 was earlier in development and got rebased; it's equivalent to e46619a.
This uses HighwaySystem::sysname_hash.
As of 1350045, the subgraph region & system lists are replaced with vectors.

A few oddities & observations about the machines' relative performance:

CentOS performs way worse than Ubuntu; lab3 & lab4 have 100% the same hardware.
Lack of small string optimization, maybe?
lab1 takes more time than lab2, despite its higher clock speed. I don't think its slower HD is a factor, as the benchmarks are done after a "dry run" of siteupdate that reads all the data it needs to into disk cache. Subsequent runs are much faster, in particular reading .wpt files. And I don't hear the HDs chugging away.
More likely, its 2 RAM channels, as opposed to 3, are a factor.
Contrast BiggaTomato (BT), which has an ancient HD and 2 channels of DDR2-800 (vice DDR3-1600) and performs way better because Ubuntu.
I don't think bsdlab's SSD is a factor either.
The only other hardware difference from lab3 & 4 is its RAM chips; they may have different timings.
I wanna say the difference here is FreeBSD is just that much better at what it's doing.

yakra · 2024-05-15T04:42:30Z

Computing stats

One more last-minute pair of charts.

Computing stats also benefits from using __builtin_ctzl for TMBitset iteration.
...Ever so slightly.

More visible if I use a logarithmic scale? Maybe?

yakra · 2024-05-15T04:44:17Z

Ready to rock!

jteresco · 2024-05-16T19:25:37Z

Wow, this all looks great as usual. Looks like you want me to try it out on my Mac. Just see if it compiles OK, or do you want me to run some tests?

jteresco · 2024-05-16T19:29:45Z

It compiles, it runs datacheck. 7.2 seconds before, 5.5 seconds after.

I'll merge and use in the next site update.

jteresco · 2024-05-16T23:18:38Z

This is on the production siteupdate instance (in my account) and on the production datacheck instance (in /fast/tm).

Thanks again!

yakra · 2024-05-17T00:43:57Z

Thanks for merge & testing! Just a compile was all I was worried about; good to see that works.

It compiles, it runs datacheck. 7.2 seconds before, 5.5 seconds after.

That's got to all be in reading .wpt files, with the data all cached on the 2nd run. ;)

Next pull request will affect --errorcheck, speeding up userlogs.
At some point after that, stats CSVs.

jteresco · 2024-05-17T02:21:35Z

Sure enough - both are about 5.5 second on second and third runs once things are cached.

michihdeu · 2024-05-17T17:34:29Z

It compiles, it runs datacheck. 7.2 seconds before, 5.5 seconds after.

I run data check (/fast/tm) yesterday just after you wrote that. It took 12 seconds. I run it again now, and it was 14 seconds on first run and 12s on 2nd to 4th run. I wonder if it was already in place on /fast/tm when I run it yesterday?
btw, it's under 1s for railways

yakra · 2024-05-19T21:07:27Z

^ I think the 5.5s figure is for Jim's M1 Mac rather than noreaster.

No surprise for railways; much smaller dataset! 😀

yakra added 9 commits May 7, 2024 16:42

timestamp for end of HighwayGraph ctor

061aa42

new TMBitset functionality for subgraph gen

a5c60b8

...to be implemented 4 commits in the future.

duplicate systemname errorcheck (TravelMapping#621)

4ae8f8c

disallow devel systems in subgraph CSVs (TravelMapping#614)

e46619a

GLE region/system lists -> vectors

1350045

TMBitsets for subgraph membership

d218c4d

vertex/edge TMBitset RAM use logs

2634532

optimize vertex ordering

86cfdb3

to defragment TMBitsets & reduce heap footprint by a factor of: ~10.4 for regions ~3.52 for systems

remove extraneous include

3282921

should have happened as part of a1d59f8

yakra added graph generation datacheck speed CSVs multithreading labels May 15, 2024

jteresco merged commit 52ef723 into TravelMapping:master May 16, 2024

yakra deleted the gtmb_rebase_5 branch May 17, 2024 00:47

yakra mentioned this pull request May 19, 2024

#585 phase 1 #623

Merged

yakra mentioned this pull request May 23, 2024

Graph writing speeds #626

Closed

yakra mentioned this pull request Jun 4, 2024

{fmt}: #585 phase 1 take 2 #628

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TMBitsets for subgraph membership, etc #622

TMBitsets for subgraph membership, etc #622

Uh oh!

yakra commented May 15, 2024 •

edited

Loading

Uh oh!

yakra commented May 15, 2024

Uh oh!

yakra commented May 15, 2024

Uh oh!

yakra commented May 15, 2024

Uh oh!

yakra commented May 15, 2024

Uh oh!

yakra commented May 15, 2024

Uh oh!

yakra commented May 15, 2024

Uh oh!

yakra commented May 15, 2024

Uh oh!

jteresco commented May 16, 2024

Uh oh!

jteresco commented May 16, 2024

Uh oh!

jteresco commented May 16, 2024

Uh oh!

yakra commented May 17, 2024

Uh oh!

jteresco commented May 17, 2024

Uh oh!

michihdeu commented May 17, 2024

Uh oh!

yakra commented May 19, 2024

Uh oh!

Uh oh!

TMBitsets for subgraph membership, etc #622

TMBitsets for subgraph membership, etc #622

Uh oh!

Conversation

yakra commented May 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's this about?

Nota Bene:

Misc. facts:

Uh oh!

yakra commented May 15, 2024

Setting up for graphs of highway data

Uh oh!

yakra commented May 15, 2024

Creating vertices and region & system vertex (and edge) sets

Uh oh!

yakra commented May 15, 2024

Edge construction

Uh oh!

yakra commented May 15, 2024

Writing graph files

Uh oh!

yakra commented May 15, 2024

Reading subgraph descriptions

Uh oh!

yakra commented May 15, 2024

Computing stats

Uh oh!

yakra commented May 15, 2024

Uh oh!

jteresco commented May 16, 2024

Uh oh!

jteresco commented May 16, 2024

Uh oh!

jteresco commented May 16, 2024

Uh oh!

yakra commented May 17, 2024

Uh oh!

jteresco commented May 17, 2024

Uh oh!

michihdeu commented May 17, 2024

Uh oh!

yakra commented May 19, 2024

Uh oh!

Uh oh!

yakra commented May 15, 2024 •

edited

Loading