TMArray, TMBitset, etc. #592
Merged: 12 commits, Jul 9, 2023
Conversation

@yakra (Contributor) commented Jul 9, 2023

I believe this is all ready to rock. But if you feel like waiting a few days before merging it all in, no harm there.

This does not address the Makefile issues. I want to give that some more thought.
You can run make clean, or, if you don't want to recompile everything, rm threads/threads.d classes/Route/read_wpt.d (from the cplusplus/ dir) should be sufficient.

What is included?

Lower impact items:

Big-ticket items:

It's possible this closes some older, more obscure issues too (probably in yakra/DataProcessing rather than TravelMapping/DataProcessing). If so, I'll close them manually as I come across them.

yakra added 12 commits May 28, 2023 22:17
guard against devel systems @ front of concurrency list

54fbe97, soft reset, raise exception, comment -> docstring
cherrypick 91a073d
cherrypick 96e94d4 & reorder traveler mileage additions.

The upcoming conversion to TMBitset will introduce some overhead to iterating an empty container (specifically, finding the value of begin()).
No diff to binary, but it's The Right Thing To Do.
no diff to binary
The != operator works for unordered_set now, and will work for TMBitset in the future.
cherrypick 39457ce
cherrypick 7547908
@yakra (Contributor, Author) commented Jul 9, 2023

TMArray

This is a simple wrapper around a C array, with an allocation function, destructor, and iteration.
Right now, it's only used for storing TravelerList objects, but eventually will be used for darn near everything.

Background:
In the very early days of writing the program, I had issues with invalid pointers.
For example, a Waypoint object is created in the Route::point_list vector, NMPs are found (this point points to others & others point to this), and a pointer is inserted into the quadtree. Later on, after adding more items into point_list, it runs out of capacity & has to reallocate, and all the waypoints are moved, invalidating all the pointers.
Sure, we know how many lines are in a .wpt file, and can thus call std::vector::reserve, obviating reallocation. But even if I was aware of that function at that stage of the game, the solution didn't occur to me.
But wait, there's more!
As HighwaySystem objects are created in a master system list vector, Route objects are created, containing a pointer to their parent system. Resizing the vector as more systems are created means, again, invalidating all those pointers. This is probably the first place I ran into this problem, at the very start of getting siteupdate.cpp off the ground.
Here though, even calling vector::reserve will not work; HighwaySystem objects contain a mutex. Mutexes cannot be moved, therefore objects containing them cannot be moved either. Vectors require that their contents be moveable in order for reallocation to work, so attempting to create a vector<HighwaySystem> fails to compile.
I ended up keeping a kinda "Pythonic" approach, creating new objects on the heap individually, and saving pointers to them in the various containers. The objects stay where they're created, and there's no need to worry about pointers going invalid.
I used this approach for consistency's sake with most containers, probably even ones where there was no risk of invalidating pointers and no mutex preventing storage in a vector.
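
As an aside, the move requirement is easy to see in isolation. A minimal illustration, using a hypothetical stand-in type rather than the real HighwaySystem; note that it's growing the vector that triggers the error, since growth may have to move existing elements:

```cpp
#include <mutex>
#include <vector>

struct SystemLike {     // hypothetical stand-in for HighwaySystem
    std::mutex mtx;     // std::mutex is neither copyable nor movable, so
};                      // SystemLike's copy & move constructors are implicitly deleted

int main() {
    std::vector<SystemLike> v;  // declaring the vector is fine...
    v.emplace_back();           // ...but this fails to compile: emplace_back requires the
}                               // element type to be MoveInsertable, in case it must reallocate
```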

Motivation:
It's time to move away from this. Storing objects in contiguous memory is more efficient in several ways.

  • More predictable memory access patterns; consistently skip ahead sizeof(foo) bytes when iterating.
  • Cache locality. More cache hits.
  • Stop wasting RAM storing a bunch of unnecessary pointers.
  • It opens the door to more efficient algorithms that take advantage of pointer arithmetic. For example...
    • The TMBitset class. More on this below.
    • Replacing these two clunky functions.
    • Concurrency detection could benefit from this, not that it's all that slow right now.
    • Converting the three per-traveler regional tables to ordered maps could speed up userlogs and/or stats CSVs, at the cost of a potential minor slowdown when computing stats.
    • ...and much more.

If vector won't work, raw C arrays or, more conveniently, a new custom container class will. We can keep using range-for loops via the begin() and end() functions.
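
Something like the following, give or take. This is a bare-bones sketch of the concept, not the actual TMArray code; the names are made up, and bounds checking and the real allocation interface are omitted:

```cpp
#include <cstdlib>  // malloc, free
#include <cstddef>
#include <utility>  // std::forward
#include <new>      // placement new

template <class T>
class TMArraySketch {
    T*     data = nullptr;
    size_t num  = 0;                // elements constructed so far

  public:
    // allocate raw storage once, up front; elements never move after construction
    void alloc(size_t capacity) { data = static_cast<T*>(malloc(capacity * sizeof(T))); }

    // construct the next element in place via placement new
    template <class... Args>
    T& emplace_back(Args&&... args) { return *new(data + num++) T(std::forward<Args>(args)...); }

    // begin()/end() are all a range-for loop needs
    T* begin() { return data; }
    T* end()   { return data + num; }

    ~TMArraySketch() {              // destroy elements & free the block on scope exit
        for (size_t i = 0; i < num; ++i) data[i].~T();
        free(data);
    }
};
```

Because elements are constructed in place and the block never reallocates, pointers handed out to other data structures stay valid, and mutex-bearing types like HighwaySystem are fine.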

Secondly, the destructor cleans up allocated memory when a TMArray object goes out of scope. This helps make siteupdate "valgrind clean", stopping valgrind from reporting a bunch of memory leaks, without having to worry about manually deleting everything. LOL, and because it happens at the very end as we exit main, it doesn't even contribute to the Total run time figure. If that's arguably cheating, it's a kind I'm OK with. :)

Does it use less RAM?

  • The object itself is 16 B, compared to 24 B for a vector or list.
  • For the memory it dynamically allocates to store its data, it depends on how we use it.
    • A TMArray<Foo> is no less efficient than a vector<Foo>. A vector<Foo> can be as efficient if we first reserve space before populating it, but most of the time that's not done; usually items are just added and the vector reallocates, resulting in unused extra space.
    • We are able to get rid of unnecessary pointers in our existing vector<Foo*> containers though.
    • A list<Foo> would contain pointers to the prev/next items at each list node.

Is it faster?
No faster than a vector<Foo>, but faster than what we have now.

  • Reduced indirection
  • As noted above, RAM & cache benefits
  • Indirectly, the potential for newer, better algorithms.

@yakra (Contributor, Author) commented Jul 9, 2023

TMBitset

This is a replacement for unordered_sets of pointers.
It stores one bit for each TravelerList in the system, with on/off indicating whether or not it's in the set, rather than 64 bits for each TravelerList* in the set. Info for up to 512 travelers can fit in a single cache line, compared to only 8 64-bit pointers.
I went to kind of ridiculous lengths to Micro-Optimize All The Things.

Is it faster?
Yes.

  • Insertion:
    No more taking time hashing, mucking about with buckets & checking for duplicate values.
    The distance of a TravelerList's bit from the start of the bit array = the distance of its memory address from TravelerList::allusers.data. A little arithmetic, and we can flip one bit, and BOOM! (A sketch of this follows the list below.)
    Speeds up:
    • Processing traveler list files
    • Augmenting travelers for detected concurrent segments
  • Checking membership:
    No more taking time hashing & mucking about with buckets.
    Via the simple arithmetic above, we find a bit and see if it's 0 or 1.
    Speeds up:
    • Creating per-traveler stats logs and augmenting data structure
  • Comparing contents of one set to another:
    Before, unless the two sets were different sizes, we had to iterate through the contents of one (slow!) and check each item's membership in the other set (slow!) until we found one missing.
    Now, we have two arrays of equal length, with each TravelerList represented in the same position in both -- memcmp those babies and BOOM!
    Speeds up:
    • Compressing collapsed edges
  • Unions of sets:
    Before: iterate thru members of each set, check membership, add to a new master list, set membership to true.
    Now: Create a new master TMBitset, and instantly add up to 64 items at once via a Bitwise Or.
    Speeds up:
    • Traveled graph generation
  • Iteration:
    I was worried this might be slower than iterating an unordered_set. Turns out it's faster. Win!
    Speeds up:
    • Computing stats
    • Traveled graph generation
    • segments SQL table
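
To make the above concrete, here's a stripped-down sketch of the mechanism. It is not the actual TMBitset source; iteration, copy handling, and the micro-optimization tricks are left out, and the names are invented. It assumes the container is handed a base pointer and a count at allocation time, so an item's bit index is just its distance from that base:

```cpp
#include <cstdint>
#include <cstdlib>  // calloc, free
#include <cstring>  // memcmp
#include <cstddef>

template <class item, class unit = uint32_t>
class TMBitsetSketch {
    const item* start = nullptr;  // base of the item array, e.g. TravelerList::allusers.data
    unit*  bits  = nullptr;       // one bit per possible item, whether in the set or not
    size_t units = 0;
    static constexpr size_t ubits = 8 * sizeof(unit);

  public:
    void alloc(const item* base, size_t count) {
        start = base;
        units = (count + ubits - 1) / ubits;
        bits  = static_cast<unit*>(calloc(units, sizeof(unit)));  // zero-filled = empty set
    }
    ~TMBitsetSketch() { free(bits); }

    // insertion: pointer arithmetic gives the bit index; flip one bit and BOOM
    void add(const item* i)            { size_t x = i - start; bits[x / ubits] |= unit(1) << x % ubits; }

    // membership: same arithmetic, read one bit
    bool contains(const item* i) const { size_t x = i - start; return bits[x / ubits] >> x % ubits & 1; }

    // comparison: the arrays line up position-for-position, so memcmp those babies
    bool operator!=(const TMBitsetSketch& o) const { return memcmp(bits, o.bits, units * sizeof(unit)); }

    // union: each |= merges up to 8*sizeof(unit) items at once
    TMBitsetSketch& operator|=(const TMBitsetSketch& o)
        { for (size_t u = 0; u < units; ++u) bits[u] |= o.bits[u]; return *this; }
};
```

(In the real code, sets like HighwaySegment::clinched_by play this role; the actual class has more machinery than shown here.)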

Does it use less RAM?

  • The object itself is 32 B, compared to an unordered_set, which is 40 B on FreeBSD, 48 B on my CentOS boxes, and 56 B on Ubuntu. At 1,149,116 segments, we save 8.8, 17.5, and 26.3 MB respectively. Not bad. :)
  • For the memory it dynamically allocates to store its data, the short answer is yes, but it's more complex; it depends on the underlying data. With relatively few items in each set, an unordered_set can be more efficient. With relatively few total items to choose from, a TMBitset can be more efficient, which is what we have. An unordered_set<Foo*> uses 64 bits for every item in the set, whereas a TMBitset allocates 1 bit for every item whether in the set or not.
    With 377 travelers in the system, a TMBitset allocates 48 bytes. I could have had devel segments avoid allocating this space, but that'd only save 901 KB, so I opted to keep it simple. Thus all 1,149,116 segments allocate these 48 B, for a total of 52.6 MB. With 8,393,470 entries in the clinched DB table times 8 B, that's 64 MB.
    Or, on a per-segment basis (including devel segments, because 1) it's the more apples-to-apples comparison, since I allocate this memory with TMBitset, and 2) it's the more conservative figure, the one more favorable to unordered_set): 58 B vice 48.
    Plus, all of this doesn't even take into account all the extra RAM unordered_set has to allocate behind the scenes to get stuff to work under the hood.
  • Overall, 80f2724 -> 634da8b, System Monitor tells me I'm saving about 325 MB.

There are 2 template parameters. What's that all about?
The 1st one, item, is pretty straightforward. It's the data type included (or not) in the set.
The 2nd one, unit, controls the size of the data chunks that are allocated and processed, and allows us to fine-tune how things behave.

  • Smaller units like uint8_t or unsigned char minimize the amount of unused RAM allocated.
    Right now with 377 travelers, it makes no difference; we're using 48 B either way (worked out in the snippet after this list).
  • Larger units like uint64_t or unsigned long allow the |= (bitwise or) operator to work more efficiently.
  • Unit size also affects how iteration behaves under the hood. The sweet spot depends on the underlying data; small units could be fastest, maybe large ones, maybe somewhere in between. I chose uint32_t for the current application.
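
For what it's worth, the "48 B either way" figure above is easy to verify; bytes_for here is just a throwaway helper, not anything in the repo:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstdio>

// bytes allocated for n possible items when rounding up to whole units
template <class unit>
constexpr size_t bytes_for(size_t n) { return (n + 8 * sizeof(unit) - 1) / (8 * sizeof(unit)) * sizeof(unit); }

int main() {
    std::printf("%zu %zu %zu\n",
                bytes_for<uint8_t >(377),   // 48
                bytes_for<uint32_t>(377),   // 48
                bytes_for<uint64_t>(377));  // 48 -- all three unit sizes land on 48 B for 377 travelers
}
```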

Do I have bigger plans for this class?
Not so much. I only see 2 further applications right now. Neither uses unordered_set, so I'd expect to see less benefit than we see for graph generation with this pull request.

  • Vertex membership for subgraphs already uses an ad hoc bit field. There could be a tiny speedup in zeroing out all these bits between graphs, but not much else. The primary gain here is cleaner code.
  • HGEdge membership for subgraphs uses an array of numthreads bools, similar to what TravelerLists did until now. The potential benefits are in memory locality & cache hits. There's a bit of a heavier lift to get this one into place. Low priority, though I'd still like to try it eventually.

Even if there's not much more to be done, at least this has already provided a good speed-up in a number of areas.

@yakra (Contributor, Author) commented Jul 9, 2023

The first 2 tasks benefit from faster set insertions.

[chart: List]

  • lab3 does a better job of scaling up to 23 threads now.
  • lab4 already made it to 23, but now scales all the way to 24, outperforming lab2 despite its lower clock speed & memory bandwidth.
  • bsdlab starts out, as before, comparable to lab3, but starts to suffer from FreeBSD's memory bandwidth issues & doesn't scale up to more threads as well. There's still some improvement though, and this effect isn't as pronounced as with other tasks. I've some ideas to speed this up which may or may not pan out; low priority.

[chart: Conc]

  • Good performance with FreeBSD here, outperforming lab3 up to ~6 threads, comparable above that.

@yakra (Contributor, Author) commented Jul 9, 2023

Iteration at work.

[chart: Stat]
I was pleasantly surprised to find iterating HighwaySegment::clinched_by faster with a TMBitset than an unordered_set.

@yakra (Contributor, Author) commented Jul 9, 2023

Checking set membership.

[chart: User]

  • Big improvements here. LOTS of lookups. For every route in a system a user has traveled, we iterate through its segments, check whether each is clinched_by a traveler, and if so, add its mileage to the traveler's per-route total. (The rough shape of this loop is sketched after this list.)
  • For FreeBSD, this helps at the extreme low end, but at 4+ threads we start to saturate the memory bus, and do about the same as before.
    This is very similar to its behavior with graph generation, where it's most efficient at just 1 thread: wall time increases quickly, plateaus from 4-10 threads, ramps up again till 18 threads, then falls off, ending up a bit slower than the first plateau.
  • I've some long-term low-priority ideas that could potentially help out here too.
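
For context, the rough shape of that lookup loop, as a self-contained toy with made-up types and std::bitset standing in for TMBitset:

```cpp
#include <bitset>
#include <vector>
#include <cstdio>

struct Segment { double length; std::bitset<512> clinched_by; };  // toy stand-ins, not the real classes
struct Route   { std::vector<Segment> segments; };

int main() {
    size_t traveler = 3;                               // index of one traveler's bit
    std::vector<Route> routes(1);
    routes[0].segments = {{1.5, {}}, {2.0, {}}, {0.7, {}}};
    routes[0].segments[0].clinched_by.set(traveler);
    routes[0].segments[2].clinched_by.set(traveler);

    double total = 0;
    for (const Route& r : routes)                      // every route the user has traveled
        for (const Segment& s : r.segments)            // every segment of that route
            if (s.clinched_by.test(traveler))          // one membership lookup per segment
                total += s.length;
    std::printf("%.1f mi\n", total);                   // 2.2 mi
}
```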

@yakra (Contributor, Author) commented Jul 9, 2023

Comparing contents of two sets:
The != operator at work.

[chart: Comp]

  • FreeBSD does great here; compare its performance to lab3 (CentOS 7) & lab4 (Ubuntu 18.04) on the same hardware.
  • Big improvement for lab4; it's not terribly far behind bsdlab now.

@yakra (Contributor, Author) commented Jul 9, 2023

Traveled graph generation benefits from iteration and unions of sets.

[chart: Graph]
As seems to be the pattern with FreeBSD, graph generation improvements appear to save a tiny bit of time at 1 thread, and just not help out above that.
Still though, I remain cautiously optimistic. There are a number of avenues to explore targeting memory locality, HGEdge alignment & size, cache hits, etc.

[chart: Graph_Linux]
Same thing, zoomed in to just the Linux machines. Performance seems to max out at ~6 threads. (For now at least?)

@yakra (Contributor, Author) commented Jul 9, 2023

[chart: Total]
Overall, siteupdate's peak efficiency on bsdlab is at just 1 thread. :( The good news is, it's faster @ 1 thread now than it was @ 2 before.
Graph generation is a big piece of the pie, and successfully speeding it up should make the picture look much better.
I didn't think of this till just now, but an interesting chart to look at would be total wall time in --errorcheck mode, with only userlogs being a real poor performer. It'll take me another day to produce the data for that though.

@jteresco (Contributor) commented Jul 9, 2023

Wow, this looks great. Thanks for all the effort on this.

@jteresco merged commit 6851355 into TravelMapping:master on Jul 9, 2023
@rickmastfan67 commented:

So, @yakra, this is why you've been ignoring your TX threads on the forums, eh? LOL. ;)

@yakra (Contributor, Author) commented Jul 10, 2023

Thanks! I'm particularly proud of TMBitset. 😀
Leveled up a bit at C++ in the process. This pull request includes the first uses of constexpr, friend, calloc, aligned_alloc, and placement new. Partial template specialization was a headache, but now I know some stuff. Tried out inheritance before settling on the TMBImpl template class. Experimented with thread_local variables, but they didn't make the final cut.

I didn't think of this till just now, but an interesting chart to look at would be total wall time in --errorcheck mode, with only userlogs being a real poor performer. It'll take me another day to produce the data for that though.

Or not. I've just started the benchmark script on bsdlab, the last machine left. That would take about 14 hours if graph generation were enabled, slowing overall speed down to a crawl. Without it? I'll guess "5 more hours" & see how close that is. May not be awake by then.

Having ticked off a bunch of boxes at #518, I just need to "sweep up the shop floor". Gotta delete a couple dozen local branches that were rebased, reordered, squashed, cherrypicked, or went nowhere. Some of the code that didn't make the final cut, I'd still like to save for future reference. Probably paste it into yakra#245.
After that, I plan to return to the forum shortly.

HA! Speaking of which, I see @rickmastfan67 wrote:

So, @yakra, this is why you've been ignoring your TX threads on the forums, eh? LOL. ;)

Yes, exactly that. LOL. All the threads.
I wanted laser-like focus on this sub-project, and chose to tune out other distractions while working on it. That, and the day job leaves a lot less time & energy for the project than I've had in the past. I think it's been what, a couple-few months since I've logged in / viewed the forum?
I've had a few forum threads open in other browser tabs this entire time, waiting for me to get back to them. If they're reloaded every time I restart firefox, that may give a false positive & show me as active on the forum, but yeah no, I've been MIA.
Will be back soon, I know I've at least one unread PM. And I know @mapcat has driven the newly-opened realignment of NS104 west of Antigonish.

@rickmastfan67 commented:

I've had a few forum threads open in other browser tabs this entire time, waiting for me to get back to them. If they're reloaded every time I restart firefox, that may give a false positive & show me as active on the forum, but yeah no, I've been MIA.

Yep, every time you restarted Firefox, it pinged the forums and reloaded said threads. Thus, since you were logged in, it updated your profile and showed you as logged in.

Let's just say, you've possibly missed 20+ threads for your areas. :p lol. ;)

And the threads you had pinned in Firefox probably will not properly show which posts (if any) are 'new', as after the first refresh due to the browser restart, that 'new' button will go bye-bye.

So, in the future, might I suggest creating a toolbar folder and saving bookmarks to threads in that? That way, you'll not lose the 'new' part, and you won't ping the forums saying you're active the way you do when you leave the threads pinned in Firefox. Then, once you've addressed the thread, just delete the bookmark. :)

@rickmastfan67 commented:

Oh, and the "Show unread posts since last visit." link will 99.9999% be useless to you as well, due to those pinned threads you left open in Firefox sadly. You'll have to use the "Click here to try all unread topics." link instead and dig thru it to find threads in your areas, as that will be the only way to properly see what's unread to your account.

@yakra (Contributor, Author) commented Jul 10, 2023

And the threads you had pinned in Firefox probably will not properly show which posts (if any) are 'new', as after the first refresh due to the browser restart, that 'new' button will go bye-bye.

No big deal on these probably. Most of them weren't very active. I can take the .msg65535#new, change it to .msg65535#msg65535, go by what's onscreen, and look at the post dates if necessary.

Oh, and the "Show unread posts since last visit." link will 99.9999% be useless to you as well, due to those pinned threads you left open in Firefox sadly.

For whatever reason, that link's never really been on my radar. I'll probably just stick to my usual MO...
Middle-click each sub-board to open in a new tab. IIUC, the "new" notifications here won't be accurate due to my "logins", so just click everything. There's a small number of sub-boards thankfully.
Look for topics with unread posts -- these should be tagged new regardless of my logins. And then middle-click a bunch more tabs. :P

You'll have to use the "Click here to try all unread topics." link instead

Interesting; don't think I was aware of this one. Is it linked from the main page?

@yakra (Contributor, Author) commented Jul 10, 2023

Anyway, here's the chart of total run time in --errorcheck mode.
[chart: ErCh]
Same vertical scale as the total time chart above, for easy side-by-side comparison with & without graph generation & the .sql file.

[chart: ErCh_zoom]
Zoomed to fit.

@rickmastfan67 commented:

You'll have to use the "Click here to try all unread topics." link instead

Interesting; don't think I was aware of this one. Is it linked from the main page?

No. It's inside of the "Show unread posts since last visit." area. That specific phrase only shows up if there's 0 unread posts since your last visit. Otherwise, look for the "All Unread Topics" button, as that's the same link, just different text.

@michihdeu (Contributor) commented:

So, @yakra, this is why you've been ignoring your TX threads on the forums, eh? LOL. ;)

Yes, exactly that. LOL.

Good to know 😃 I was a little bit worried meanwhile... You didn't miss much but lost your pole position in the "Top 10 Posters" ranking 🤣

@yakra (Contributor, Author) commented Jul 16, 2023

#592 (comment)

Do I have bigger plans for this class?
Not so much. I only see 2 further applications right now. Neither uses unordered_set, so I'd expect to see less benefit than we see for graph generation with this pull request.

  • Vertex membership for subgraphs already uses an ad hoc bit field. There could be a tiny speedup in zeroing out all these bits between graphs, but not much else. The primary gain here is cleaner code.

On further thought...

  • Creating vertex sets for multi-region (and country & continent) and multi-system graphs could benefit from |=ing together multiple region or system sets.
  • Full custom graphs could create intersections of these multi-region & multi-system sets with a new & or &= operator.
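
A hypothetical intersection would be symmetric to the union: with both bit arrays laid out over the same base, it's just a unit-by-unit AND. Nothing in this PR; the helper below is only to show the shape of the idea, mirroring the |= in the earlier sketch:

```cpp
#include <cstdint>
#include <cstddef>

// hypothetical: what a TMBitset operator&= would boil down to;
// dst and src are two equally sized bit arrays covering the same item range
void bitset_intersect(uint32_t* dst, const uint32_t* src, size_t units)
{
    for (size_t u = 0; u < units; ++u) dst[u] &= src[u];
}
```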

@michihdeu wrote:

Good to know 😃 I was a little bit worried meanwhile... You didn't miss much but lost your pole position in the "Top 10 Posters" ranking 🤣

I take that to mean, not that I've dropped all the way out of the top 10, but just lost the # 1 spot. 😁 Honestly, I've kinda wanted that to happen for a while.
And unless things have changed drastically since I was last around... congrats! 😀

Successfully merging this pull request may close these issues.

  • canonical segments for graph edges
  • TMBitset checklist
  • handling nonexistent .list files