Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ site update to production #375

Closed
jteresco opened this issue Dec 11, 2020 · 17 comments
Closed

C++ site update to production #375

jteresco opened this issue Dec 11, 2020 · 17 comments

Comments

@jteresco
Copy link
Contributor

I'd like to work toward using the C++ version of the site update program. @yakra are you confident in the threaded version's readiness for production site updates?

I think that moving to the C++ program its inherent efficiency advantages over Python and the ability to do some meaningful multithreading, in conjunction with the addition of SSDs and/or using memory filesystem could really speed the whole process significantly.

@jteresco
Copy link
Contributor Author

A first question on this: what C++ compiler and version do you use when you build it for Linux?

g++ 9.3.0 compiles and links, but I get a run time error on noreaster:

ld-elf.so.1: /usr/local/lib/gcc5/libstdc++.so.6: version GLIBCXX_3.4.22 required by /tm-tmpfs/travelmapping/DataProcessing/siteupdate/cplusplus/a.out not found

clang 10.0.1 doesn't compile it successfully, being unhappy with the use of subscript operators.

@yakra
Copy link
Contributor

yakra commented Dec 12, 2020

@yakra are you confident in the threaded version's readiness for production site updates?

Other than #281 (just cosmetic, an easy fix for a clean readable siteupdate.log), yes! I believe I have mutexes around everything else I need to, and even some things I don't. The last crashes & glitches were fixed in #333. Since then I've run siteupdate many thousands of times during speed tests with no other known issues except #327, which "is super obscure, and not too big a deal in the real world". It shouldn't hold the C++ version back from going live, and may become moot anyway once multi-threaded HighwayGraph setup enters the picture.

I think that moving to the C++ program its inherent efficiency advantages over Python and the ability to do some meaningful multithreading, in conjunction with the addition of SSDs and/or using memory filesystem could really speed the whole process significantly.

No need for super-fast storage subsystems. NMP_merged files take about 0.5s on lab2. Unsure what the exact bottleneck is here, but everything else is CPU/memory bound. Graph files are written to disk at less MBps than nmp_merged.
Edit: Initially reading WPTs. I keep forgetting about that. It's fast if the data has been read recently & is cached, but can crawl otherwise.

A first question on this: what C++ compiler and version do you use when you build it for Linux?

g++ 9.3.0 compiles and links, but I get a run time error on noreaster:

ld-elf.so.1: /usr/local/lib/gcc5/libstdc++.so.6: version GLIBCXX_3.4.22 required by /tm-tmpfs/travelmapping/DataProcessing/siteupdate/cplusplus/a.out not found

Yup. Same message I got.
In the immediate term, you can disable threading by commenting out

#define threading_enabled // comment out this line to disable threading on noreaster or FreeBSD systems.

What versions do I use?

box(es) ver
lab1, lab2 g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
lab3 g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
BiggaTomato g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
lab1.5, epoch g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

clang 10.0.1 doesn't compile it successfully, being unhappy with the use of subscript operators.

I'm new to clang *checks wikipedia article*.
Interesting. What does it choke on?
I'm using C++11, so on my linux boxes I use g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate
On noreaster, we don't even need those switches; once I comment out enable_threading.cpp, just g++ siteupdate.cpp -o siteupdate does the trick.

@jteresco
Copy link
Contributor Author

All right, good. We can think about moving in this direction.

As far as clang, it gives pages of errors, many of which look like this:

In file included from siteupdate.cpp:72:
./classes/DatacheckEntry.cpp:69:14: error: type 'std::array' (aka 'array, allocator >, 6>') does not provide a subscript operator
                if (fpentry[0] != route->root)  return 0;
                    ~~~~~~~^~

I don't think it really matters to me if we compile with clang, g++, or something else, so I wouldn't consider that a high priority.

Here's a new Issue about getting this compiled and running with pthreads support: #376

@yakra yakra mentioned this issue Dec 13, 2020
2 tasks
@yakra
Copy link
Contributor

yakra commented Jan 21, 2021

My working tree on noreaster has (for the moment) some changes to 4b9c719 to enable compilation with clang++.
siteupdateST compiles fine with g++, and runs to completion.
With clang++, siteupdate & siteupdateST crash when Sorting colocation lists. Somehow, there's an invalid pointer for Waypoint::colocated. Strangely, always at la.la1088 LA36 (30.433911,-89.91344), even when multi-threaded.

Sorting WaypointQuadtree at (30.41015625,-90) to (30.5859375,-89.6484375) contains 37 waypoints
Points sorted. Sorting colocation lists...
...
3 la.la0036 LA1088 (30.433911,-89.91344)
0x912e2a0e0
...
19 la.la1088 LA36 (30.433911,-89.91344)
0x12e2a0e0

Even though this memory is allocated on the heap, I'm seeing consistent values for all Waypoint::colocated locations in the quadtree node, (No problem, that helps!)
What's really weird here is how the 2 addresses above are the same except for the missing high-order nibble.

Even if I comment out Sorting colocation lists, there's still a crash when Setting up for graphs of highway data. No surprise here, as we'd still encounter the invalid Waypoint::colocated pointer and attempt to dereference it.

  • How does this happen?
  • How early in the process?

This will take a pretty deep dive, one that I don't have the attention span for right yet.
Two other projects have each yielded an unmanageably large number of experimental branches. I want to play arborist, prune extraneous branches, and merge changes back into master before proceeding.

@yakra
Copy link
Contributor

yakra commented Jan 22, 2021

The bug above is reproducible with
HighwayData @ d5d320322c2e46a722299b6bd1c8a6e25526fcbd
UserData @ 3491f1b4e5dbe11a4d470ed67257bb929442af22 & 618d6c4616543c3fdee9273ddc012264e46e5303

Starting to get the impression it's related to a specific point (or range rather) in memory.

  • To be more certain, what I really want is the address of colocated within the Waypoint object, not the address stored in the pointer.

Checking out an older HighwayData commit to cause all the waypoint objects (and thus that particular colocation list pointer) to be stored in different places seems to do the trick. The pointer stays valid and siteupdate runs to completion.

So what's the deal? Is there bad RAM? Why do we lose exactly 0x900000000, with two ones turning to zeros?

  • I'll take a closer look at where exactly that colocation list pointer is itself stored...

@yakra
Copy link
Contributor

yakra commented Jan 22, 2021

What memory locations?

I'll take a closer look at where exactly that colocation list pointer is itself stored...

0x912d86208 consistently with siteupdateST.
0x914848208 - 0x914898208 with 7 trials of siteupdate -t 1.
0x919287768 - 0x91a957428 with 8 trials of siteupdate -t 4.

Doing a git bisect from HighwayData @ 27a6254 which ran OK, the crash first appears...

  • with siteupdateST at HighwayData @ 60d774f. Right when Waypoint::colocated locations are first starting to edge over 0x900000000.
    0x900080048 fails.
  • with siteupdate -t 4 at HighwayData @ 9041c49.
    0x901052a28 fails.

siteupdate.log with some debug text:

usala...................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
..................................................................................................................................................................DEBUG:  this=la.la1088
 LA36 (30.433911,-89.91344)
DEBUG:  this->colocated=0x9000aee40
DEBUG: other->colocated=0x9000aee40
........................................................................................................................................................................................
.............................................................................................................................................................!
usanyp......................!
usapr...................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
..................!
usaush...................................................................................................................!
zafr..........................................................!
zmbt......!
zmbm..............!
zwer.........!
zwep............!
afrtah.............................................!
argrn......!
asiaho...........!
asimo.................................!
ausb.....!
cannss...................!
codn..............................!
fraarad03.!
frabfcd25..!
frabfcd39...!
frabfcd58.!
frabfcd70...!
fragesd51....!
frahdfd02.....!
mexsf....................................................................!
naca..............!
ngae.!
rusr....................!
[23.0] Sorting waypoints in Quadtree.
#0      *0x87d2e8cc8    la.i059 11 (30.46041,-89.697297)
         0x883b6a540    la.us011 I-59(11) (30.46041,-89.697297)
         0x883b6a540    la.i059 11 (30.46041,-89.697297)
#1      *0x87d2e8d68    la.i059 LA/MS [alt: ['+999']] (30.46262,-89.695)
         0x87cf03580    ms.us011 LA/MS (30.46262,-89.695)
         0x87cf03580    la.us011 LA/MS (30.46262,-89.695)
         0x87cf03580    ms.i059 LA/MS [alt: ['+0']] (30.46262,-89.695)
         0x87cf03580    la.i059 LA/MS [alt: ['+999']] (30.46262,-89.695)
#2      *0x8fe902948    la.la0036 AirSt (30.452215,-89.988221)
#3      *0x8fe9029e8    la.la0036 LA1088 (30.433911,-89.91344)
         0xaee40        la.la1088 LA36 (30.433911,-89.91344)
         0x9000aee40    la.la0036 LA1088 (30.433911,-89.91344)
#4      *0x8fe902b28    la.la0036 LA41 (30.416481,-89.784994)
         0x87d71cdc0    la.la0041 LA36 (30.416481,-89.784994)
         0x87d71cdc0    la.la0036 LA41 (30.416481,-89.784994)
#5      *0x8fe902a88    la.la0036 LA434 (30.427176,-89.885888)
         0x8ff582240    la.la0434 LA36 (30.427176,-89.885888)
         0x8ff582240    la.la0036 LA434 (30.427176,-89.885888)
#6      *0x8fe9725e8    la.la0041 FirRd (30.465284,-89.811687)
#7      *0x8fe9724a8    la.la0041 LA36 (30.416481,-89.784994)
         0x87d71cdc0    la.la0041 LA36 (30.416481,-89.784994)
         0x87d71cdc0    la.la0036 LA41 (30.416481,-89.784994)
#8      *0x8fe972728    la.la0041 LA435 (30.534172,-89.868121)
         0x8ff582340    la.la0435 LA41 (30.534172,-89.868121)
         0x8ff582340    la.la0041 LA435 (30.534172,-89.868121)
#9      *0x8fe9727c8    la.la0041 LA435Spr (30.542387,-89.875653)
         0x8ff582420    la.la0435sprtal LA41 (30.542387,-89.875653)
         0x8ff582420    la.la0041 LA435Spr (30.542387,-89.875653)
#10     *0x8fe972688    la.la0041 MaxMerRd (30.499715,-89.814252)
#11     *0x8fe972548    la.la0041 TeatBlaRd (30.438203,-89.787827)
#12     *0x8ff53e288    la.la0434 LA36 (30.427176,-89.885888)
         0x8ff582240    la.la0434 LA36 (30.427176,-89.885888)
         0x8ff582240    la.la0036 LA434 (30.427176,-89.885888)
#13     *0x8ff53e5a8    la.la0435 +X335304 (30.534154,-89.902582)
#14     *0x8ff53e508    la.la0435 HilBlvd (30.50565,-89.97365)
#15     *0x8ff53e6e8    la.la0435 LA41 (30.534172,-89.868121)
         0x8ff582340    la.la0435 LA41 (30.534172,-89.868121)
         0x8ff582340    la.la0041 LA435 (30.534172,-89.868121)
#16     *0x8ff53e648    la.la0435 LA435Spr (30.532564,-89.875481)
         0x8ff582360    la.la0435sprtal LA435 (30.532564,-89.875481)
         0x8ff582360    la.la0435 LA435Spr (30.532564,-89.875481)
#17     *0x8ff53e8c8    la.la0435sprtal LA41 (30.542387,-89.875653)
         0x8ff582420    la.la0435sprtal LA41 (30.542387,-89.875653)
         0x8ff582420    la.la0041 LA435Spr (30.542387,-89.875653)
#18     *0x8ff53e828    la.la0435sprtal LA435 (30.532564,-89.875481)
         0x8ff582360    la.la0435sprtal LA435 (30.532564,-89.875481)
         0x8ff582360    la.la0435 LA435Spr (30.532564,-89.875481)
#19     *0x900080048    la.la1088 LA36 (30.433911,-89.91344)

What are we looking at here?

  • This examines the waypoints in WaypointQuadtree at (30.41015625,-90) to (30.5859375,-89.6484375) contains 37 waypoints
  • The first line for each, with the point number, contains the address of that Waypoint object's colocated variable. With a *.
  • Subsequent lines list each point in the *colocated list if there is one, and the value stored in their colocated pointer. These should all be the same in each group, yet check out # 3...
  • Once we get to # 19, la.la1088 LA36 is the point with the bad colocated pointer per # 3. We try to dereference it, and there's our crash.

When does it happen?

How early in the process?

In the siteupdate.log excerpt above, note how when detecting colocations while reading in WPTs for usala, both la.la1088 LA36 (30.433911,-89.91344) and la.la0036 LA1088 (30.433911,-89.91344) have the same value in their colocated pointer.
Just to show I did it right and am not going crazy, here's the code the produces that notice:

diff --git a/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp b/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp
index 334524e..296c17d 100644
--- a/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp
+++ b/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp
@@ -58,6 +58,19 @@ void WaypointQuadtree::insert(Waypoint *w, bool init)
                                }
                                other_w->colocated->push_front(w);
                                w->colocated = other_w->colocated;
+
+                               // <<< DEBUG
+                               if (w->str() == "la.la1088 LA36 (30.433911,-89.91344)")
+                               {       std::cout << "DEBUG:  this=la.la1088 LA36 (30.433911,-89.91344)" << std::endl;
+                                       std::cout << "DEBUG:  this->colocated=" << (void*)w->colocated << std::endl;
+                                       std::cout << "DEBUG: other->colocated=" << (void*)other_w->colocated << std::endl;
+                               }
+                               if (other_w->str() == "la.la1088 LA36 (30.433911,-89.91344)")
+                               {       std::cout << "DEBUG: other=la.la1088 LA36 (30.433911,-89.91344)" << std::endl;
+                                       std::cout << "DEBUG:  this->colocated=" << (void*)w->colocated << std::endl;
+                                       std::cout << "DEBUG: other->colocated=" << (void*)other_w->colocated << std::endl;
+
+                               }// DEBUG >>>
                        }
                }
                if (!w->colocated || w == w->colocated->back())

What happens after this?

  • For the remainder of WaypointQuadtree::insert, we don't change the value of colocated; only check whether non-zero, and dereference it. We then return to Route::read_wpt.
  • For the remainder of Route::read_wpt, we run datachecks...
    • For the per-route datachecks, the code is right there. No usage of colocated.
    • For the single-point datachecks, we must look at Waypoint.cpp. Only DUPLICATE_COORDS uses colocated, again only checking whether non-zero and dereferencing.
    • ... and add a new HighwaySegment to segment_list.

Route::read_wpt finishing brings us to...

siteupdate.cpp

      #ifdef threading_enabled
	// stuff that never happens in siteupdateST because of the #ifdef
      #else
	for (HighwaySystem* h : highway_systems)
	{	std::cout << h->systemname << std::flush;
		bool usa_flag = h->country->first == "USA";
		for (Route* r : h->route_list)
			r->read_wpt(&all_waypoints, &el, args.highwaydatapath+"/hwy_data", usa_flag, &all_wpt_files);
		std::cout << "!" << std::endl;
	}
      #endif

	// DEBUG //
	list<Waypoint*> schmist = all_waypoints.point_list();
	schmist.sort();
	ofstream schmile("schmile");
	for (Waypoint *w : schmist)
	{	schmile << &w->colocated << '\t';
		schmile << w->colocated << '\t';
		if (!w->colocated) schmile << '\t';
		schmile << w->str() << std::endl;
	}

	cout << et.et() << "Sorting waypoints in Quadtree." << endl;

Top: existing code in main, for context, as "Reading waypoints for all routes." finishes off.
Bottom: new debug code that iterates through all Waypoint objects by memory address, listing for each:

  • the address of its colocated variable
  • and the value stored in its colocated pointer.

What do we see here?

[yakra@noreaster ~/TravelMapping/DataProcessing/siteupdate/cplusplus]$ egrep 'la.la1088 LA36|la.la0036 LA1088'  schmile
0x8fe9029e8	0x9000aee40	la.la0036 LA1088 (30.433911,-89.91344)
0x900080048	0xaee40	la.la1088 LA36 (30.433911,-89.91344)

It'd seem that nothing else happens after WaypointQuadtree insertion before this sanity check, and that this one pointer's value magically changes. I'd almost have no rational explanation other than possibly bad RAM, but not all is necessarily as it seems.
To Be Continued...

@yakra

This comment has been minimized.

@yakra

This comment has been minimized.

@yakra

This comment has been minimized.

@yakra
Copy link
Contributor

yakra commented Jan 26, 2021

  • Everything after all_waypoints->insert(w, 1); for the remainder of Route::read_wpt commented out. Even delete[] wptdata. Still crashes.
  • If it's something in the code, it's in WaypointQuadtree::insert...

Home come the pointer is fine if located < 0x900000000, but gets mangled if located > 0x900000000?
(Worth noting that the g++ version doesn't store Waypoints in this memory range -- more 0x84####### ballpark. Could this be why it doesn't crash?)

Narrowing down, with main returning 0 after sorting quadtree. Assuming addresses are lower due to fewer instructions in memory

address status HwyData
commit
tweaks
0x8ffefbc08 OK 60d774f distance_update disabled
0x8fffb83c8 OK 60d774f first 3 datachecks, invalid_ends, selfref, slashes, sharp_angle disabled
0x8fffd2508 OK 60d774f first 3 datachecks, invalid_ends thru slashes, sharp_angle disabled
0x8fffe6928 OK 60d774f first 3 datachecks, invalid_ends thru lacks_generic, sharp_angle disabled
0x8fffe7508 OK 60d774f annoying US70
0x8fffe7b48 OK 60d774f ↑ all but last 3 records of annoying US60
0x8fffe7f08 crash 60d774f ↑ all but last 2 records of annoying US60
0x8fffe8c08 crash 60d774f ↑ all of annoying US60
0x8ffff6968 crash 60d774f ↓ deleted ME166A
0x8ffff9be8 crash 60d774f first 3 datachecks, invalid_ends, selfref, slashes disabled
0x8ffff9d28 crash 60d774f first 3 datachecks, invalid_ends, selfref disabled
0x90004ba68 crash 60d774f first 3 datachecks, invalid_ends disabled
0x90004bb08 crash 60d774f first 3 datachecks disabled
0x900052048 crash 60d774f no changes to read_wpt.cpp

@yakra
Copy link
Contributor

yakra commented Jan 26, 2021

If it's something in the code, it's in WaypointQuadtree::insert...

...Or is it?
In Waypoint.h, moving std::list<Waypoint*> *colocated; earlier in the class definition, the 2nd item after only Route *route;, makes the problem go away. Was it a failure of a pointer to fall right on an 8-byte boundary, causing wacky antics?

  • Does sizeof(Waypoint) differ between clang & g++? Yes. 152 & 192 respectively.
  • Does sizeof(Waypoint) change when the header is rearranged? No for both compilers.

yakra added a commit to yakra/DataProcessing that referenced this issue Jan 31, 2021
* C arrays for updates entries
* colocated moved with Waypoint (there appear to be sizeof/offset issues in clang, as noted in TravelMapping#375)
yakra added a commit to yakra/DataProcessing that referenced this issue Jan 31, 2021
* C arrays for updates entries
* colocated moved within Waypoint (there appear to be sizeof/offset issues in clang, as noted in TravelMapping#375)
@yakra yakra added the BSD label Jun 21, 2021
@yakra
Copy link
Contributor

yakra commented Jul 6, 2021

A full pass thru siteupdate with valgrind will be worthwhile.
It reported there were some errors / things to look at, almost all of which I glossed over in the rush to solve #443.
It'll be good to see what may be lurking in the shadows. And maybe my memory management (and whatever else) is a bit sloppy.

@jteresco
Copy link
Contributor Author

jteresco commented Jul 6, 2021

"[your] memory management" -> "C++'s memory management"

@yakra
Copy link
Contributor

yakra commented Jul 7, 2021

Heh. My memory mgmt too, considering what I've done (and not done) knowing the language's shortcomings.
Like, take all the

foo* bar = new foo;
	   // deleted on termination of program

littered throughout the source. I explicitly delete very little of it. Waypoints, HighwaySegments, HighwayGraph components, TravelerLists? I wait till the program ends, and System Monitor says I magically get all those gigs of space back, and my RAM use is back where it was before I ran siteupdate. (Don't know if that's a compiler thing, or an OS thing, or how that works...)

Looks like Valgrind has something to say about all this.
What I'm doing may not exactly be best practice. :)


Another minor surprise I didn't look into was something about a branch relying on uninitialized data. Edge construction or compression IIRC. A bit of a surprise; I'd think I'd notice some effects...

@yakra
Copy link
Contributor

yakra commented Jul 7, 2021

Another minor surprise I didn't look into was something about a branch relying on uninitialized data. Edge construction or compression IIRC. A bit of a surprise; I'd think I'd notice some effects...

It happens exactly once, for US19TrkPit.

// checks for the very unusual cases where an edge ends up
// in the system as itself and its "reverse"
for (HGEdge *e : vertex1->incident_s_edges)
if (e->vertex1 == vertex2 && e->vertex2 == vertex1)
{ delete this;
return;
}
for (HGEdge *e : vertex2->incident_s_edges)
if (e->vertex1 == vertex2 && e->vertex2 == vertex1)
{ delete this;
return;
}
format = simple | collapsed | traveled;

format needs to be initialized before the destructor is called.

We get 3 errors when the destructor calls detach.
The branch itself is no-harm-no-foul: we either do or don't search the endpoint's adjacency lists for this, and find nothing.
Then one more error: half the time we'll neglect to delete the s_written array, and leak 1 byte per thread.

@yakra
Copy link
Contributor

yakra commented Jul 9, 2021

Then one more error: half the time we'll neglect to delete the s_written array, and leak 1 byte per thread.

Nope. We neglect to delete it all the time, even with the fix in place, leaking 1 byte per thread per edge.
(As of HwyData@0a07f53 & UserData@412dfaf2, that's 932211 bytes per thread.)

The destructor calls detach, which zeros out format, thus if (format & simple) delete[] s_written; never happens.

Best to delete s_written when the edge is detached from the simple graph.
Similarly deleting c_written and t_written upon detach avoids having them hanging around being useless until the edge is deleted.
This eliminates the need for an explicitly defined HGedge destructor altogether. Cleaner code for the win.

Better yet?
Separate simple/collapsed/traveled arrays waste RAM when only one bit of each element is needed.
Instead, have one single written array & use a bitmask. Adaptation should be easy; most of the heavy lifting is already done.

  • Losing 2 pointers saves 16 bytes per edge.
  • For the arrays themselves, we save 1 (s_written is only allocated for edges in the simple graph) or 2 bytes per edge per thread.
  • At the aforementioned commits, that's 14,915,376 B & 1,754,407 B per thread respectively. Just under 16 MiB at 1 thread.
  • Fewer lines of code FTW.
  • Any noticeable effects on speed?
qty format description
299120 1 simple only
26 2 collapsed only
0 3 simple + collapsed
These should never exist. Any vertices collapsed for the
traveled graph are also collapsed for the collapsed graph.
15 4 traveled only
38 5 simple + traveled
109974 6 collapsed + traveled
523038 7 simple + collapsed + traveled
Everything built by the first constructor starts out as one of these.
932211 Total all edges in all master graphs

@jteresco
Copy link
Contributor Author

jteresco commented Jan 1, 2023

We're there!

@jteresco jteresco closed this as completed Jan 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants