-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graph generation: how many/which users have clinched each segment #196
Comments
I should bring in @yakra's recent site update changes first and go from that code. |
This would be a new version number in the header of TMG files, and a new graph type. I'd only do this for collapsed graphs. I am rarely using the simple format graphs anymore. |
@yakra wrote:
@jteresco wrote:
@yakra wrote:
Thankfully, cases of this happening in the real world should be relatively rare, and we won't be saddled with bloated & ugly graphs. How do we do this efficiently? Spitballing...
|
On the downside:
Should be easy to include. In a traditional TMG file, our number of lines is known, 2+v+e. Add one more line after all this simply listing the corresponding user IDs in order. It makes sense to me to include the hex string before the shaping points' coords, as there are an indeterminate number of shaping points before end of line. |
OK, I've got a draft implemented in C++. MEX-QROO-region.tmg:
To break down the hex string:
Parsing the first one above, for example:
NH-region.tmg:
There's a diff in the character representing travelers 208-211. Twos bit stores traveler 209: terescoj. Notes:
|
This was due to a bug in the reworked |
Grumble. Hex codes are looking more deterministic, but I can tell the final character frequently gets mangled. With 238 list files, we should only have the ones bit or twos bit set here, and thus have nothing > 3. But I'm seeing lots of Buggy code at: https://github.com/yakra/DataProcessing/tree/tmg2_cpp |
std::string HighwaySegment::clinchedby_code(std::list<TravelerList*> *traveler_lists)
{ std::string code;
std::vector<unsigned char> clinch_vector(ceil(traveler_lists->size()/4)*4, 0);
for (TravelerList* t : clinched_by)
clinch_vector[t->traveler_num] = 1;
for (unsigned short t = 0; t < traveler_lists->size(); t += 4)
{ unsigned char nibble = 0;
nibble |= clinch_vector[t];
nibble |= clinch_vector[t+1]*2;
nibble |= clinch_vector[t+2]*4;
nibble |= clinch_vector[t+3]*8;
if (nibble < 10) nibble += '0';
else nibble += 55;
code.push_back(nibble);
}
return code;
} -> std::string HighwaySegment::clinchedby_code(std::list<TravelerList*> *traveler_lists)
{ std::string code;
std::vector<unsigned char> clinch_vector(ceil(traveler_lists->size()/4)*4, 0);
for (TravelerList* t : clinched_by)
clinch_vector.at(t->traveler_num) = 1;
for (unsigned short t = 0; t < traveler_lists->size(); t += 4)
{ unsigned char nibble = 0;
nibble |= clinch_vector.at(t);
nibble |= clinch_vector.at(t+1)*2;
nibble |= clinch_vector.at(t+2)*4;
nibble |= clinch_vector.at(t+3)*8;
if (nibble < 10) nibble += '0';
else nibble += 55;
code.push_back(nibble);
}
return code;
} ...and now I can see stuff breaking:
OK, now I'm getting somewhere... Oddly, in the same session, we have |
|
Wow, thanks @yakra for jumping on this! It would be really great to get a preliminary set of graphs, even if not perfect, generated this week so I could share the files and file format with my students who might want to work with this data for an upcoming project. |
Just committed that last bugfix; graphs are looking good so far. Next up, uploading the latest code to noreaster, compiling, and producing some graphs there. |
preliminary graphs: http://yakra.teresco.org/tmtools_demos/tmg_2.0a/ |
This is fantastic. A couple quick requests if you have time to make it happen.
|
Already done; I never changed over the simple format. The tricky bit:
|
Yes, I'm thinking 3 for each instead of 2. I don't want to just replace the old ones since there's a lot of code, HDX and some of my Java code for class stuff, that I don't want to have to rewrite to ignore the new information. I'll worry later about getting the graphs download page to find the new stuff. I hope to look at the new graphs and try some things out with them in the next day or two. |
Quick thought that you are free to ignore, @yakra. Many of our graphs will have lots of segments with very few travelers, meaning many 0 digits. I wonder about an alternate "sparse" format for the codes in those cases, where we instead have a list of actual traveler numbers, maybe again in hex to save a few characters. Plus side: save a little space. Negative side: complication of our format and the need to calculate both possibilities on graph generation and the ability to decode both possibilities on any code that loads in the graphs. |
On the plus side, I recently upgraded BiggaTomato to 8 GB, so I've less to worry about if we're introducing more data. But, on to more brainstorming... An idea to save RAM, that could also simplify (?) things if we start to get into 3+ graph data sets:
I left the original
One thought I had was run-length encoding for zeros. Follow a
Thinking about this some more, I bet the RLE scheme might save more space, and remove the need to determine which encoding scheme to read/write. |
For the moment, it would be really great and get me through the initial usage this semester if you could generate the set again but with "traveled" instead of "collapsed" in both the filenames and the first line of each file. Then we can think about how best to proceed longer-term. |
Also, don't feel any obligation to generate new graphs based on my previous comment. The ones there can easily be used by my students. I might just rename them and change the one word at the top and go with them. I can't wait to have time to enhance HDX to support the new format and experiment with ways to visualize this info. |
run-length
|
I have what I need for now - I grabbed the graphs in tmg_2.0a, renamed the collapsed files and replaced "collapsed" with "traveled". Many, many thanks for getting this off the ground so quickly. |
Regarding RLE vs. the hex string, I can see advantages to either. The hex string is a little simpler to deal with, but 35% savings in graph size is something to consider. On the RLE side, I agree limiting to a single hex digit per "run" makes a lot of sense to me. As far as usability for my purposes, I think the hex string will be easier to explain and a tad easier to use. I'm going to build a Java program today to read in these graphs as a starting point for my students' projects. |
OK, let's go with that then. If the students are just starting to learn about this, let's not go confusing matters with extra info or procedures.
Same here; it'll be fun to see the results.
OK, good to hear. I'll save that change for when I make the changes for 3 graph sets, and avoid cluttering the commit history with short-lived edits that get undone in the next version. Odds & ends
|
Just dumping this here in case I want to refer to it in the future. std::string HighwaySegment::clinchedby_code(std::list<TravelerList*> *traveler_lists)
{ // Return a hexadecimal string encoding which travelers have clinched this segment, for use in "traveled" graph files
// Each character stores info for traveler #n thru traveler #n+3
// The first character stores traveler 0 thru traveler 3,
// The second character stores traveler 4 thru traveler 7, etc.
// For each character, the low-order bit stores traveler n, and the high bit traveler n+3.
std::string code;
std::vector<unsigned char> clinch_vector(ceil(double(traveler_lists->size())/4)*4, 0);
for (TravelerList* t : clinched_by)
clinch_vector.at(t->traveler_num) = 1;
bool prev_zero = 0;
for (unsigned short t = 0; t < traveler_lists->size(); t += 4)
{ unsigned char nibble = 0;
nibble |= clinch_vector.at(t);
nibble |= clinch_vector.at(t+1)*2;
nibble |= clinch_vector.at(t+2)*4;
nibble |= clinch_vector.at(t+3)*8;
if (nibble < 10) nibble += '0';
else nibble += 55;
// run-length encoding for zeros
if (nibble == '0')
{ if (code.empty() || code.back() == 'F')
prev_zero = 0;
if (prev_zero)
if (code.back() == '9') code.back() = 'A';
else code.back()++;
else { prev_zero = 1;
code += "00";
}
}
else { prev_zero = 0;
code.push_back(nibble);
}
}
return code;
} More efficient to move the "int"->printable char conversion into the final else block, but that's all academic if this won't be implemented anyway. |
Note to self, remember to:
|
Progress report:
siteupdate.cpp compiles, but will not run properly yet, as I still need to address the items in the list above. |
The new code is mostly in place. There are several bugs to sort out, and the first one is a doozy. |
I think I'm writing, and maybe also reading, thru a rogue pointer. It may be more productive in the near-term to try an implementation in siteupdate.py |
DataProcessing/siteupdate/cplusplus/classes/GraphGeneration/HGCollapsedEdge.cpp Lines 38 to 47 in 727c000
Cue the "One cannot simply" meme. I don't believe this bug is responsible for the full set of symptoms I was seeing; I think it should only result in garbage graph data. Still needs to be fixed though. How to most efficiently do this will require a bit of thought, for C++ and Python alike. Meanwhile, I've been working on the Python implementation.
|
Hidden vertices not collapsed in traveled graphsThese have exactly two incident edges, whose sets of
I'll give these a look over, and for the ones in my regions, see about either unhiding them or collapsing into visible points. |
Python progress report:
|
subgraphs & |
Hex codes are in place. The first couple I checked out look OK, but doing a little more QA is still a good idea.
Once the Python changes get merged into master (or I suppose even before), I'll start the C++ implementation over, mirroring the changes commit-by-commit. |
Did more QA. Looks good. Time trials on BiggaTomato:
Potential speed improvements:
This conflicts with #199. We can have one enhancement or the other, but not both. |
Storing & retrieving Another way of looking at it is that full TMG 2.0 traveled implementation added 66s of processing time vice 15c7bd6, so we've managed to trim back > 27% of that. But still... outside this context, 18.2s savings altogether is a bit underwhelming. Edit: Add in > 28 MB of RAM for the strings, and I'm less than wild about this. I'll have a crack at #199, and see what filesize savings and processing time penalty it yields. |
The C++ redo is going well.
* Raw collapsed or traveled TMGs are not directly DIFFable, because vertices, edges and intermediate points are not ordered deterministically. This required an intermediate step, a separate program to sort & reorder TMGs into a "canonical" file before DIFFing. |
C++:
|
HidClPtsLogDead-end branch, to be deleted.
Rather than have this as its own discrete loop, it could be integrated into edge compression. |
Yes, I think so. Closing. |
This could be a high-priority project for me in the next week or so if it's not any harder than I'm thinking. I'd like my students to be able to use it very soon.
I'd like to enable more algorithms and variations on algorithms to use the graphs. One additional bit of data that could be very interesting to work with could be the number of users or better yet a set of users who have clinched the highway segment(s) represented by each edge. This could allow all kinds of new problems to be solved interesting academically, to TM project users, or hopefully both.
From a first quick look at this, it looks like it's not hard to remember the user list for each edge, as the edges are initially created from segments, segments have a list of who clinched them, and in almost all cases, a collapsed edge should have the same list of users who have clinched each original edge. So then when the graphs are being written, the extra information could be written with each edge entry. The simplest way would be to list, maybe comma-separated, the users, but that's a lot of extra text in the files. A better way might be to assign each user a number 0 to n-1. Then the clinched info in the graphs is anonymous unless you have the mapping of users to numbers. But then each number really just represents if the user with that number has clinched the segment/edge in question. One bit of information. This means that same info could be stored in a much more compact form: a hex number with n bits, so ceiling(n/4) hex digits, where each bit corresponds to whether a specific user has clinched that segment.
The text was updated successfully, but these errors were encountered: