Replace hashing of unique ids with .zipWithUniqueId() #243
Comments
I found a solution, which still needs testing, but here are the current time trials using this code and the UVic local news WARCs:

…

The new way is definitely slower, but within 10-20%.
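For reference, a minimal sketch of the idea (not the exact code from the pushed branch), assuming an RDD of (src, dst) link pairs extracted from the WARCs. The two joins needed to map labels back to their new ids are also where the extra 10-20% of processing time would come from:

```scala
import org.apache.spark.rdd.RDD

// Assign every distinct node label a collision-free id, then rewrite the
// edge list in terms of those ids.
def buildGraph(links: RDD[(String, String)]): (RDD[(String, Long)], RDD[(Long, Long)]) = {
  val ids: RDD[(String, Long)] = links
    .flatMap { case (src, dst) => Seq(src, dst) }
    .distinct()
    .zipWithUniqueId()

  val edges: RDD[(Long, Long)] = links
    .join(ids)                                        // (src, (dst, srcId))
    .map { case (_, (dst, srcId)) => (dst, srcId) }
    .join(ids)                                        // (dst, (srcId, dstId))
    .map { case (_, (srcId, dstId)) => (srcId, dstId) }

  (ids, edges)
}
```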
Thanks for this @greebie! This new approach seems better, but I'm just weighing the cost/benefit of longer processing time vs. avoiding all hash collisions. Right now, when we have collisions, Gephi automatically fixes those, is that correct?
Not really. Gephi merges items whose ids collide, so with large collections it could produce a misrepresentation of the graph. However, I have not come across it in this example.
Why not .zipWithUniqueId()? You just need the ids to be unique - they don't need to be sequential, right?
I'm using zipWithUniqueId(). :)
Oh, I misread then. In the first comment in the issue you wrote: "The .zipWithIndex() feature in Apache Spark would be a better approach."
That's right - the current pushed branch uses zipWithUniqueId() instead, for the reasons you said. (I changed the issue title to avoid future confusion.)
Okay - I've decided to keep the existing …
OK, so this proposed approach would have …
That's right. It means that instead of "fixing" WriteGEXF, I am adding this new approach, leaving the following possibilities:

…
This wouldn't truly affect AUK until there was a new release of AUT. That said, can you provide more detail as to what we'd have to do to change the workflow? Would we still produce …?
@ruebot The only change to aut should be the aut command for …. The Graphpass workflow should remain the same. Alternately, I can make GraphML the default WriteGraph behavior, and then the only difference would be …. Basically, I chose this approach because I was duplicating code between WriteGEXF and WriteGraphml, and it started to seem that I should put them both together.
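If it helps, here is a rough sketch of the consolidation idea. The name WriteGraph follows the discussion above, but the signature, the format dispatch, and the placeholder serializers are my assumptions rather than the final aut API:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical single entry point replacing separate WriteGEXF/WriteGraphml
// calls; GraphML is treated as the default output format.
object WriteGraph {
  // edges: ((source label, target label), weight)
  def apply(edges: RDD[((String, String), Int)], path: String): Unit =
    if (path.endsWith(".gexf")) writeGexf(edges, path)
    else writeGraphml(edges, path)

  // Placeholders standing in for the existing serializers.
  private def writeGexf(edges: RDD[((String, String), Int)], path: String): Unit = ???
  private def writeGraphml(edges: RDD[((String, String), Int)], path: String): Unit = ???
}
```

Usage would then be a single call such as `WriteGraph(links, "graph.graphml")` or `WriteGraph(links, "graph.gexf")`, rather than picking a different UDF per format.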
@ruebot I realize I failed to answer your last questions. The graphs produced by graphpass include metadata about how to display the graph in Gephi or Sigma (i.e. how big the dots should be, what color they are, and where they should be positioned in the visualization). AUK-produced graphs, unfortunately, just provide the raw network data with no visualization metadata. The best we have in auk right now is ExtractGraphX, which produces metadata for node sizes and a few other things; it can offer a fair way to reduce the size of large graphs for visualization, but it would increase the amount of time it takes to produce derivatives for small graphs. When we accepted that udf into the repo, we decided it might be good for the toolkit, but it's not quite ready to help auk.
Describe the Enhancement
AUT uses hash values to create unique ids, which can leave us with duplicates of the same URL in a network graph when hashes collide.
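As a quick illustration of why hash-derived ids are risky (the example strings are mine, not labels from AUT's data), Java/Scala string hashing can produce the same value for two different labels:

```scala
// "Aa" and "BB" are the classic String.hashCode collision: both hash to
// 2112, so two distinct labels would end up sharing a node id.
val idA = "Aa".hashCode
val idB = "BB".hashCode
assert(idA == idB)
```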
To Reproduce
Steps to reproduce the behavior (e.g.):
1. Run the Domain Graph Extractor with a large number of network nodes (websites).
2. Open the resulting graph in Gephi.
3. Discover duplicate websites in the graph.
Expected behavior
All network nodes should be unique.
Screenshots
N/A
Additional context
The .zipWithIndex() feature in Apache Spark would be a better approach. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
.zipWithUniqueId() does not trigger a separate Spark job, so it could be faster.
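For instance, in a spark-shell session (`sc` is the shell's SparkContext; the sample labels are arbitrary):

```scala
val nodes = sc.parallelize(Seq("a.com", "b.com", "c.com"), numSlices = 2)

nodes.zipWithIndex().collect()
// e.g. Array((a.com,0), (b.com,1), (c.com,2)) -- sequential ids, but
// computing them needs an extra Spark job when the RDD has more than one
// partition.

nodes.zipWithUniqueId().collect()
// e.g. Array((a.com,0), (b.com,1), (c.com,3)) -- ids are unique but not
// consecutive (itemIndexInPartition * numPartitions + partitionId), and no
// extra job is triggered.
```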
See also #228