Change Id generation for graphs from using hashes of URLs to using .zipWithUniqueId() #289
Conversation
Codecov Report
@@ Coverage Diff @@
## master #289 +/- ##
==========================================
+ Coverage 70.36% 73.07% +2.71%
==========================================
Files 41 42 +1
Lines 1046 1170 +124
Branches 192 205 +13
==========================================
+ Hits 736 855 +119
- Misses 244 246 +2
- Partials 66 69 +3
Continue to review full report at Codecov.
The GEXF generated by this works fine, but the GraphML won't work in Gephi. I used this script:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph.asGraphml(links, "/Users/ianmilligan1/desktop/links-for-gephi.graphml")
I tested the same ARCs/WARCs in 0.17.0 and the GraphML works when I created it there with WriteGraphML, so it's something in the new function.
Here's the error in Gephi:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[387,35]
Message: Element type "edge" must be followed by either attribute specifications, ">" or "/>".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Tested and the GraphML and GEXF files both work now.
One additional advantage: in Gephi, you sometimes have to provide the node ID (to create an ego network, for example). Right now, that involves copying and pasting a long hash, which isn't trivial in their UI (it requires a few clicks to copy and paste). Now, you could perhaps just remember a three- or four-digit number and use it accordingly.
Anyways, I think it's a good approach to keep them in parallel at least for a little bit longer, and we can talk about the future of the three functions when we have a chance for a standup.
Just a few additional notes here. ExtractGraphX still uses hashes rather than .zipWithUniqueId(). It's possible that its graph generation approach is slightly faster due to optimization (when not calculating PageRank etc.). I'm going to explore that more now that I'm coming closer to the end of the teaching term.
Would that be in a separate PR or would it potentially affect this one?
I think it should be a separate PR, possibly referencing a different issue. I've been thinking about ExtractGraphX as a way to reduce the problems with GraphPass, but it's possible we could see some small efficiency gains just for regular aut production. I'd like to test that out. I just wanted to include the note in this PR since the id approach is still the old one.
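For anyone following along, here is a minimal sketch of what id assignment via .zipWithUniqueId() can look like, assuming links holds the ((crawlDate, source, target), count) tuples produced by the scripts above; the variable names are illustrative and this is not the actual WriteGraph code:
// Sketch only; not the aut implementation. `links` is an RDD[((String, String, String), Int)].
val nodes = links.flatMap(r => List(r._1._2, r._1._3)).distinct
val idLookup = nodes.zipWithUniqueId().collectAsMap()   // label -> unique Long id
val edges = links.map(r => (idLookup(r._1._2), idLookup(r._1._3), r._2))
collectAsMap() pulls the lookup onto the driver, which is fine for a sketch; the real implementation may keep everything as RDDs and join instead.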
minor suggestions, address and merge
I don't need to see it again.
I got a failure running this on rho:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/geocities/warcs/indexed/GEOCITIES-20091029185858-00184-crawling08.us.archive.org.warc.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph(links, "/home/i2millig/results/new-gephi.gexf")
Did the syntax change in the revisions, or is there an issue? The error log is here, but the tl;dr is:
The same file works with the older WriteGEXF.
It looks like the problem is that, when converting the edge tuples to ids, I did not escape the XML first. That's why the output is wonky.
- added xml escaping for edges.
- added test case for non-escaped edges.
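For context, here is a minimal sketch of the kind of escaping this refers to; the helper name and call site are illustrative assumptions, not necessarily the PR's actual implementation:
// Illustrative escaping helper; the PR's real code may differ.
def escapeXml(s: String): String =
  s.replaceAll("&", "&amp;")      // & first, so the entities added below are not themselves re-escaped
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll("\"", "&quot;")
    .replaceAll("'", "&apos;")
// every value written into a node or edge attribute gets passed through it, e.g.
// s"""<edge source="$sid" target="$tid" label="${escapeXml(label)}" />"""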
Am sending the WARC to you via Slack, @greebie. As we've now had a number of fails on this, could you also do a close line-by-line read of your new WriteGraph?
The bigger problem is that WriteGEXFTest and WriteGraphmlTest did not test for unescaped XML in the result. They tested the XML escaping function, but not that it was applied in the result. All these errors should be caught at compile time. WriteGraphTest now tests for escaped XML in the result, which should prevent this from happening again.
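To illustrate the gap being closed here (plain Scala, no aut API; the names are made up and this is not the real WriteGraphTest):
val label = "fish & chips"
val escaped = label.replaceAll("&", "&amp;")                          // stand-in for the escaping helper
assert(escaped == "fish &amp; chips")                                 // helper-level test: passes even if the writer never calls the helper
val produced = s"""<edge source="0" target="1" label="$label" />"""   // a writer that forgot to escape
// a result-level check like the one now in WriteGraphTest would fail here and expose the bug:
// assert(!produced.contains("fish & chips"))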
Ran the same code on the new graph and it worked (after a Java heap space error, but that's not related to this). I also loaded the graph into Gephi and it worked with no errors.
OK, worked here too. Can you confirm that there's nothing missing in this PR?
I reviewed the code and it works as expected, based on my knowledge of the circumstances. I have one more push, however, to include some additional tests to cover potential future problems.
Ok ready for me to review again on your end @greebie?
Yes. The tests should cover most of the situations now.
Looks good and works well. Tested on a large collection with:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph(links, "/tuna1/scratch/i2milligan/results/new-gephi.gexf")
and then ran it again, but swapped out the last line with:
WriteGraph.asGraphml(links, "/tuna1/scratch/i2milligan/results/new-gephi.graphml")
Same results as using the original WriteGEXF(links, "/tuna1/scratch/i2milligan/results/new-gephi-old-command.gexf") command, but without the long node IDs; much more legible now.
Over to @ruebot for final review.
@greebie are you done adding code to this PR? We've tested this multiple times, done code review, and signed off, and things keep on getting added.
Done. Last commit was an attempt to improve coverage a bit.
GitHub issue(s):
#243
What does this Pull Request do?
This PR adds WriteGraph, which generates GEXF and/or GraphML files.
There are also some additional utilities such as an id lookup.
This object generates proper unique ids, unlike the current method in WriteGraphML and WriteGEXF, which simply creates ids from an MD5 hash of the URL. The new approach is better because the hash-based method could produce incorrect graphs when hashes collide (e.g. with very large graphs).
Timing tests with a medium-sized graph show that it increases processing time by 10-15%. However, it could reduce the size of network graph derivatives by an unknown margin (because numeric ids are smaller than hashes).
Example:
Old way produces: node ids that are 32-character MD5 hashes of the URL.
New way: short numeric ids assigned by .zipWithUniqueId().
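As a rough illustration of the difference (the hashing code below is a stand-in to show the shape of the old ids, not the actual aut implementation):
import java.security.MessageDigest
def md5Id(label: String): String =
  MessageDigest.getInstance("MD5").digest(label.getBytes("UTF-8")).map("%02x".format(_)).mkString
md5Id("archive.org")        // old-style id: a 32-character hex string
// new-style ids are just the Longs handed out by .zipWithUniqueId(), e.g. 0, 1, 2, ...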
How should this be tested?
Should produce a GEXF file that opens in Gephi.
Change the last line to
WriteGraph.asGraphml(links, "new-gephi3.gexf")
to get GraphML.
I have not tested it in GraphPass yet, but there is no reason why it should not work as directed.
Additional Notes:
- WriteGraph would replace WriteGEXF & WriteGraphml, which can be deprecated.
- CommandLineApp would also have to be changed before WriteGEXF etc. are removed.
- WriteGraph udf which was deprecated and removed.
Interested parties
@lintool @ruebot @ianmilligan1
Thanks in advance for your help with the Archives Unleashed Toolkit!