Change Id generation for graphs from using hashes of URLs to using .zipWithUniqueId() #289
Conversation
Codecov Report
@@ Coverage Diff @@
## master #289 +/- ##
==========================================
+ Coverage 70.36% 73.07% +2.71%
==========================================
Files 41 42 +1
Lines 1046 1170 +124
Branches 192 205 +13
==========================================
+ Hits 736 855 +119
- Misses 244 246 +2
- Partials 66 69 +3
Continue to review full report at Codecov.
The GEXF generated by this works fine, but the GraphML won't work in Gephi. I used this script:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph.asGraphml(links, "/Users/ianmilligan1/desktop/links-for-gephi.graphml")
I tested the same ARCs/WARCs in 0.17.0 and the GraphML works when I created it there with WriteGraphML, so it's something in the new function.
Here's the error in Gephi:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[387,35]
Message: Element type "edge" must be followed by either attribute specifications, ">" or "/>".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Tested and the GraphML and GEXF files both work now.
One additional advantage: in Gephi, you sometimes have to provide the node ID (to create an ego network, for example). Right now, that involves copying and pasting a long hash, which isn't trivial in their UI (it requires a few clicks to copy and paste). Now, you could perhaps just remember a three- or four-digit number and use it accordingly.
Anyways, I think it's a good approach to keep them in parallel at least for a little bit longer, and we can talk about the future of the three functions when we have a chance for a standup.
Just a few additional notes here. ExtractGraphX still uses hashes rather than .zipWithUniqueId(). It's possible that its graph generation approach is slightly faster due to optimization (when not calculating PageRank etc.). I'm going to explore that more now that I'm coming closer to the end of the teaching term.
Would that be in a separate PR or would it potentially affect this one?
I think it should be a separate PR, possibly referencing a different issue. I've been thinking about ExtractGraphX as a way to reduce the problems with GraphPass, but it's possible we could see some small efficiency gains just for regular aut production. I'd like to test that out. I just wanted to include the note in this PR since the id approach is still the old one.
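For anyone following along, here is a minimal sketch of what id assignment via .zipWithUniqueId() can look like, assuming links holds the ((crawlDate, source, target), count) tuples produced by the scripts above; the variable names are illustrative and this is not the actual WriteGraph code:
// Sketch only; not the aut implementation. `links` is an RDD[((String, String, String), Int)].
val nodes = links.flatMap(r => List(r._1._2, r._1._3)).distinct
val idLookup = nodes.zipWithUniqueId().collectAsMap()   // label -> unique Long id
val edges = links.map(r => (idLookup(r._1._2), idLookup(r._1._3), r._2))
collectAsMap() pulls the lookup onto the driver, which is fine for a sketch; the real implementation may keep everything as RDDs and join instead.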
minor suggestions, address and merge
I don't need to see it again.
I got a failure running this on rho:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/geocities/warcs/indexed/GEOCITIES-20091029185858-00184-crawling08.us.archive.org.warc.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph(links, "/home/i2millig/results/new-gephi.gexf")
Did the syntax change in the revisions, or is there an issue? The error log is here, but the tl;dr is:
The same file works with the older WriteGEXF.
It looks like the problem is that, when converting the edge tuples to ids, I did not escape the XML first. That's why the output is wonky.
- added xml escaping for edges.
- added test case for non-escaped edges.
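For context, here is a minimal sketch of the kind of escaping this refers to; the helper name and call site are illustrative assumptions, not necessarily the PR's actual implementation:
// Illustrative escaping helper; the PR's real code may differ.
def escapeXml(s: String): String =
  s.replaceAll("&", "&amp;")      // & first, so the entities added below are not themselves re-escaped
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll("\"", "&quot;")
    .replaceAll("'", "&apos;")
// every value written into a node or edge attribute gets passed through it, e.g.
// s"""<edge source="$sid" target="$tid" label="${escapeXml(label)}" />"""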
Am sending the WARC to you via Slack, @greebie. As we've now had a number of fails on this, could you also do a close line-by-line read of your new WriteGraph?
The bigger problem is that WriteGEXFTest and WriteGraphmlTest did not test for unescaped XML in the result. They tested the XML escaping function, but not that it was applied in the result. All these errors should be caught at compile time. WriteGraphTest now tests for escaped XML in the result, which should prevent this from happening again.
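To illustrate the gap being closed here (plain Scala, no aut API; the names are made up and this is not the real WriteGraphTest):
val label = "fish & chips"
val escaped = label.replaceAll("&", "&amp;")                          // stand-in for the escaping helper
assert(escaped == "fish &amp; chips")                                 // helper-level test: passes even if the writer never calls the helper
val produced = s"""<edge source="0" target="1" label="$label" />"""   // a writer that forgot to escape
// a result-level check like the one now in WriteGraphTest would fail here and expose the bug:
// assert(!produced.contains("fish & chips"))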
Ran the same code on the new graph and it worked (after a Java heap space error, but that's not related to this). I also loaded the graph into Gephi and it worked with no errors.
OK, worked here too. Can you confirm that there's nothing missing in this PR?
I reviewed the code and it works as expected, based on my knowledge of the circumstances. I have one more push, however, to include some additional tests to cover potential future problems.
Ok ready for me to review again on your end @greebie?
Yes. The tests should cover most of the situations now.
Looks good and works well. Tested on a large collection with:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
WriteGraph(links, "/tuna1/scratch/i2milligan/results/new-gephi.gexf")
and then ran it again, but swapped out the last line with:
WriteGraph.asGraphml(links, "/tuna1/scratch/i2milligan/results/new-gephi.graphml")
Same results as using the original WriteGEXF(links, "/tuna1/scratch/i2milligan/results/new-gephi-old-command.gexf") command, but without the long node IDs; much more legible now.
Over to @ruebot for final review.
@greebie are you done adding code to this PR? We've tested this multiple times, done code review, and signed off, and things keep on getting added.
Done. Last commit was an attempt to improve coverage a bit.
GitHub issue(s):
#243
What does this Pull Request do?
This PR adds WriteGraph, which generates GEXF and/or GraphML files.
There are also some additional utilities such as an id lookup.
This object generates proper unique ids, unlike the current method in WriteGraphML and WriteGEXF, which simply creates ids from an MD5 hash of the URL. The new approach is better because the hash-based method could produce incorrect graphs when hashes collide (e.g. with very large graphs).
Timing tests with a medium-sized graph show that it increases processing time by 10-15%. However, it could reduce the size of network graph derivatives by an unknown margin (because numeric ids are smaller than hashes).
Example:
Old way produces: node ids that are 32-character MD5 hashes of the URL.
New way: short numeric ids assigned by .zipWithUniqueId().
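As a rough illustration of the difference (the hashing code below is a stand-in to show the shape of the old ids, not the actual aut implementation):
import java.security.MessageDigest
def md5Id(label: String): String =
  MessageDigest.getInstance("MD5").digest(label.getBytes("UTF-8")).map("%02x".format(_)).mkString
md5Id("archive.org")        // old-style id: a 32-character hex string
// new-style ids are just the Longs handed out by .zipWithUniqueId(), e.g. 0, 1, 2, ...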
How should this be tested?
Should produce a GEXF file that opens in Gephi.
Change the last line to
WriteGraph.asGraphml(links, "new-gephi3.gexf")
to get GraphML.
I have not tested it in GraphPass yet, but there is no reason why it should not work as directed.
Additional Notes:
- WriteGraph would replace WriteGEXF & WriteGraphml, which can be deprecated.
- CommandLineApp would also have to be changed before WriteGEXF etc. are removed.
- WriteGraph udf which was deprecated and removed.
Interested parties
@lintool @ruebot @ianmilligan1
Thanks in advance for your help with the Archives Unleashed Toolkit!