Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289

Merged
merged 11 commits into from
Nov 22, 2018

Conversation

greebie
Copy link
Contributor

@greebie greebie commented Nov 7, 2018

GitHub issue(s):

#243

What does this Pull Request do?

This PR adds WriteGraph which generates GEXF and/or Graphml files.
There are also some additional utilities such as an id lookup.

This object generates proper unique ids. This differs from the current method in WriteGraphML and WriteGEXF that simply creates ids using an MD5 hash of the url. This method is better because the latter method could produce incorrect graphs in the case where hashes collide (eg. with very large graphs).

Timing tests with a medium-sized graph shows that it increases the processing time by 10-15%. However, it could reduce the size of network graph derivatives by an unknown margin (because numeric ids are smaller than hashes).

Example:
Old way produces:

<node id="405d19a958ba43d88e9edd7a77338aa3" label="laws.justice.gc.ca" />
<node id="69bde2cbb357119a50a950fca99a8341" label="english.uvic.ca" />
<node id="8687e4dc11548a5917504975feb7c649" label="oipc.bc.ca" />
<node id="30192bae759dafccc58bccc268c2b411" label="accuweather.com" />

New way:

<node id="0" label="laws.justice.gc.ca" />
<node id="38" label="english.uvic.ca" />
<node id="76" label="oipc.bc.ca" />
<node id="114" label="accuweather.com" />

How should this be tested?

  • Travis should pass.
def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

timed {
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "new-gephi3.gexf") }

Should produce a Gexf that opens in Gephi.

Change last line to WriteGraph.asGraphml(links, "new-gephi3.gexf") to get graphml.

I have not tested it in GraphPass yet, but there is no reason why it should not work as directed.

Additional Notes:

  • WriteGraphmlwould replace WriteGEXF & WriteGraphml which can be deprecated.
  • CommandLineApp would also have to be changed before WriteGEXF etc. are removed.
  • I think aut had a previous WriteGraph udf which was deprecated and removed.
  • This applies only to the RDD graph functions. DF functions are not changed. (I need to review the way DFs do unique ids)
  • I added a node id lookup tool during the process of developing WriteGraph. It might be useful for the toolkit, but it can also be removed.

Interested parties

@lintool @ruebot @ianmilligan1

Thanks in advance for your help with the Archives Unleashed Toolkit!

@codecov-io
Copy link

codecov-io commented Nov 7, 2018

Codecov Report

Merging #289 into master will increase coverage by 2.71%.
The diff coverage is 95.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #289      +/-   ##
==========================================
+ Coverage   70.36%   73.07%   +2.71%     
==========================================
  Files          41       42       +1     
  Lines        1046     1170     +124     
  Branches      192      205      +13     
==========================================
+ Hits          736      855     +119     
- Misses        244      246       +2     
- Partials       66       69       +3
Impacted Files Coverage Δ
...ain/scala/io/archivesunleashed/app/WriteGEXF.scala 100% <ø> (ø) ⬆️
...in/scala/io/archivesunleashed/app/WriteGraph.scala 95.96% <95.96%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6080a7...dbee737. Read the comment docs.

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GEXF generated by this works fine, but the GraphML won't work in Gephi. I used this script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph.asGraphml(links, "/Users/ianmilligan1/desktop/links-for-gephi.graphml")

I tested the same ARCs/WARCs in 0.17.0 and the GraphML works when I created it there with WriteGraphML, so it's something in the new function.

Here's the error in Gephi:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[387,35]
Message: Element type "edge" must be followed by either attribute specifications, ">" or "/>".
	at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and the GraphML and GEXF files both work now.

One additional advantage: in Gephi, you sometimes have to provide the ID identifier (to create an ego network, for example). Right now, that involves copy and pasting a long hash which isn't trivial in their UI (requires a few clicks to copy-and-paste). Now, you could perhaps remember a three or four digit number and use accordingly.

Anyways, I think this is a good approach to have them in parallel at least for a little bit longer. And we can talk when we have a chance for a standup about future of the three functions.

@greebie
Copy link
Contributor Author

greebie commented Nov 8, 2018

Just a few additional notes here.

ExtractGraphX still uses hash and not .zipWithUniqueId(). It's possible that this graph generation approach is slightly faster due to optimization (when not calculating PageRank etc.). I'm going to explore more now that I'm coming closer to the end of term with teaching.

@ianmilligan1
Copy link
Member

I'm going to explore more now that I'm coming closer to the end of term with teaching.

Would that be in a separate PR or would it potentially affect this one?

@greebie
Copy link
Contributor Author

greebie commented Nov 8, 2018

I think it should be a separate PR, possibly referencing a different issue. I've been thinking about ExtractGraphX as a way to reduce the problems with GraphPass, but it's possible we could see some small efficiency gains just for regular aut production. I'd like to test that out.

I just wanted to include the note in this PR since the id approach is still the old one.

@ianmilligan1
Copy link
Member

OK sounds good @greebie. I'll let @lintool and @ruebot review this. From a user perspective, works for me.

Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor suggestions, address and merge
I don't need to see it again.

@ianmilligan1
Copy link
Member

ianmilligan1 commented Nov 16, 2018

I got a failure running this on rho:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/geocities/warcs/indexed/GEOCITIES-20091029185858-00184-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/home/i2millig/results/new-gephi.gexf")

Did syntax change in the revisions or is there an issue? Error log is here but the tl;dr is:

scala.MatchError: ("http,((1042,110mb.com,20091029,9),None)) (of class scala.Tuple2)
        at io.archivesunleashed.app.WriteGraph$$anonfun$edgeNodes$5.apply(WriteGraph.scala:112)
        at io.archivesunleashed.app.WriteGraph$$anonfun$edgeNodes$5.apply(WriteGraph.scala:112)

The same file works with the older WriteGEXF(links, "/home/i2millig/results/new-gephi-old-command.gexf")

@greebie
Copy link
Contributor Author

greebie commented Nov 17, 2018

It looks like the problem is that on converting the edge tuples to ids, I did not escape the xml first. That's why the wonky "http caused problems. Can I borrow a copy of the geocities file so I can try to duplicate the problem and then test it for later versions?

- added xml escaping for edges.
- added test case for non-escaped edges.
@ianmilligan1
Copy link
Member

Am sending the WARC to you via Slack, @greebie.

As we've now had a number of fails on this, could you also do a close line-by-line read of your new WriteGraph function vs WriteGEXF and WriteGraphML - the latter two are quite robust, so we'll want to make sure that the new one is too. i.e. a checklist of what WriteGEXF does and then making sure those are mapped over.

@greebie
Copy link
Contributor Author

greebie commented Nov 19, 2018

The bigger problem is that WriteGEXFTest and WriteGraphmlTest did not test for unescaped xml in the result. It tested the xml escaping function, but not that it was applied in the result. All these errors should be caught at compile time.

WriteGraphTest now tests for escaped xml in the result, which should prevent this from happening again.

@greebie
Copy link
Contributor Author

greebie commented Nov 19, 2018

Ran the same code on the new graph and it worked (after a java heap space error, but that's not related to this).

I also loaded the graph into gephi and it worked with no errors.

@ianmilligan1
Copy link
Member

OK worked here too. Can you confirm that there's nothing missing in this WriteGraph function that is present in the other two?

@greebie
Copy link
Contributor Author

greebie commented Nov 19, 2018

I reviewed the code and it works as expected, based on my knowledge of the circumstances. I have one more push, however, to include some additional tests to cover potential future problems.

@ianmilligan1
Copy link
Member

Ok ready for me to review again on your end @greebie?

@greebie
Copy link
Contributor Author

greebie commented Nov 19, 2018

Yes. The tests should cover most of the situations now.

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and works well. Tested on a large collection with:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/tuna1/scratch/i2milligan/results/new-gephi.gexf")

and then ran again but swapped out last line with:

WriteGraph.asGraphml(links, "/tuna1/scratch/i2milligan/results/new-gephi.graphml")

Same results as using the original WriteGEXF(links, "/tuna1/scratch/i2milligan/results/new-gephi-old-command.gexf") command but without the long node IDs, more legible now.

Over to @ruebot for final review.

@ruebot
Copy link
Member

ruebot commented Nov 21, 2018

@greebie are you done adding code to this PR? We've tested this multiple times, done code review, and signed off, and things keep on getting added.

@greebie
Copy link
Contributor Author

greebie commented Nov 22, 2018

Done. Last commit was an attempt to improve coverage a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants