
CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439

Closed
ianmilligan1 opened this issue Apr 9, 2020 · 2 comments

ianmilligan1 commented Apr 9, 2020

Describe the bug
The CommandLineApp DomainGraphExtractor produces different node ID types than running WriteGraph directly through the Spark shell. They should be the same.

To Reproduce
The following command-line invocation (with both the DF and RDD extractors):

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1

creates an output file that looks like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />
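The 32-character hex IDs above appear to be MD5 digests of the node labels (a later comment notes that this path uses the ComputeMD5 matchbox utility). A minimal sketch of that style of content-derived ID, assuming the ID is simply the MD5 hex digest of the label; `md5Hex` is an illustrative name, not aut's actual API:

```scala
import java.security.MessageDigest

// Hypothetical helper: MD5 hex digest of a node label.
// Deterministic across runs, so the same domain always gets the same ID.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString

// Always a 32-character lowercase hex string.
println(md5Hex("facebook.com"))
```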

Conversely, if we run this script as per aut-docs:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/users/ianmilligan1/desktop/script-gexf.gexf")

We get an output that looks like:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />

Expected behavior
The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, node IDs as hashes are superior to node IDs as sequential integers.

Environment information

  • AUT version: Most recent master
  • OS: macOS 10.15.4
  • Java version: Java 8
  • Apache Spark version: 2.4.4
  • Apache Spark w/aut: w/ --jars
  • Apache Spark command used to run AUT: see above
ruebot commented Apr 13, 2020

Do we have a documented rationale for why we have so many write options for graphs? Currently, we have the following. Do we really need all of these? I'd argue that, at the very least, we can remove WriteGraph, since it is redundant.

WriteGraph

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/home/nruest/Projects/au/sample-data/issue-439/writegraph.gexf")

WriteGEXF

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGEXF(links, "/home/nruest/Projects/au/sample-data/issue-439/writegexf.gexf")

These two scripts produce the same thing, other than the issue raised here. So, I'm going to open up a PR where we just rip it all out. AUK will need to be updated for the next release, as will all the documentation.

$ wc -l *              
  29186 writegexf.gexf
  29186 writegraph.gexf
  58372 total

ianmilligan1 (Member Author) commented
For context, issue #289 - way back in November 2018 (!) - discusses the rationale for having both. Basically, I think the only difference is that WriteGraph uses zipWithUniqueId while WriteGexf and WriteGraphml use ComputeMD5. There are pros and cons: WriteGraph is slower (@greebie estimated 10-15% slower), whereas the MD5-based writers carry the chance of a hash collision.
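A Spark-free sketch of the two ID schemes being compared, assuming the behavior described above; `Seq.zipWithIndex` stands in for the RDD's `zipWithUniqueId`, and the names here are illustrative rather than aut's actual API:

```scala
import java.security.MessageDigest

val labels = Seq("facebook.com", "web.net", "PetitionOnline.com")

// Scheme 1 (WriteGraph): positional IDs, analogous to RDD.zipWithUniqueId.
// IDs depend on element order/partitioning, so they can differ between runs.
val positionalIds = labels.zipWithIndex.map { case (label, i) => (i.toString, label) }

// Scheme 2 (WriteGexf / WriteGraphml): content-derived IDs via MD5 of the label.
// IDs are deterministic across runs, at the (tiny) risk of a hash collision.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString

val hashedIds = labels.map(label => (md5Hex(label), label))
```

The trade-off mirrors the comment above: positional IDs avoid collisions entirely but are not stable across runs, while hashed IDs are stable but not collision-proof.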

Apologies, I should have looked this up earlier; I didn't think we had these functions living in parallel, but they are both there. We should certainly remove one.

I have no strong feelings about which one we keep. Part of me thinks that MD5 collisions are exceedingly rare (see, e.g., this StackOverflow answer), but I'm also a historian, so I'd defer to others' thoughts.

FWIW, I think we could also delete WriteGraphXML - it looks to be a product of the GraphX experiments we were doing 2-3 years ago? reference
