CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
Do we have a documented rationale for why we have so many write options for graphs? Currently, we have several, including `WriteGraph` and `WriteGEXF`. Do we really need all of these? I'd argue that, at the very least, we can remove one of the two, since they produce the same output:
```scala
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/home/nruest/Projects/au/sample-data/issue-439/writegraph.gexf")
```
```scala
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGEXF(links, "/home/nruest/Projects/au/sample-data/issue-439/writegexf.gexf")
```

These two scripts produce the same thing, other than the issue raised here. So, I'm going to open up a PR where we just rip it all out. AUK will need to be updated for the next release, as will all the documentation.

```
$ wc -l *
29186 writegexf.gexf
29186 writegraph.gexf
58372 total
```
For context, issue #289, from way back in November 2018 (!), discusses the rationale behind having both. Basically, I think the only difference is the node ID scheme (hashes versus sequential integers).

Apologies, I should have looked this up before; I didn't think we had these functions running in parallel, but they're both there. We should certainly kill one. I have no strong feelings about which we keep. Part of me thinks that MD5 collisions are very rare (per this random StackOverflow answer), but I'm also a historian, so I'd defer to others' thoughts. FWIW, I think we could also delete
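To make the trade-off concrete, here is a minimal sketch of the two node ID schemes under discussion: MD5-hashed labels versus sequential integers. This is an illustration of the general idea, not aut's actual implementation; the object and method names are hypothetical.

```scala
import java.security.MessageDigest

// Hypothetical sketch of the two node ID schemes discussed in this issue.
// Not aut's actual code; names here are illustrative only.
object NodeIds {
  // Hash-based ID: stable for a given domain label across runs.
  def md5Id(domain: String): String =
    MessageDigest.getInstance("MD5")
      .digest(domain.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  // Sequential IDs depend on ordering, so the same domain can receive a
  // different ID on a different run or a different partitioning.
  def sequentialIds(domains: Seq[String]): Map[String, Int] =
    domains.distinct.zipWithIndex.toMap
}
```

The collision concern raised above applies only to the hash scheme; the sequential scheme trades that (very small) risk for IDs that are not stable across runs.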
**Describe the bug**

The output of the `CommandLineApp` `DomainGraphExtractor` creates different node ID types than running `WriteGraph` directly through the Spark shell. They should be the same.

**To Reproduce**
The following command (with both DF and RDD):

```shell
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1
```

creates an output file that looks like:

Conversely, if we run this script as per aut-docs:

We get an output that looks like:
**Expected behavior**

The output of `DomainGraphExtractor` is preferable to the `WriteGraph` output. In other words, the nodes as hashes are superior to the nodes as ID #s.

**Environment information**
`--jars`