
Codify creation of standard derivatives into apps #195

Closed
lintool opened this issue Apr 9, 2018 · 3 comments

lintool commented Apr 9, 2018

According to Slack discussions with @ianmilligan1, creation of standard derivatives is still based on scripting AUT commands in the Spark shell. We should probably codify these into "apps" that can be called via spark-submit. This would also enable better end-to-end (e2e) integration testing.

@ianmilligan1 why don't we start with the simplest such script that you're building, and I can start by mocking up what it would look like as an app.
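
For concreteness, here's a minimal sketch of the shape I have in mind, using the full-text dump step as an example (the object name and argument handling here are just a strawman; the app creates its own SparkContext so it can be launched with spark-submit):

      import io.archivesunleashed._
      import io.archivesunleashed.matchbox._
      import org.apache.spark.{SparkConf, SparkContext}

      // Strawman: the full-text dump step wrapped as a standalone app instead of a
      // shell script. Input and output paths come from the command line.
      object PlainTextApp {
        def main(args: Array[String]): Unit = {
          val input = args(0)   // path to input WARCs
          val output = args(1)  // path for output derivative
          val sc = new SparkContext(new SparkConf().setAppName("PlainTextApp"))
          RecordLoader.loadArchives(input, sc)
            .keepValidPages()
            .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
            .saveAsTextFile(output)
          sc.stop()
        }
      }

Something along the lines of spark-submit --class PlainTextApp aut.jar /path/to/warcs /path/to/output (jar and path names hypothetical) would then run it outside the shell.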

ianmilligan1 commented

Thanks @lintool – this is a great idea.

Here's the script that we run for each basic AUK job, with the updated syntax for your refactored version of AUT.

      import io.archivesunleashed._
      import io.archivesunleashed.app._
      import io.archivesunleashed.matchbox._
      sc.setLogLevel("INFO")
      // Frequency list of domains across the collection
      RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("#{collection_derivatives}/all-domains/output")
      // Full-text dump: crawl date, domain, URL, and HTML-stripped content
      RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("#{collection_derivatives}/all-text/output")
      // Domain-to-domain link pairs (keeping those occurring more than five times), written as GraphML
      val links = RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
      WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")
      sys.exit

Basically, you'll see that it:

  1. launches (sets up the imports and log level)
  2. generates a frequency list of domains
  3. generates a full text dump
  4. generates a domain-to-domain hyperlink network in GraphML format

Is this enough, or do you want me to break this out more?


lintool commented May 13, 2018

It probably makes sense to break this down into three jobs:

  1. io.archivesunleashed.app.DomainFrequencyExtractor
  2. io.archivesunleashed.app.PlainTextExtractor
  3. io.archivesunleashed.app.DomainGraphExtractor

Let's use Scallop for command-line args, e.g., https://github.com/lintool/bespin/blob/master/src/main/scala/io/bespin/scala/spark/wordcount/WordCount.scala

We'll need test cases built against src/test/resources/warc/example.warc.gz.
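
As a very rough sketch, the first of these might look something like the following, with Scallop handling the args in the same style as the WordCount example above (option and class names here are placeholders, not final):

      import io.archivesunleashed._
      import io.archivesunleashed.matchbox._
      import org.apache.spark.{SparkConf, SparkContext}
      import org.rogach.scallop._

      // Placeholder arg definitions in the style of the bespin WordCount example.
      class Conf(args: Seq[String]) extends ScallopConf(args) {
        val input = opt[String](descr = "path to input WARCs", required = true)
        val output = opt[String](descr = "path for output", required = true)
        verify()
      }

      object DomainFrequencyExtractor {
        def main(argv: Array[String]): Unit = {
          val args = new Conf(argv)
          val sc = new SparkContext(new SparkConf().setAppName("DomainFrequencyExtractor"))
          // Same pipeline as the domain-frequency step in the script above.
          RecordLoader.loadArchives(args.input(), sc)
            .keepValidPages()
            .map(r => ExtractDomain(r.getUrl))
            .countItems()
            .saveAsTextFile(args.output())
          sc.stop()
        }
      }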

ianmilligan1 commented

Closed in #222
