Document the command line app.

- Resolves #14
- Documents archivesunleashed/aut#431
archivesunleashed · Apr 7, 2020 · e234eb0
1 parent 6167c7b
commit e234eb0

Showing 2 changed files with 129 additions and 0 deletions.
diff --git a/current/README.md b/current/README.md
@@ -86,6 +86,7 @@ and working with the results.
 - [Create the Archives Unleashed Cloud Scholarly Derivatives](standard-derivatives.md#Create-the-Archives-Unleashed-Scholarly-Derivatives)
 - [Extract Binary Info](standard-derivatives.md#Extract-Binary-Info)
 - [Extract Binaries to Disk](standard-derivatives.md#Extract-Binaries-to-Disk)
+- [Use the Toolkit with spark-submit](aut-spark-submit-app.md)
 
 ### What to do with Results
 

diff --git a/current/aut-spark-submit-app.md b/current/aut-spark-submit-app.md
@@ -0,0 +1,128 @@
+# Using the Toolkit with spark-submit
+
+The Toolkit offers a variety of extraction jobs with `spark-submit`. These
+extraction jobs have a few configuration options, and analysis can use RDD or
+DataFrame in most cases.
+
+The extraction jobs have a basic outline of:
+
+`spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT_DIRECTORY --output OUTPUT_DIRECTORY`
+
+Additional flags include:
+
+* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
+  options are `TEXT` (default) or `GEXF`.)
+* `--df` (The extractor will use a DataFrame to carry out analysis.)
+* `--split` (The extractor will put results for each input file in its own
+  directory. Each directory name will be the name of the ARC/WARC file parsed.)
+* `--partition N` (The extractor will partition the RDD or DataFrame according
+  to N before writing results. This is useful for combining all the results
+  into a single file.)
+
+## Domain Frequency
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `domain` and `count`.
+
+### RDD
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1`
+
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+## Domain Graph
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `src_domain`, `dest_domain`, and `count`.
+
+### RDD
+
+**Note**: The RDD output is formatted slightly differently. The first three
+columns are grouped into a nested tuple:
+`((CrawlDate, SourceDomain, DestinationDomain), Frequency)`
+
+### DataFrame
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+## Image Graph
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.
+
+**Note**: This extractor will only work with the DataFrame option.
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+
+## Plain Text
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `domain`, `url`, and `text`.
+
+### RDD
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1`
+
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+## Web Pages
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `url`, `mime_type_web_server`,
+`mime_type_tika`, `language`, and `content`.
+
+**Note**: This extractor will only work with the DataFrame option.
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
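
Every command in the added document follows the same outline, so a small shell helper can make the pattern explicit. The sketch below is hypothetical and not part of the Toolkit: the `aut_cmd` function name and its argument order are assumptions made for illustration. It only prints the command rather than running it, and it relies solely on the flags listed in the document (`--extractor`, `--input`, `--output`, `--df`, `--partition`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Toolkit): prints the spark-submit
# command for a given extractor instead of running it, so it can be reviewed.
# Usage: aut_cmd JAR EXTRACTOR INPUT OUTPUT [extra flags...]
aut_cmd() {
  local jar="$1" extractor="$2" input="$3" output="$4"
  shift 4
  # Any remaining arguments (e.g. --df, --partition 1) are passed through as-is.
  echo spark-submit \
    --class io.archivesunleashed.app.CommandLineAppRunner \
    "$jar" \
    --extractor "$extractor" \
    --input "$input" \
    --output "$output" \
    "$@"
}

# Example: DataFrame run of the Web Pages extractor, combined into one CSV file.
# The input glob is quoted so the pattern is printed rather than expanded here.
aut_cmd path/to/aut-fatjar.jar WebPagesExtractor '/path/to/warcs/*' output/path --df --partition 1
```

The example call prints a command of the same form as the "single CSV file" entry under Web Pages. Dropping `--partition 1` yields the "directory of CSV files" variant, and dropping `--df` yields the RDD variant for extractors that support it.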