
Document the command line app. #51

Merged 3 commits, Apr 7, 2020
1 change: 1 addition & 0 deletions current/README.md
@@ -86,6 +86,7 @@ and working with the results.
- [Create the Archives Unleashed Cloud Scholarly Derivatives](standard-derivatives.md#Create-the-Archives-Unleashed-Scholarly-Derivatives)
- [Extract Binary Info](standard-derivatives.md#Extract-Binary-Info)
- [Extract Binaries to Disk](standard-derivatives.md#Extract-Binaries-to-Disk)
- [Use the Toolkit with spark-submit](aut-spark-submit-app.md)

### What to do with Results

157 changes: 157 additions & 0 deletions current/aut-spark-submit-app.md
@@ -0,0 +1,157 @@
# Using the Toolkit with spark-submit

> **Reviewer comment:** Do these configuration options need to be used with a
> specific launch of the Toolkit (e.g. package, uber-jar, etc.)? At first
> glance, I'm a little unsure of where to start in terms of workflow: when
> would this script be introduced (e.g. used within or outside of
> `spark-shell`)?
The Toolkit offers a variety of extraction jobs with
[`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html).
These extraction jobs have a few configuration options, and in most cases the
analysis can use either an RDD or a DataFrame.

The extraction jobs follow this basic outline:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT_DIRECTORY --output OUTPUT_DIRECTORY
```

Additional flags include:

* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
options are `TEXT` (default) or `GEXF`.)
* `--df` (The extractor will use a DataFrame to carry out analysis.)
* `--split` (The extractor will put results for each input file in its own
directory. Each directory name will be the name of the ARC/WARC file parsed.)
* `--partition N` (The extractor will partition the RDD or DataFrame by N
  before writing results. This is useful for combining all results into a
  single file.)
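As an illustration of combining these flags, a run of the `DomainGraphExtractor` that writes a single GEXF file might look like the following sketch (the jar path and input/output paths are placeholders):

```shell
# Export a domain graph as GEXF, partitioned so results land in a single file.
# path/to/aut-fatjar.jar, /path/to/warcs/*, and output/path are placeholders.
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
  --extractor DomainGraphExtractor \
  --input /path/to/warcs/* \
  --output output/path \
  --output-format GEXF \
  --partition 1
```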

## Domain Frequency

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `domain` and `count`.

### RDD

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1
```

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Domain Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src_domain`, `dest_domain`, and `count`.

### RDD

**Note**: The RDD output is formatted slightly differently. The first three
columns are nested in a tuple:
`((CrawlDate, SourceDomain, DestinationDomain), Frequency)`
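Following the same pattern as the other extractors, the RDD commands would presumably be (a sketch; verify against the released jar):

```shell
# Directory of output files (RDD):
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path

# A single output file (RDD):
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --partition 1
```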

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Image Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.

**Note**: This extractor will only work with the DataFrame option.

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Plain Text

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `domain`, `url`, and `text`.

### RDD

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1
```

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Web Pages

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `url`, `mime_type_web_server`,
`mime_type_tika`, `language`, and `content`.

**Note**: This extractor will only work with the DataFrame option.

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```