diff --git a/current/README.md b/current/README.md
index f6ff9e2..e56c1ef 100644
--- a/current/README.md
+++ b/current/README.md
@@ -86,6 +86,7 @@ and working with the results.
 - [Create the Archives Unleashed Cloud Scholarly Derivatives](standard-derivatives.md#Create-the-Archives-Unleashed-Scholarly-Derivatives)
 - [Extract Binary Info](standard-derivatives.md#Extract-Binary-Info)
 - [Extract Binaries to Disk](standard-derivatives.md#Extract-Binaries-to-Disk)
+- [Use the Toolkit with spark-submit](aut-spark-submit-app.md)
 
 ### What to do with Results
 
diff --git a/current/aut-spark-submit-app.md b/current/aut-spark-submit-app.md
new file mode 100644
index 0000000..78b3231
--- /dev/null
+++ b/current/aut-spark-submit-app.md
@@ -0,0 +1,128 @@
+# Using the Toolkit with spark-submit
+
+The Toolkit offers a variety of extraction jobs with `spark-submit`. These
+extraction jobs have a few configuration options, and analysis can use RDD or
+DataFrame in most cases.
+
+The extraction jobs have a basic outline of:
+
+`spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT_DIRECTORY --output OUTPUT_DIRECTORY`
+
+Additional flags include:
+
+* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
+  options are `TEXT` (default) or `GEXF`.)
+* `--df` (The extractor will use a DataFrame to carry out analysis.)
+* `--split` (The extractor will put results for each input file in its own
+  directory. Each directory name will be the name of the ARC/WARC file parsed.)
+* `--partition N` (The extractor will partition RDD or DataFrame according to N
+  before writing results. This is useful to combine all the results to a single
+  file.)
+
+## Domain Frequency
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `domain` and `count`.
+
+### RDD
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1`
+
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+## Domain Graph
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `src_domain`, `dest_domain`, and `count`.
+
+### RDD
+
+**Note**: The RDD output is formatted slightly differently. The first three
+columns are grouped in a tuple:
+`((CrawlDate, SourceDomain, DestinationDomain), Frequency)`
+
+### DataFrame
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+## Image Graph
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.
+
+**Note**: This extractor will only work with the DataFrame option.
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+
+## Plain Text
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `domain`, `url`, and `text`.
+
+### RDD
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1`
+
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
+
+## Web Pages
+
+This extractor outputs a directory of CSV files or a single CSV file with the
+following columns: `crawl_date`, `url`, `mime_type_web_server`,
+`mime_type_tika`, `language`, and `content`.
+
+**Note**: This extractor will only work with the DataFrame option.
+
+### DataFrame
+
+Directory of CSV files:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df`
+
+A single CSV file:
+
+* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
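
Every command in the added page follows the same outline, so the repetition can be factored into a small helper. The following is a minimal sketch and not part of the patch: `build_aut_cmd` and the `AUT_JAR` variable are hypothetical names chosen for illustration, and the function only prints the command line rather than running it. The class name and flags are the ones listed in the page.

```shell
#!/bin/sh
# Hypothetical helper: AUT_JAR and build_aut_cmd are illustrative names only.
# The runner class and the flags are taken from the documentation page.
AUT_JAR="${AUT_JAR:-path/to/aut-fatjar.jar}"

# Usage: build_aut_cmd EXTRACTOR INPUT OUTPUT [extra flags...]
# Prints the spark-submit command line so it can be reviewed before running.
# The body is wrapped in ( ) so its variables do not leak into the caller.
build_aut_cmd() (
  extractor="$1"
  input="$2"
  output="$3"
  shift 3

  cmd="spark-submit --class io.archivesunleashed.app.CommandLineAppRunner"
  cmd="$cmd $AUT_JAR --extractor $extractor --input $input --output $output"

  # Append any optional flags, such as --df, --split, or --partition N.
  for flag in "$@"; do
    cmd="$cmd $flag"
  done

  printf '%s\n' "$cmd"
)

# Example: single-CSV DataFrame output for the domain frequency job.
build_aut_cmd DomainFrequencyExtractor '/path/to/warcs/*' output/path --df --partition 1
```

Printing instead of executing keeps the sketch safe to try without a Spark installation; the printed line can be pasted into a terminal once it looks right.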