Document the command line app.
- Resolves #14
- Documents archivesunleashed/aut#431
ruebot committed Apr 7, 2020
1 parent 6167c7b commit e234eb0
Showing 2 changed files with 129 additions and 0 deletions.
1 change: 1 addition & 0 deletions current/README.md
@@ -86,6 +86,7 @@ and working with the results.
- [Create the Archives Unleashed Cloud Scholarly Derivatives](standard-derivatives.md#Create-the-Archives-Unleashed-Scholarly-Derivatives)
- [Extract Binary Info](standard-derivatives.md#Extract-Binary-Info)
- [Extract Binaries to Disk](standard-derivatives.md#Extract-Binaries-to-Disk)
- [Use the Toolkit with spark-submit](aut-spark-submit-app.md)

### What to do with Results

128 changes: 128 additions & 0 deletions current/aut-spark-submit-app.md
@@ -0,0 +1,128 @@
# Using the Toolkit with spark-submit

The Toolkit offers a variety of extraction jobs that can be run with
`spark-submit`. These jobs have a few configuration options, and in most cases
the analysis can use either RDDs or DataFrames.

The extraction jobs follow this basic outline:

`spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT_DIRECTORY --output OUTPUT_DIRECTORY`
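
For readability, the outline above can be split across lines. The following is a minimal sketch, not part of the Toolkit: `aut_command` is a hypothetical helper, and the JAR path, output directory, and file names are placeholders. It only prints the command (a dry run) so the result can be checked before anything is submitted.

```shell
# Placeholder path to the Toolkit fatjar (assumption: adjust to your setup).
AUT_JAR="path/to/aut-fatjar.jar"

# Hypothetical helper, not part of the Toolkit: prints a spark-submit command.
# Usage: aut_command EXTRACTOR OUTPUT_DIR INPUT_FILE...
aut_command() {
  extractor="$1"
  output="$2"
  shift 2
  echo spark-submit \
    --class io.archivesunleashed.app.CommandLineAppRunner \
    "$AUT_JAR" \
    --extractor "$extractor" \
    --input "$@" \
    --output "$output"
}

# Dry run with two placeholder file names.
aut_command DomainFrequencyExtractor output/path a.warc.gz b.warc.gz
```

To submit the job instead of printing it, remove `echo` from the helper. Passing an unquoted `/path/to/warcs/*` as the last argument lets the shell expand the glob into the list of input files, as in the one-line form.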

Additional flags include:

* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
options are `TEXT` (default) or `GEXF`.)
* `--df` (The extractor will use a DataFrame to carry out analysis.)
* `--split` (The extractor will put results for each input file in its own
directory. Each directory name will be the name of the ARC/WARC file parsed.)
* `--partition N` (The extractor will partition the RDD or DataFrame according to N
before writing results. This is useful for combining all the results into a single
file.)
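
The optional flags can be combined on one command line. As a small sketch, the hypothetical helper `aut_flags` below (not part of the Toolkit) turns three simple settings into a flag string. Only the flag names come from the list above; the helper and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper, not part of the Toolkit.
# Usage: aut_flags USE_DF SPLIT PARTITIONS
#   USE_DF, SPLIT: "yes" to add --df / --split
#   PARTITIONS:    a number for --partition, or "" to omit it
aut_flags() {
  use_df="$1"
  split="$2"
  partitions="$3"
  flags=""
  if [ "$use_df" = "yes" ]; then flags="$flags --df"; fi
  if [ "$split" = "yes" ]; then flags="$flags --split"; fi
  if [ -n "$partitions" ]; then flags="$flags --partition $partitions"; fi
  # Strip the leading space left by the first appended flag.
  echo "${flags# }"
}

aut_flags yes no 1   # prints: --df --partition 1
```

The printed string can be appended to any of the `spark-submit` commands in this document.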

## Domain Frequency

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `domain` and `count`.

### RDD

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1`


### DataFrame

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
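
If a job was run without `--partition 1`, the result is a directory of files. Assuming Spark's usual `part-*` file naming and that the files carry no header row (both assumptions worth checking against your own output), the pieces can be concatenated afterwards. `merge_parts` is a hypothetical helper, not part of the Toolkit, and the rows in the demonstration are made up.

```shell
# Hypothetical helper, not part of the Toolkit: concatenate part files.
# Usage: merge_parts OUTPUT_DIR MERGED_FILE
merge_parts() {
  out_dir="$1"
  merged="$2"
  cat "$out_dir"/part-* > "$merged"
}

# Self-contained demonstration with made-up rows in a temporary directory.
demo_dir="$(mktemp -d)"
printf 'example.com,2\n' > "$demo_dir/part-00000"
printf 'example.org,1\n' > "$demo_dir/part-00001"
merge_parts "$demo_dir" "$demo_dir/domains.csv"
cat "$demo_dir/domains.csv"
```

If the part files do turn out to contain a header row, it would be repeated once per file in the merged result, in which case `--partition 1` is the simpler route.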

## Domain Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src_domain`, `dest_domain`, and `count`.

### RDD

**Note**: The RDD output is formatted slightly differently. The first three
columns are grouped together in a nested tuple:
`((CrawlDate, SourceDomain, DestinationDomain), Frequency)`
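
To bring the RDD output in line with the four-column layout, the parentheses can be stripped with `sed`. This is a rough sketch that assumes each line has exactly the shape shown above, with or without spaces after the commas, and that no field itself contains `),`. `flatten_domain_graph` is a hypothetical helper, not part of the Toolkit, and the sample line is made up.

```shell
# Hypothetical helper, not part of the Toolkit: reads tuple-style lines on
# stdin and writes comma-separated lines on stdout.
flatten_domain_graph() {
  sed -E \
    -e 's/^\(\(//' \
    -e 's/\),[[:space:]]*/,/' \
    -e 's/\)$//' \
    -e 's/,[[:space:]]+/,/g'
}

# Made-up sample line in the shape described above.
echo '((20200407,example.com,example.org),5)' | flatten_domain_graph
# prints: 20200407,example.com,example.org,5
```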

### DataFrame

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1`

## Image Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.

**Note**: This extractor will only work with the DataFrame option.

### DataFrame

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1`


## Plain Text

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `domain`, `url`, and `text`.

### RDD

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1`


### DataFrame

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df --partition 1`

## Web Pages

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `url`, `mime_type_web_server`,
`mime_type_tika`, `language`, and `content`.

**Note**: This extractor will only work with the DataFrame option.

### DataFrame

Directory of CSV files:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df`

A single CSV file:

* `spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df --partition 1`
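
Since every extractor in this document has a DataFrame mode, one way to run them all over the same collection is a simple loop. The sketch below only prints the commands. `aut_all_commands` is a hypothetical helper, not part of the Toolkit, and writing each result to a subdirectory named after its extractor is an assumption made for illustration.

```shell
# Hypothetical helper, not part of the Toolkit: print one spark-submit
# command per extractor, each writing to its own output subdirectory.
# Usage: aut_all_commands AUT_JAR INPUT OUTPUT_ROOT
aut_all_commands() {
  jar="$1"
  input="$2"
  out_root="$3"
  for extractor in DomainFrequencyExtractor DomainGraphExtractor \
      ImageGraphExtractor PlainTextExtractor WebPagesExtractor; do
    echo "spark-submit --class io.archivesunleashed.app.CommandLineAppRunner $jar --extractor $extractor --input $input --output $out_root/$extractor --df"
  done
}

# Dry run; the glob is quoted so that it is printed rather than expanded.
aut_all_commands path/to/aut-fatjar.jar '/path/to/warcs/*' output/path
```

After checking the printed commands, piping them to `sh` would run the jobs one after another.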
