
Document the command line app. #51

Merged 3 commits, Apr 7, 2020
1 change: 1 addition & 0 deletions current/README.md
@@ -86,6 +86,7 @@ and working with the results.
- [Create the Archives Unleashed Cloud Scholarly Derivatives](standard-derivatives.md#Create-the-Archives-Unleashed-Scholarly-Derivatives)
- [Extract Binary Info](standard-derivatives.md#Extract-Binary-Info)
- [Extract Binaries to Disk](standard-derivatives.md#Extract-Binaries-to-Disk)
- [Use the Toolkit with spark-submit](aut-spark-submit-app.md)

### What to do with Results

157 changes: 157 additions & 0 deletions current/aut-spark-submit-app.md
@@ -0,0 +1,157 @@
# Using the Toolkit with spark-submit

> **Reviewer comment:** Do these configuration options need to be used with a
> specific launch of the Toolkit (e.g. package, uber-jar, etc.)? At first
> glance, I'm a little unsure of where to start in terms of workflow: when
> would this script be introduced (e.g. used within or outside of
> `spark-shell`)?
The Toolkit offers a variety of extraction jobs with
[`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html).
These extraction jobs have a few configuration options, and in most cases the
analysis can use either an RDD or a DataFrame.

The extraction jobs follow this basic outline:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT_DIRECTORY --output OUTPUT_DIRECTORY
```

Additional flags include:

* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
options are `TEXT` (default) or `GEXF`.)
* `--df` (The extractor will use a DataFrame to carry out analysis.)
* `--split` (The extractor will put results for each input file in its own
directory. Each directory name will be the name of the ARC/WARC file parsed.)
* `--partition N` (The extractor will partition the RDD or DataFrame by N
  before writing results. This is useful for combining all results into a
  single file.)
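As an illustration of combining these flags, a run of the `DomainGraphExtractor` that writes a single GEXF file might look like the following sketch (the jar path and input/output paths are placeholders):

```shell
# Export a domain graph as GEXF, partitioned so results land in a single file.
# path/to/aut-fatjar.jar, /path/to/warcs/*, and output/path are placeholders.
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
  --extractor DomainGraphExtractor \
  --input /path/to/warcs/* \
  --output output/path \
  --output-format GEXF \
  --partition 1
```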

## Domain Frequency

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `domain` and `count`.

### RDD

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1
```

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Domain Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src_domain`, `dest_domain`, and `count`.

### RDD

**Note**: The RDD output is formatted slightly differently. The first three
columns are nested in a tuple:
`((CrawlDate, SourceDomain, DestinationDomain), Frequency)`
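Following the same pattern as the other extractors, the RDD commands would presumably be (a sketch; verify against the released jar):

```shell
# Directory of output files (RDD):
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path

# A single output file (RDD):
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --partition 1
```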

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Image Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.

**Note**: This extractor will only work with the DataFrame option.

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Plain Text

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `domain`, `url`, and `text`.

### RDD

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1
```

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```

## Web Pages

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `url`, `mime_type_web_server`,
`mime_type_tika`, `language`, and `content`.

**Note**: This extractor will only work with the DataFrame option.

### DataFrame

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --df --partition 1
```