Markdown linting (#50)
* Markdown linting
* Fix typos, add line-length lint, and lint all current files.
ruebot authored Mar 29, 2020
1 parent eb07ea7 commit 6167c7b
Showing 16 changed files with 780 additions and 425 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/mdl-config.rb
@@ -1,6 +1,6 @@
all

exclude_rule 'MD013'
rule 'MD013', code_blocks: false, links: false, tables: false
exclude_rule 'MD024'
exclude_rule 'MD033'
exclude_rule 'MD036'
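
For context, MD013 is mdl's line-length rule. Rather than excluding it
outright, the new configuration keeps the check for prose (80 characters by
default) but tells it to skip code blocks, links, and tables. A style file
like this is typically passed to `mdl` on the command line; the invocation
below is only a sketch, and the config path and target directory are
assumptions based on this diff rather than the repository's actual CI command.

```shell
# Lint all Markdown files, using the shared style file from this commit.
# The style-file path is assumed from the diff above; adjust as needed.
mdl --style .github/workflows/mdl-config.rb .
```
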
66 changes: 51 additions & 15 deletions current/README.md
@@ -1,13 +1,28 @@
# The Archives Unleashed Toolkit: Latest Documentation

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.
The Archives Unleashed Toolkit is an open-source platform for analyzing web
archives built on [Apache Spark](http://spark.apache.org/), which provides
powerful tools for analytics and data processing.

This documentation is based on a cookbook approach, providing a series of
"recipes" for addressing a number of common analytics tasks to provide
inspiration for your own analysis. We generally provide examples for [resilient
distributed datasets
(RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in
Scala, and
[DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes)
in both Scala and Python. We leave it up to you to choose Scala or Python
flavours of Spark.

If you want to learn more about [Apache Spark](https://spark.apache.org/), we
highly recommend [Spark: The Definitive
Guide](http://shop.oreilly.com/product/0636920034957.do).

This documentation is based on a cookbook approach, providing a series of "recipes" for addressing a number of common analytics tasks to provide inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose Scala or Python flavours of Spark.

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do)
## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Our documentation is divided into several main sections, which cover the
Archives Unleashed Toolkit workflow from analyzing collections to understanding
and working with the results.

### Getting Started

@@ -24,7 +39,8 @@ Our documentation is divided into several main sections, which cover the Archive
- [Extract Different Subdomains](collection-analysis.md#Extract-Different-Subdomains)
- [Extract HTTP Status Codes](collection-analysis.md#Extract-HTTP-Status-Codes)
- [Extract the Location of the Resource in ARCs and WARCs](collection-analysis.md#Extract-the-Location-of-the-Resource-in-ARCs-and-WARCs)
- [**Text Analysis**](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md): How do I...
- [**Text Analysis**](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md)
: How do I...
- [Extract All Plain Text](text-analysis.md#Extract-All-Plain-Text)
- [Extract Plain Text Without HTTP Headers](text-analysis.md#Extract-Plain-Text-Without-HTTP-Headers)
- [Extract Plain Text By Domain](text-analysis.md#Extract-Plain-Text-By-Domain)
@@ -35,7 +51,8 @@ Our documentation is divided into several main sections, which cover the Archive
- [Extract Plain Text Filtered by Keyword](text-analysis.md#Extract-Plain-Text-Filtered-by-Keyword)
- [Extract Raw HTML](text-analysis.md#Extract-Raw-HTML)
- [Extract Named Entities](text-analysis.md#Extract-Named-Entities)
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**: How do I...
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**
: How do I...
- [Extract Simple Site Link Structure](link-analysis.md#Extract-Simple-Site-Link-Structure)
- [Extract Raw URL Link Structure](link-analysis.md#Extract-Raw-URL-Link-Structure)
- [Organize Links by URL Pattern](link-analysis.md#Organize-Links-by-URL-Pattern)
@@ -55,7 +72,7 @@ Our documentation is divided into several main sections, which cover the Archive
- [Extract Spreadsheet Information](binary-analysis.md#Extract-Spreadsheet-Information)
- [Extract Text File Information](binary-analysis.md#Extract-Text-Files-Information)
- [Extract Video Information](binary-analysis.md#Extract-Video-Information)
- [Extract Word Processor File Information](binary-analysis.md#Extract-Word-Processor-Files-Information)
- [Extract Word Processor File Information](binary-analysis.md#Extract-Word-Processor-Files-Information)

### Filtering Results

@@ -72,18 +89,37 @@ Our documentation is divided into several main sections, which cover the Archive

### What to do with Results

- **[What to do with DataFrame Results](df-results.md)**: A variety of User Defined Functions for filters that can be used on any DataFrame column.
- **[What to do with RDD Results](rdd-results.md)**: A variety of ways to filter RDD results
- **[What to do with DataFrame Results](df-results.md)**: A variety of User
Defined Functions for filters that can be used on any DataFrame column.
- **[What to do with RDD Results](rdd-results.md)**: A variety of ways to
filter RDD results.

## Further Reading

The following two articles provide an overview of the project:

+ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cultural Heritage_, 10(4), Article 22, 2017.
+ Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz. [The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives](https://arxiv.org/abs/2001.05399). 2020.
- Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable
Analytics Infrastructure for Exploring Web
Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on
Computing and Cultural Heritage_, 10(4), Article 22, 2017.
- Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz. [The Archives Unleashed
Project: Technology, Process, and Community to Improve Scholarly Access to
Web Archives](https://arxiv.org/abs/2001.05399). 2020.

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
This work is primarily supported by the [Andrew W. Mellon
Foundation](https://mellon.org/). Other financial and in-kind support comes
from the [Social Sciences and Humanities Research
Council](http://www.sshrc-crsh.gc.ca/), [Compute
Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research,
Innovation, and
Science](https://www.ontario.ca/page/ministry-research-innovation-and-science),
[York University Libraries](https://www.library.yorku.ca/web/), [Start Smart
Labs](http://www.startsmartlabs.com/), the [Faculty of
Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer
Science](https://cs.uwaterloo.ca/) at the [University of
Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those
of the researchers and do not necessarily reflect the views of the sponsors.
56 changes: 42 additions & 14 deletions current/aut-at-scale.md
@@ -1,43 +1,68 @@
# Using the Archives Unleashed Toolkit at Scale

As your collections grow, you may need to provide more resources, and adjust Apache Spark configuration options. Apache Spark has a great [Configuration](https://spark.apache.org/docs/latest/configuration.html), and [Tuning](https://spark.apache.org/docs/latest/tuning.html) guides that are worth checking out. If you're not sure where to start with scaling, join us in [Slack](slack.archivesunleashed.org) in the `#aut` channel, and we might be able to provide some guidance.
As your collections grow, you may need to provide more resources and adjust
Apache Spark configuration options. Apache Spark has great
[Configuration](https://spark.apache.org/docs/latest/configuration.html) and
[Tuning](https://spark.apache.org/docs/latest/tuning.html) guides that are
worth checking out. If you're not sure where to start with scaling, join us in
[Slack](http://slack.archivesunleashed.org) in the `#aut` channel, and we
might be able to provide some guidance.

- [A Note on Memory and Cores](#A-Note-on-Memory-and-Cores)
- [Reading Data from AWS S3](#Reading-Data-from-AWS-S3)

## A Note on Memory and Cores

As your datasets grow, you may need to provide more memory to Apache Spark. You'll know this if you get an error saying that you have run out of "Java Heap Space."
As your datasets grow, you may need to provide more memory to Apache Spark.
You'll know this if you get an error saying that you have run out of "Java Heap
Space."

You can add a [configuration](https://spark.apache.org/docs/latest/configuration.html) option for adjusting available memory like so:
You can add a
[configuration](https://spark.apache.org/docs/latest/configuration.html) option
for adjusting available memory like so:

```shell
$ spark-shell --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
spark-shell --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
```

In the above case, you give Apache Spark 4GB of memory to execute the program.

In some other cases, despite giving AUT sufficient memory, you may still encounter Java Heap Space issues. In those cases, it is worth trying to lower the number of worker threads. When running locally (i.e. on a single laptop, desktop, or server), by default AUT runs a number of threads equivalent to the number of cores in your machine.
In some other cases, despite giving AUT sufficient memory, you may still
encounter Java Heap Space issues. In those cases, it is worth trying to lower
the number of worker threads. When running locally (i.e. on a single laptop,
desktop, or server), by default AUT runs a number of threads equivalent to the
number of cores in your machine.

On a 16-core machine, you may want to drop to 12 cores if you are having memory issues. This will increase stability but decrease performance a bit.
On a 16-core machine, you may want to drop to 12 cores if you are having memory
issues. This will increase stability but decrease performance a bit.

You can do so like this (example is using 12 threads on a 16-core machine):

```shell
$ spark-shell --master local[12] --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
spark-shell --master local[12] --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
```

If you continue to have errors, look at your output and logs. They will usually
point you in the right direction. For instance, you may also need to increase
the network timeout value. Once in a while, AUT might get stuck on an odd
record and take longer than normal to process it. Setting `--conf
spark.network.timeout=10000000` will ensure that AUT continues to work on
material, although it may take a while to process. This command then works:

```shell
$ spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/). This advanced functionality requires that you provide Spark shell with your AWS Access Key and AWS Secret Key, which you will get when creating your AWS credentials ([read more here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).
We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).
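
In practice, these keys are handed to the underlying Hadoop S3A connector
through the Spark context before any archives are loaded. The following is a
minimal sketch, assuming a running `spark-shell` session (where `sc` is
predefined), the `hadoop-aws` S3A connector on the classpath, and the standard
`fs.s3a.*` configuration properties; the placeholder values are not real
credentials.

```scala
// Sketch: supply AWS credentials to the S3A filesystem connector from spark-shell.
// "fs.s3a.access.key" and "fs.s3a.secret.key" are standard Hadoop S3A properties;
// replace the placeholders with the credentials created in your AWS console.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-aws-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-aws-secret-key>")
```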

This script, for example, will find the top ten domains from a set of WARCs found in an s3 bucket.
This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
@@ -55,7 +80,10 @@ RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example, you'll need an access key and secret, and additionally you'll need to define your endpoint.
We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally you'll need to define
your endpoint.
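
The endpoint is supplied the same way, via the Hadoop S3A configuration. A
minimal sketch follows, assuming the standard `fs.s3a.endpoint` property and a
placeholder gateway address; credentials are set as in the earlier sketch.

```scala
// Sketch: point the S3A connector at a non-AWS, S3-compatible endpoint
// (for example, a Ceph RADOS gateway). The hostname below is a placeholder.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://rados-gateway.example.org")
```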

```scala
import io.archivesunleashed._
@@ -72,11 +100,11 @@ RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.take(10)
```

### Troubleshooting S3
### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```
```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
