Markdown linting (#50)
* Markdown linting
* Fix typos, add line-length lint, and lint all current files.
ruebot authored Mar 29, 2020
1 parent eb07ea7 commit 6167c7b
Showing 16 changed files with 780 additions and 425 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/mdl-config.rb
@@ -1,6 +1,6 @@
all

exclude_rule 'MD013'
rule 'MD013', code_blocks: false, links: false, tables: false
exclude_rule 'MD024'
exclude_rule 'MD033'
exclude_rule 'MD036'
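
For context, MD013 is mdl's line-length rule. Rather than excluding it
outright, the new configuration keeps the check for prose (80 characters by
default) but tells it to skip code blocks, links, and tables. A style file
like this is typically passed to `mdl` on the command line; the invocation
below is only a sketch, and the config path and target directory are
assumptions based on this diff rather than the repository's actual CI command.

```shell
# Lint all Markdown files, using the shared style file from this commit.
# The style-file path is assumed from the diff above; adjust as needed.
mdl --style .github/workflows/mdl-config.rb .
```
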
66 changes: 51 additions & 15 deletions current/README.md
@@ -1,13 +1,28 @@
# The Archives Unleashed Toolkit: Latest Documentation

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.
The Archives Unleashed Toolkit is an open-source platform for analyzing web
archives built on [Apache Spark](http://spark.apache.org/), which provides
powerful tools for analytics and data processing.

This documentation is based on a cookbook approach, providing a series of
"recipes" for addressing a number of common analytics tasks to provide
inspiration for your own analysis. We generally provide examples for [resilient
distributed datasets
(RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in
Scala, and
[DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes)
in both Scala and Python. We leave it up to you to choose Scala or Python
flavours of Spark.

If you want to learn more about [Apache Spark](https://spark.apache.org/), we
highly recommend [Spark: The Definitive
Guide](http://shop.oreilly.com/product/0636920034957.do).

This documentation is based on a cookbook approach, providing a series of "recipes" for addressing a number of common analytics tasks to provide inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose Scala or Python flavours of Spark.

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do)
## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Our documentation is divided into several main sections, which cover the
Archives Unleashed Toolkit workflow from analyzing collections to understanding
and working with the results.

### Getting Started

@@ -24,7 +39,8 @@ Our documentation is divided into several main sections, which cover the Archive
- [Extract Different Subdomains](collection-analysis.md#Extract-Different-Subdomains)
- [Extract HTTP Status Codes](collection-analysis.md#Extract-HTTP-Status-Codes)
- [Extract the Location of the Resource in ARCs and WARCs](collection-analysis.md#Extract-the-Location-of-the-Resource-in-ARCs-and-WARCs)
- [**Text Analysis**](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md): How do I...
- [**Text Analysis**](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md)
: How do I...
- [Extract All Plain Text](text-analysis.md#Extract-All-Plain-Text)
- [Extract Plain Text Without HTTP Headers](text-analysis.md#Extract-Plain-Text-Without-HTTP-Headers)
- [Extract Plain Text By Domain](text-analysis.md#Extract-Plain-Text-By-Domain)
@@ -35,7 +51,8 @@ Our documentation is divided into several main sections, which cover the Archive
- [Extract Plain Text Filtered by Keyword](text-analysis.md#Extract-Plain-Text-Filtered-by-Keyword)
- [Extract Raw HTML](text-analysis.md#Extract-Raw-HTML)
- [Extract Named Entities](text-analysis.md#Extract-Named-Entities)
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**: How do I...
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**
: How do I...
- [Extract Simple Site Link Structure](link-analysis.md#Extract-Simple-Site-Link-Structure)
- [Extract Raw URL Link Structure](link-analysis.md#Extract-Raw-URL-Link-Structure)
- [Organize Links by URL Pattern](link-analysis.md#Organize-Links-by-URL-Pattern)
@@ -55,7 +72,7 @@ Our documentation is divided into several main sections, which cover the Archive
- [Extract Spreadsheet Information](binary-analysis.md#Extract-Spreadsheet-Information)
- [Extract Text File Information](binary-analysis.md#Extract-Text-Files-Information)
- [Extract Video Information](binary-analysis.md#Extract-Video-Information)
- [Extract Word Processor File Information](binary-analysis.md#Extract-Word-Processor-Files-Information)
- [Extract Word Processor File Information](binary-analysis.md#Extract-Word-Processor-Files-Information)

### Filtering Results

@@ -72,18 +89,37 @@ Our documentation is divided into several main sections, which cover the Archive

### What to do with Results

- **[What to do with DataFrame Results](df-results.md)**: A variety of User Defined Functions for filters that can be used on any DataFrame column.
- **[What to do with RDD Results](rdd-results.md)**: A variety of ways to filter RDD results
- **[What to do with DataFrame Results](df-results.md)**: A variety of User
Defined Functions for filters that can be used on any DataFrame column.
- **[What to do with RDD Results](rdd-results.md)**: A variety of ways to
filter RDD results.

## Further Reading

The following two articles provide an overview of the project:

+ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cultural Heritage_, 10(4), Article 22, 2017.
+ Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz. [The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives](https://arxiv.org/abs/2001.05399). 2020.
- Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable
Analytics Infrastructure for Exploring Web
Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on
Computing and Cultural Heritage_, 10(4), Article 22, 2017.
- Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz. [The Archives Unleashed
Project: Technology, Process, and Community to Improve Scholarly Access to
Web Archives](https://arxiv.org/abs/2001.05399). 2020.

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
This work is primarily supported by the [Andrew W. Mellon
Foundation](https://mellon.org/). Other financial and in-kind support comes
from the [Social Sciences and Humanities Research
Council](http://www.sshrc-crsh.gc.ca/), [Compute
Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research,
Innovation, and
Science](https://www.ontario.ca/page/ministry-research-innovation-and-science),
[York University Libraries](https://www.library.yorku.ca/web/), [Start Smart
Labs](http://www.startsmartlabs.com/), the [Faculty of
Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer
Science](https://cs.uwaterloo.ca/) at the [University of
Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those
of the researchers and do not necessarily reflect the views of the sponsors.
56 changes: 42 additions & 14 deletions current/aut-at-scale.md
@@ -1,43 +1,68 @@
# Using the Archives Unleashed Toolkit at Scale

As your collections grow, you may need to provide more resources, and adjust Apache Spark configuration options. Apache Spark has a great [Configuration](https://spark.apache.org/docs/latest/configuration.html), and [Tuning](https://spark.apache.org/docs/latest/tuning.html) guides that are worth checking out. If you're not sure where to start with scaling, join us in [Slack](slack.archivesunleashed.org) in the `#aut` channel, and we might be able to provide some guidance.
As your collections grow, you may need to provide more resources and adjust
Apache Spark configuration options. Apache Spark has great
[Configuration](https://spark.apache.org/docs/latest/configuration.html) and
[Tuning](https://spark.apache.org/docs/latest/tuning.html) guides that are
worth checking out. If you're not sure where to start with scaling, join us in
[Slack](http://slack.archivesunleashed.org) in the `#aut` channel, and we
might be able to provide some guidance.

- [A Note on Memory and Cores](#A-Note-on-Memory-and-Cores)
- [Reading Data from AWS S3](#Reading-Data-from-AWS-S3)

## A Note on Memory and Cores

As your datasets grow, you may need to provide more memory to Apache Spark. You'll know this if you get an error saying that you have run out of "Java Heap Space."
As your datasets grow, you may need to provide more memory to Apache Spark.
You'll know this if you get an error saying that you have run out of "Java Heap
Space."

You can add a [configuration](https://spark.apache.org/docs/latest/configuration.html) option for adjusting available memory like so:
You can add a
[configuration](https://spark.apache.org/docs/latest/configuration.html) option
for adjusting available memory like so:

```shell
$ spark-shell --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
spark-shell --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
```

In the above case, you give Apache Spark 4GB of memory to execute the program.

In some other cases, despite giving AUT sufficient memory, you may still encounter Java Heap Space issues. In those cases, it is worth trying to lower the number of worker threads. When running locally (i.e. on a single laptop, desktop, or server), by default AUT runs a number of threads equivalent to the number of cores in your machine.
In some other cases, despite giving AUT sufficient memory, you may still
encounter Java Heap Space issues. In those cases, it is worth trying to lower
the number of worker threads. When running locally (i.e. on a single laptop,
desktop, or server), by default AUT runs a number of threads equivalent to the
number of cores in your machine.

On a 16-core machine, you may want to drop to 12 cores if you are having memory issues. This will increase stability but decrease performance a bit.
On a 16-core machine, you may want to drop to 12 cores if you are having memory
issues. This will increase stability but decrease performance a bit.

You can do so like this (example is using 12 threads on a 16-core machine):

```shell
$ spark-shell --master local[12] --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
spark-shell --master local[12] --driver-memory 4G --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
```

If you continue to have errors, look at your output and logs. They will usually
point you in the right direction. For instance, you may also need to increase
the network timeout value. Once in a while, AUT might get stuck on an odd
record and take longer than normal to process it. Setting `--conf
spark.network.timeout=10000000` will ensure that AUT continues to work on
material, although it may take a while to process. This command then works:

```shell
$ spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/). This advanced functionality requires that you provide Spark shell with your AWS Access Key and AWS Secret Key, which you will get when creating your AWS credentials ([read more here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).
We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).
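
In practice, these keys are handed to the underlying Hadoop S3A connector
through the Spark context before any archives are loaded. The following is a
minimal sketch, assuming a running `spark-shell` session (where `sc` is
predefined), the `hadoop-aws` S3A connector on the classpath, and the standard
`fs.s3a.*` configuration properties; the placeholder values are not real
credentials.

```scala
// Sketch: supply AWS credentials to the S3A filesystem connector from spark-shell.
// "fs.s3a.access.key" and "fs.s3a.secret.key" are standard Hadoop S3A properties;
// replace the placeholders with the credentials created in your AWS console.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-aws-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-aws-secret-key>")
```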

This script, for example, will find the top ten domains from a set of WARCs found in an s3 bucket.
This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
@@ -55,7 +80,10 @@ RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example, you'll need an access key and secret, and additionally you'll need to define your endpoint.
We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally you'll need to define
your endpoint.
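
The endpoint is supplied the same way, via the Hadoop S3A configuration. A
minimal sketch follows, assuming the standard `fs.s3a.endpoint` property and a
placeholder gateway address; credentials are set as in the earlier sketch.

```scala
// Sketch: point the S3A connector at a non-AWS, S3-compatible endpoint
// (for example, a Ceph RADOS gateway). The hostname below is a placeholder.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://rados-gateway.example.org")
```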

```scala
import io.archivesunleashed._
@@ -72,11 +100,11 @@ RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.take(10)
```

### Troubleshooting S3
### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```
```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
