Collecting and sharing mailing list archives

Niels ten Oever edited this page Mar 17, 2018 · 6 revisions

This page collects documentation on how to use and contribute to repositories of large mailing list archives.

Available archives

Archives may not be publicly accessible (for privacy reasons, or to avoid large data transfer costs). If you're doing research with them, reach out to the listed contact, who can provide access.

full ietf-archives; Contact: @nllz

If you have collected a set of mailing list archives for some organization, feel free to add them here.

git-lfs

Mailing list archive repositories use git-lfs to store .mail files and other large files. First, install git-lfs, either with brew install git-lfs or from the latest release for your operating system.

Configure git-lfs with this command:

git lfs install --force --skip-smudge

Plain git lfs install also works, but --skip-smudge keeps git from downloading the large files during the smudge step of a clone or checkout, which lets the first clone finish in a reasonable amount of time.

GitHub runs a Git LFS server, but the storage and download bandwidth cost extra, so don't run these steps on a whim.

git-lfs tracking

Currently, git-lfs is used for:

.mbox
.mail
.txt.gz

If you are going to collect any other types of mail archive files, track them with git-lfs before you commit any changes. .gitattributes contains the current list and is checked in to the repository. To track a new file extension (here, .bin as an example), run:

git lfs track "*.bin"

(If you don't do this, the files have to be migrated afterwards, which involves rewriting history and force pushing to avoid keeping duplicate copies in the .git directory, so let's try to avoid that.)
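git lfs track records each pattern as a line in .gitattributes of the form `*.bin filter=lfs diff=lfs merge=lfs -text`. As a minimal sketch, the check below tests whether a pattern is already listed before you commit. The function name is_lfs_tracked is hypothetical, and it does a plain text match on the pattern rather than evaluating gitattributes glob rules.

```shell
# is_lfs_tracked PATTERN [ATTRIBUTES_FILE]
# Succeeds if ATTRIBUTES_FILE (default: .gitattributes) contains a line whose
# first field is exactly PATTERN and which sets filter=lfs.
# Hypothetical helper: plain text match only, no glob evaluation.
is_lfs_tracked() {
  [ -f "${2:-.gitattributes}" ] || return 1
  awk -v p="$1" '$1 == p && /filter=lfs/ { found = 1 } END { exit !found }' \
    "${2:-.gitattributes}"
}

# Example use from the repository root:
#   is_lfs_tracked "*.bin" || git lfs track "*.bin"
```

Remember to commit the updated .gitattributes together with the first files of the new type.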

Clone

Make sure you have enough disk space. Currently, the ietf-archives repository requires about 25 gigabytes; that number is likely to increase.
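A quick way to check free space before cloning, as a rough sketch: the function name has_free_space is hypothetical, and the 25 used in the example is just the figure quoted above, which may be out of date.

```shell
# has_free_space DIR REQUIRED_GB
# Succeeds if the filesystem holding DIR has at least REQUIRED_GB free
# (counted in units of 1024*1024 kilobytes).
has_free_space() {
  # df -Pk prints POSIX-format output in 1024-byte blocks; on the second
  # line, field 4 is the available space.
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  required_kb=$(( $2 * 1024 * 1024 ))
  [ "$avail_kb" -ge "$required_kb" ]
}

# Example use:
#   has_free_space . 25 || echo "not enough disk space for ietf-archives" >&2
```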

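Clone the repository as usual; there are a lot of files, but they should all be very small, because with the smudge step skipped the large files are checked out as small git-lfs pointer files.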

Next, run git lfs fetch. This downloads all the mail archive files, which can take a while. Then run git lfs checkout to put the large files in place in the working tree; this takes just a few minutes.
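Put together, the clone steps can be sketched as one function. The repository URL is not given on this page, so it is left as an argument. The name fetch_archives is hypothetical, and the sketch assumes git-lfs has already been installed and configured with --skip-smudge.

```shell
# fetch_archives REPO_URL TARGET_DIR
# Clones the repository (pointer files only, given --skip-smudge), then
# downloads the large files and places them in the working tree.
# Hypothetical wrapper; stops at the first step that fails.
fetch_archives() {
  git clone "$1" "$2" || return 1
  (
    # Subshell so the caller's working directory is left unchanged.
    cd "$2" || exit 1
    git lfs fetch || exit 1   # slow: downloads all the mail archive files
    git lfs checkout          # fast: replaces pointer files with real content
  )
}

# Example use (substitute the real repository URL):
#   fetch_archives <repository-url> ietf-archives
```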

Running a crawl yourself

This can take a day or so for large groups of mailing lists (like IETF or W3C), so run it on a separate server, rather than your laptop.

Start with a plaintext file that has a list of mailing list archive URLs; for example, ietf_lists_normalized.txt contains the known list of mailing list archive pages (not all of them are hosted on ietf.org, and this list probably needs to be updated). Open a screen session, and activate the appropriate bigbang environment. Then run a command like the following from the bigbang directory:

stdbuf -o 0 python bin/collect_mail.py -f ../ietf-archives/ietf_lists_normalized.txt --archives ../ietf-archives/ |& tee -a ../ietf-archives/log/collect-YYYYMMDD.log

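That collects all the files into the separate ietf-archives directory and appends the output, including errors, to the log file while still printing it to the terminal. Replace YYYYMMDD with the date of the crawl.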
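To avoid typing the date by hand, the log file name can be generated. This is a small sketch: the function name crawl_log_path is hypothetical, and it only assumes the log/collect-YYYYMMDD.log layout used in the command above.

```shell
# crawl_log_path ARCHIVE_DIR
# Prints ARCHIVE_DIR/log/collect-YYYYMMDD.log using today's date.
# A trailing slash on ARCHIVE_DIR is dropped to avoid a double slash.
crawl_log_path() {
  printf '%s/log/collect-%s.log\n' "${1%/}" "$(date +%Y%m%d)"
}

# Example use with the crawl command (|& requires bash 4 or later):
#   stdbuf -o 0 python bin/collect_mail.py \
#     -f ../ietf-archives/ietf_lists_normalized.txt \
#     --archives ../ietf-archives/ |& tee -a "$(crawl_log_path ../ietf-archives/)"
```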