Collecting and sharing mailing list archives
This page collects documentation on how to use and contribute to repositories of large mailing list archives.
Archives may not be publicly accessible (for privacy reasons, or to avoid large data transfer costs), but if you're doing research on this, just reach out to the listed contact and they'll provide access.
- IETF:
  - ietf-archives repository; Contact: @npdoty
  - full ietf-archives; Contact: @nllz
- ICANN: Contact @nllz (http://103.104.244.5/ICANNmailMarch2018.tar.xz)
- W3C: Contact @npdoty
If you have collected a set of mailing list archives for some organization, feel free to add them here.
Mailing list archive repositories use git-lfs to store the contents of the .mail and other large files. So first, install git-lfs, either with brew install git-lfs or with the latest release for your operating system.
Configure git-lfs with this command:
git lfs install --force --skip-smudge
You can also just run git lfs install, but to make the first clone finish in a reasonable amount of time, you don't want git lfs to run as part of the smudge step.
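To confirm that the smudge step is being skipped, one option is to inspect the global Git configuration. The sketch below assumes that git lfs install --skip-smudge records a --skip flag in the filter.lfs.smudge setting, which is how git-lfs releases appear to behave; the function names are illustrative and not part of any existing tool.

```python
import subprocess


def smudge_is_skipped(config_value: str) -> bool:
    """Return True if a filter.lfs.smudge value contains the --skip flag.

    Assumption: `git lfs install --skip-smudge` writes a value such as
    'git-lfs smudge --skip -- %f'.
    """
    return "--skip" in config_value.split()


def read_global_smudge_setting() -> str:
    """Read filter.lfs.smudge from the global git config ('' if unavailable)."""
    try:
        result = subprocess.run(
            ["git", "config", "--global", "--get", "filter.lfs.smudge"],
            capture_output=True,
            text=True,
            check=False,  # git exits non-zero when the key is unset
        )
    except FileNotFoundError:
        # git is not installed or not on PATH
        return ""
    return result.stdout.strip()


if __name__ == "__main__":
    value = read_global_smudge_setting()
    if smudge_is_skipped(value):
        print("smudge is skipped; clones will leave LFS pointer files in place")
    else:
        print("smudge is NOT skipped; clones may download every large file")
```

If the check reports that smudge is not skipped, rerunning the configuration command above before cloning should be enough.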
GitHub runs a Git LFS server, but the storage and download bandwidth cost extra, so don't run these steps on a whim.
Currently, git-lfs is used for:
- .mbox
- .mail
- .txt.gz
If you are going to collect any other mail archive files, you need to track them using git-lfs before you commit any changes. The .gitattributes file contains the current list and is checked in to the repository. To track a new file extension (.bin is used here as an example), run:
git lfs track "*.bin"
(If you don't do this, migration is required, which involves rewriting history and force pushing to avoid keeping duplicate copies in the .git directory, so let's try to avoid that.)
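Before committing a new kind of archive file, it can help to check whether .gitattributes already routes it through git-lfs. The following is a minimal sketch: it assumes the entries have the form that git lfs track writes (a pattern followed by filter=lfs and related attributes) and uses simple shell-style matching on the file name, which only approximates Git's own pattern rules. The sample contents and function names are illustrative.

```python
from fnmatch import fnmatchcase
from typing import List


def lfs_patterns(gitattributes_text: str) -> List[str]:
    """Return the path patterns that .gitattributes routes through git-lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def is_tracked(filename: str, patterns: List[str]) -> bool:
    """Approximate check: does the file name match any LFS pattern?"""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# Assumed sample contents, mirroring the extensions listed above.
SAMPLE = """\
*.mbox filter=lfs diff=lfs merge=lfs -text
*.mail filter=lfs diff=lfs merge=lfs -text
*.txt.gz filter=lfs diff=lfs merge=lfs -text
"""

if __name__ == "__main__":
    patterns = lfs_patterns(SAMPLE)
    for name in ["2018-03.mbox", "list.txt.gz", "dump.bin"]:
        status = "tracked" if is_tracked(name, patterns) else "NOT tracked"
        print(f"{name}: {status}")
```

In a real checkout, the text would come from reading the repository's .gitattributes file rather than the SAMPLE string.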
Make sure you have enough disk space. Currently, the ietf-archives repository requires about 25 gigabytes; that number is likely to increase.
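A quick way to check free space before cloning is sketched below. It uses the roughly 25 gigabyte figure quoted above as a default threshold; the constant and function name are illustrative, and the threshold should be raised as the repository grows.

```python
import shutil

# Approximate size quoted above for ietf-archives; likely to increase.
REQUIRED_GB = 25


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    if has_enough_space("."):
        print("Enough free space for the clone.")
    else:
        print(f"Less than {REQUIRED_GB} GB free; pick another disk.")
```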
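Clone the repository as usual; there are a lot of files, but they should all be very small, because with the smudge step skipped the large files are checked out as small git-lfs pointer files.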
Next, run git lfs fetch. This will download all the mail archive files, which can take a while. Then git lfs checkout will put all the large files in place in the repo; this takes just a few minutes.
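The clone, fetch, and checkout sequence above could be scripted roughly as follows. This is only a sketch: the repository URL is a hypothetical placeholder (ask the listed contact for the real one), the function names are illustrative, and nothing runs until run_steps is called.

```python
import subprocess
from typing import List


def clone_steps(repo_url: str, dest: str) -> List[List[str]]:
    """Build the three commands described above, in order."""
    return [
        ["git", "clone", repo_url, dest],        # small pointer files only
        ["git", "-C", dest, "lfs", "fetch"],     # download the large files
        ["git", "-C", dest, "lfs", "checkout"],  # put them in the working tree
    ]


def run_steps(steps: List[List[str]]) -> None:
    """Run each command, stopping at the first failure."""
    for argv in steps:
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical URL, shown for illustration only; nothing is executed here.
    for argv in clone_steps("https://example.org/ietf-archives.git", "ietf-archives"):
        print(" ".join(argv))
```

Printing the commands first, as the example does, makes it easy to review them before committing to a long download.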
Collecting archives can take a day or so for large groups of mailing lists (like IETF or W3C), so run it on a separate server rather than your laptop.
Start with a plaintext file that has a list of mailing list archive URLs; for example, ietf_lists_normalized.txt contains the known list of mailing list archive pages (not all of them are hosted on ietf.org, and this list probably needs to be updated). Open a screen session and activate the appropriate bigbang environment. Then run a command like the following from the bigbang directory:
stdbuf -o 0 python bin/collect_mail.py -f ../ietf-archives/ietf_lists_normalized.txt --archives ../ietf-archives/ |& tee -a ../ietf-archives/log/collect-YYYYMMDD.log
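That collects all the files in the separate ietf-archives directory and appends the output (both stdout and stderr) to the log file, while still printing it to the terminal. Replace YYYYMMDD with the date of the run.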
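To avoid typing the date by hand, the log file name and the full command line can be generated. The sketch below only builds strings; it reuses the paths and flags from the command above, and the helper names are illustrative rather than part of bigbang.

```python
from datetime import date
from typing import Optional


def collect_log_path(day: Optional[date] = None,
                     log_dir: str = "../ietf-archives/log") -> str:
    """Return a log path of the form <log_dir>/collect-YYYYMMDD.log."""
    day = day or date.today()
    return f"{log_dir}/collect-{day:%Y%m%d}.log"


def collect_command(log_path: str,
                    list_file: str = "../ietf-archives/ietf_lists_normalized.txt",
                    archives_dir: str = "../ietf-archives/") -> str:
    """Build the shell command shown above with the given log path."""
    return (
        f"stdbuf -o 0 python bin/collect_mail.py -f {list_file} "
        f"--archives {archives_dir} |& tee -a {log_path}"
    )


if __name__ == "__main__":
    # Print the command for today's run; paste it into the screen session.
    print(collect_command(collect_log_path()))
```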