Jupyter notebooks to assist in creating additional analysis and visualizations of Archives Unleashed Cloud derivatives.
The following article provides a nice overview:
Deschamps, Ryan, Ruest, Nick, Lin, Jimmy, Fritz, Samantha, Milligan, Ian. The Archives Unleashed Notebook: Madlibs for Jumpstarting Scholarly Exploration. Proceedings of the 2019 IEEE/ACM Joint Conference on Digital Libraries (JCDL 2019), June 2019, Urbana-Champaign, Illinois.
We suggest using Anaconda Distribution or Docker. The notebooks depend on the following:
- Python 3.7+
- Jupyter Notebook (1.0.0)
- au_notebook (0.0.3)
- matplotlib (3.0.2)
- numpy (1.15.1)
- pandas (0.23.4)
- networkx (2.2)
- nltk (3.4.5)
  - punkt
  - vader_lexicon
  - stopwords
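To compare an existing environment against the list above, a short check along these lines may help. It is a minimal sketch, not part of the repository: `report_versions` and `PACKAGES` are hypothetical names, and only libraries whose import names are unambiguous are included (the import names for Jupyter Notebook and au_notebook are left out rather than guessed).

```python
import importlib

# Import names for the listed libraries; extend the tuple as needed.
PACKAGES = ("matplotlib", "numpy", "pandas", "networkx", "nltk")


def report_versions(names):
    """Map each module name to its __version__, "unknown", or None if missing."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # not installed in this environment
        else:
            # Some modules do not expose __version__; report that explicitly.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in report_versions(PACKAGES).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

The script only reports what is installed. It does not check that the versions match the ones listed.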
Anaconda is a Python distribution whose package manager helps you find and install packages and their dependencies, including many of the most popular ones used in data science. To run the Jupyter Notebook via Anaconda, run the following:
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
pip install -r requirements.txt
python -m nltk.downloader punkt vader_lexicon stopwords
jupyter notebook
Docker is a container platform that bundles an application together with its dependencies, so once you build the Docker image it works out of the box. To run the Jupyter Notebook via Docker, there are two options: pull the prebuilt image from Docker Hub (the first command below), or clone the repository and build the image locally (the commands after it).
docker run --rm -it -p 8888:8888 archivesunleashed/auk-notebooks
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
docker build -t auk-notebook .
docker run --rm -it -p 8888:8888 auk-notebook
This repository comes with sample data. You can swap it out for your own Archives Unleashed Cloud data by mounting a host directory over the container's data directory:
docker run --rm -it -p 8888:8888 -v "/path/to/own/data:/home/jovyan/data" auk-notebook
Note: You must grant the within-container notebook user or group (NB_UID or NB_GID) write access to the host directory (e.g., sudo chown 1000 /some/host/folder/for/work).
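If the mounted data does not show up, or the notebook cannot write its outputs, a quick permission check from a notebook cell can narrow the problem down. The following is a minimal sketch using only the standard library. `check_data_dir` is a hypothetical helper, and the path is the container-side mount point from the command above.

```python
import os


def check_data_dir(path):
    """Report whether `path` is a directory that this user can read and write."""
    return {
        "exists": os.path.isdir(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


# /home/jovyan/data is the container-side mount point used in the docker run command.
print(check_data_dir("/home/jovyan/data"))
```

If `writable` is false inside the container, the ownership change described in the note above is the likely fix.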
There are several types of visualizations that you can produce in the Jupyter Notebook. A total of 14 outputs can be generated.
- Domain Analysis: Provides information about what has been crawled (e.g. which domains) and how often.
- Text Analysis: Highlights the frequency of words through various filters including domain and year.
- Sentiment Analysis: Visualizes sentiment scores by domain and year.
- Network Analysis: Shows the connections and relationships among websites through network graph layouts.
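As a rough illustration of the kind of summary behind the Domain Analysis output, the sketch below counts how often each domain appears, overall and per crawl year. It is not the notebooks' own code: it uses only the standard library, the `(crawl_date, domain)` row layout is an assumption made for illustration, and the records are invented placeholders rather than values from the sample dataset.

```python
from collections import Counter

# Hypothetical records: (crawl date as YYYYMMDD, domain). These rows are
# made up for illustration and are not taken from the sample dataset.
records = [
    ("20140601", "example.com"),
    ("20140601", "example.org"),
    ("20140715", "example.com"),
    ("20150102", "example.com"),
    ("20150102", "example.net"),
]


def domain_counts(rows):
    """Count how often each domain appears across all rows."""
    return Counter(domain for _, domain in rows)


def domain_counts_by_year(rows):
    """Count domain occurrences per crawl year (first four date characters)."""
    by_year = {}
    for crawl_date, domain in rows:
        by_year.setdefault(crawl_date[:4], Counter())[domain] += 1
    return by_year


totals = domain_counts(records)
per_year = domain_counts_by_year(records)
print(totals.most_common(2))
print({year: dict(counts) for year, counts in per_year.items()})
```

The same grouping idea carries over to the Text Analysis and Sentiment Analysis outputs, which filter by domain and year as well.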
This repository also uses the Jupyter Docker Stacks, which provide several helpful options for customizing the container environment.
This application is available as open source under the terms of the Apache License, Version 2.0.
The example dataset in the data directory was created with the Archives Unleashed Cloud, and is drawn from the B.C. Teachers' Labour Dispute (2014), collected by the University of Victoria Libraries. We are grateful that they've allowed us to use this material. The full-text derivative file is a random sample (37,000 lines) of the complete file because of GitHub file size limitations.
If you use this material, please cite it along the following lines:
- Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.
- University of Victoria Libraries, B.C. Teachers' Labour Dispute (2014), Archive-It Collection 4867, https://archive-it.org/collections/4867.
This work is primarily supported by the Andrew W. Mellon Foundation. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.