An informal archive of music on the web before the age of MP3s. It is not intended to be comprehensive; instead, the archive serves as the corpus for a companion machine learning model. Visit /midi-archive-neural-net for the accompanying model, and visit https://reubenson.com/midi-archive/ to see the project in production.
This repository uses Scrapy to collect MIDI files from pre-Y2K websites, and Eleventy for static site generation.
See https://medium.com/@reubenson/archives-ai-and-music-of-the-early-web-9b2f51fdef47 for a broad introduction to the project.
flowchart TD
A(Scrape MIDI files from early web) --> Z(Present model output and interactive MIDI archive on Eleventy static site)
A --> B(Process MIDI into tokens for training neural net)
B --> C(Fetch tokens and tokenizer)
C --> D(Train neural net model)
D --> E(Export model to ONNX)
E --> F(Deploy to AWS Lambda)
F --> |Daily Cloudwatch Trigger|G(Generate MIDI sequence and save to S3)
G --> Z
- activate the venv (adjust the path to your local checkout)
source /Users/reubenson/Projects/midi-archive/.venv/bin/activate
- install Scrapy
python3 -m pip install Scrapy
cd scraper
(need to be in the same directory as scrapy.cfg)
- run scraper with
scrapy crawl archive -s LOG_LEVEL=WARNING
- before running that command, update the target in the script
The current process for updating the archive is a bit manual:
- Run the scraper tool, located in
/scraper
- Run the script for tokenizing all the MIDI files, at
/scripts/apply_tokenizer.py
- Zip up the token JSON and upload it to S3 with
/scripts/deploy_assets.sh
- MIDI Archive assets are now ready to be ingested by the ML model
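The manual steps above could be chained in a small helper. This is only a sketch, not a script that exists in the repository: the function names `build_update_steps` and `run_update` are hypothetical, and it assumes the tokenizer script runs with `python3` without arguments and the deploy script runs with `bash`, both from the repository root.

```python
import subprocess
from pathlib import Path


def build_update_steps(repo_root):
    """Return (command, working directory) pairs for the manual update steps."""
    root = Path(repo_root)
    return [
        # Scrapy must run from the directory that contains scrapy.cfg.
        (["scrapy", "crawl", "archive", "-s", "LOG_LEVEL=WARNING"], root / "scraper"),
        # Assumed invocation: the tokenizer script takes no arguments.
        (["python3", "scripts/apply_tokenizer.py"], root),
        # Assumed invocation: the deploy script is run with bash.
        (["bash", "scripts/deploy_assets.sh"], root),
    ]


def run_update(repo_root):
    """Run each step in order, stopping at the first failure."""
    for command, cwd in build_update_steps(repo_root):
        subprocess.run(command, cwd=cwd, check=True)


# Example (not executed here): run_update(".")
```

Because `check=True` raises on a non-zero exit code, a failed scrape would stop the run before tokenizing or uploading stale data.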
- Scrapy architecture
- The spider crawls pages, and then sends additional requests for assets like MIDI files
- These requests get passed through response middleware, where the response payload is then saved to disk in the assets directory
- Each page is also saved to disk, which is handled in a Pipeline
- Before getting saved to disk, the HTML needs to be updated, such that references to the assets (MIDI, CSS, images) are updated to point to the self-hosted paths
- In order to do this, the spider will keep track of every successful request and provide the asset path transformations needed to update HTML in the Pipeline
- HTML will be saved in a .md file with some additional heading data
- Each page the spider crawls will result in a markdown file, which will then be processed by 11ty to result in HTML pages served via the _sites directory
- Currently, the site is hosted on GitHub Pages, which serves all static pages and assets from the docs directory
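The HTML rewriting step described above can be illustrated with a minimal, standard-library sketch. It is not the repository's actual Pipeline code: `rewrite_asset_paths`, `to_markdown_page`, the `path_map` structure, and the front matter field names (`title`, `source_url`) are all assumptions made for illustration.

```python
import re

# Matches href="..." or src="..." attributes (single or double quoted).
ATTR_PATTERN = re.compile(r'\b(href|src)\s*=\s*(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_asset_paths(html, path_map):
    """Replace asset URLs in href/src attributes with self-hosted paths.

    path_map is a hypothetical dict of original URL -> local path, built
    from the requests the spider completed successfully.
    """
    def replace(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        local = path_map.get(url, url)  # leave unknown URLs untouched
        return f"{attr}={quote}{local}{quote}"

    return ATTR_PATTERN.sub(replace, html)


def to_markdown_page(html, title, source_url):
    """Prefix the rewritten HTML with front matter (field names are assumed)."""
    front_matter = f'---\ntitle: "{title}"\nsource_url: "{source_url}"\n---\n'
    return front_matter + html


# Example usage with made-up URLs and paths.
page_html = '<a href="http://example.com/song.mid">Song</a>'
mapping = {"http://example.com/song.mid": "/assets/song.mid"}
markdown_page = to_markdown_page(
    rewrite_asset_paths(page_html, mapping), "Example page", "http://example.com/"
)
```

Looking up each URL in a mapping of successful requests means links to assets that failed to download are left pointing at their original location rather than at a missing local file.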
- Serve MIDI assets via S3 instead of GitHub
- Expand the archive with new and exciting MIDI discoveries
- Re-implement MIDI player with web components if more complicated features are to be added