
midi-archive

An informal archive of music on the web before the age of MP3s. The archive is not intended to be comprehensive; it exists alongside a companion machine learning model that uses the archive as its corpus. Visit /midi-archive-neural-net for the accompanying machine learning model, and https://reubenson.com/midi-archive/ to see the project in production.

This repository uses Scrapy to collect MIDI files from pre-Y2K websites, and Eleventy for static site generation.

See https://medium.com/@reubenson/archives-ai-and-music-of-the-early-web-9b2f51fdef47 for a broad introduction to the project.

General Project Workflow

```mermaid
flowchart TD
    A(Scrape MIDI files from early web) --> Z(Present model output and interactive MIDI archive on Eleventy static site)
    A --> B(Process MIDI into tokens for training neural net)
    B --> C(Fetch tokens and tokenizer)
    C --> D(Train neural net model)
    D --> E(Export model to ONNX)
    E --> F(Deploy to AWS Lambda)
    F --> |Daily Cloudwatch Trigger|G(Generate MIDI sequence and save to S3)
    G --> Z
```
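For a rough picture of the last two steps in this flow (daily inference on Lambda, then saving to S3), here is a minimal sketch. The bucket, key, model path, and input name below are placeholder assumptions, not the project's actual configuration:

```python
# Hypothetical sketch of the daily Lambda handler: run the ONNX model and
# persist the generated token sequence to S3. The bucket, key, model path,
# and input name are illustrative assumptions.
import json

import boto3
import numpy as np
import onnxruntime as ort

s3 = boto3.client("s3")

def handler(event, context):
    session = ort.InferenceSession("/opt/model.onnx")  # assumed model location
    seed = np.array([[1]], dtype=np.int64)  # assumed start-token prompt
    outputs = session.run(None, {"input": seed})  # "input" name is an assumption
    s3.put_object(
        Bucket="midi-archive-assets",  # hypothetical bucket
        Key="generated/latest.json",
        Body=json.dumps(outputs[0].tolist()),
    )
    return {"statusCode": 200}
```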

Installation / Development

  • initialize the virtual environment: source .venv/bin/activate (run from the repository root)
  • install Scrapy: python3 -m pip install Scrapy
  • cd scraper (Scrapy commands must be run from the directory containing scrapy.cfg)
  • update the scrape target in the spider script (a hypothetical example is sketched below), then run the scraper with scrapy crawl archive -s LOG_LEVEL=WARNING
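For orientation, the "target" refers to the site the spider is pointed at. A minimal sketch of what that configuration might look like is below; the class body, domain, and URLs are hypothetical placeholders, not the actual spider in /scraper:

```python
# Hypothetical sketch of the spider's target configuration; the real spider
# lives in /scraper, and the domain/URL below are placeholders.
import scrapy

class ArchiveSpider(scrapy.Spider):
    name = "archive"  # matches `scrapy crawl archive`
    allowed_domains = ["example-midi-site.com"]  # placeholder target
    start_urls = ["http://example-midi-site.com/midi/"]  # placeholder entry page

    def parse(self, response):
        # Follow in-site links; the real parsing/asset logic is omitted here
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```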

Scraping Workflow

The current process for updating the archive is a bit manual:

  • Run the scraper tool, located in /scraper
  • Run the script that tokenizes all the MIDI files, at /scripts/apply_tokenizer.py (a simplified tokenization is sketched after this list)
  • Zip the token JSON and upload it to S3 with /scripts/deploy_assets.sh
  • The MIDI Archive assets are then ready to be ingested by the ML model
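The real tokenization scheme lives in /scripts/apply_tokenizer.py. As a simplified illustration only, an event-level tokenization using mido (the library and vocabulary here are assumptions, not necessarily what the script uses) could look like:

```python
# Simplified illustration of MIDI-to-token conversion. This is NOT the
# project's tokenizer; the vocabulary and the mido library are assumptions.
import json
from pathlib import Path

import mido

def tokenize(midi_path):
    tokens = []
    for msg in mido.MidiFile(midi_path):  # messages in playback order
        if msg.type == "note_on" and msg.velocity > 0:
            tokens.append(f"NOTE_ON_{msg.note}")
        elif msg.type in ("note_off", "note_on"):  # note_on with velocity 0 acts as note_off
            tokens.append(f"NOTE_OFF_{msg.note}")
    return tokens

# "scraper/assets" is an assumed location for the scraped MIDI files
corpus = {str(p): tokenize(p) for p in Path("scraper/assets").rglob("*.mid")}
Path("tokens.json").write_text(json.dumps(corpus))
```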

Scraper

  • Scrapy architecture
  • The spider crawls pages, then sends additional requests for assets such as MIDI files
    • These requests pass through response middleware, where the response payload is saved to disk in the assets directory
  • Each page is also saved to disk, which is handled in a Pipeline (a sketch follows this list)
    • Before being saved, the HTML is updated so that references to assets (MIDI, CSS, images) point to their self-hosted paths
      • To do this, the spider keeps track of every successful request and provides the asset path transformations the Pipeline needs to update the HTML
    • The HTML is saved in a .md file with some additional heading data
  • Each page the spider crawls results in a markdown file, which 11ty then processes into HTML pages served via the _sites directory
    • Currently, the site is hosted on GitHub Pages, which points to the docs directory to serve all static pages and assets
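As a rough sketch of such a Pipeline, assuming the spider exposes its recorded path transformations on a hypothetical asset_map attribute, and that each item carries the page's url, slug, and raw html (all names here are illustrative, not the repository's actual code):

```python
# Hypothetical sketch of the HTML-rewriting Pipeline; asset_map, the item
# fields, and the output layout are illustrative assumptions.
from pathlib import Path

class PageToMarkdownPipeline:
    def process_item(self, item, spider):
        html = item["html"]
        # Rewrite asset references to self-hosted paths, using the
        # transformations the spider recorded for each successful request
        for original_url, local_path in spider.asset_map.items():
            html = html.replace(original_url, local_path)
        # Save as .md with heading (front matter) data for Eleventy
        out = Path("pages") / f"{item['slug']}.md"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(f"---\nsource: {item['url']}\n---\n{html}\n")
        return item
```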

To Do

  • Serve MIDI assets via S3 instead of GitHub
  • Expand the archive with new and exciting MIDI discoveries
  • Re-implement MIDI player with web components if more complicated features are to be added