@philpax commented Nov 13, 2025

At present, this PR contains the raw download of the entire wiki as the Internet Archive has it, up to 2021-01-01 (the OVH fire was on 2021-03-10. Hmm. I should have used that date instead.)

This was obtained using

gem install wayback_machine_downloader_straw
wayback_machine_downloader https://wiki.jc-mp.com -t 20210101 -d archive

Next steps involve:

  • removing anything that's not of relevance
  • matching up every file to its corresponding article, including articles that don't exist in the 2014 dump
    • this also involves relocating the images, in whatever form we have them
  • extracting the actual content from each page, then feeding that content to an LLM to merge into the existing page or create a new page
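
As a rough sketch of the matching step, assuming the wiki follows MediaWiki's usual title rules (underscores and spaces are interchangeable, the first letter is capitalised) and that a page name can be read off each downloaded file's name: the helpers `normalize_title` and `split_by_existing` below are hypothetical names, not part of this PR or of any tool used here.

```python
from urllib.parse import unquote


def normalize_title(raw):
    """Turn a URL-style page name into a comparable article title.

    Assumption: MediaWiki-style titles, where "_" and " " are equivalent
    and the first character is upper-cased.
    """
    title = unquote(raw).replace("_", " ")
    title = " ".join(title.split())  # collapse repeated whitespace
    return title[:1].upper() + title[1:]


def split_by_existing(archived_names, existing_titles):
    """Split archived page names into ones already in the 2014 dump and new ones."""
    existing = {normalize_title(t) for t in existing_titles}
    matched, new = [], []
    for name in archived_names:
        title = normalize_title(name)
        (matched if title in existing else new).append(title)
    return matched, new


if __name__ == "__main__":
    # Made-up page names, purely for illustration.
    matched, new = split_by_existing(
        ["getting_Started%21", "Some_New_Page"],
        ["Getting Started!"],
    )
    print("already in dump:", matched)
    print("needs a new page:", new)
```

Anything that lands in the second list would be a candidate for a newly created page rather than a merge.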

We can ideally pose this task well enough that it can be done overnight with local LLMs. It's going to take a bit of work to get it right, though.
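
Part of posing the task well is handing the LLM only the article body rather than the full archived HTML. The following is a minimal sketch using only the Python standard library, assuming the archived pages are MediaWiki output with the body inside an element whose id is `mw-content-text` (not verified against the archive here). `ContentExtractor` and `extract_text` are hypothetical names, and the depth tracking assumes reasonably well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collects the text inside elements with a given id."""

    def __init__(self, container_id="mw-content-text"):
        super().__init__(convert_charrefs=True)
        self.container_id = container_id
        self.depth = 0       # 0 means we are outside the container
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            # Enter the container when its id matches; ignore everything else.
            if dict(attrs).get("id") == self.container_id:
                self.depth = 1
            return
        self.depth += 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> neither open nor close a level.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, container_id="mw-content-text"):
    """Return the container's text, one text fragment per line."""
    parser = ContentExtractor(container_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

This drops navigation, footers, scripts, and the Wayback Machine's injected markup, as long as they sit outside the content container. A real pass would likely want to preserve headings, links, and image references rather than flattening everything to text.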
