Ritual is a very simple 4chan/lainchan archiver that complies with the Asagi schema.
It is built with Python 3.12 and uses SQLite as its database. MySQL may be supported in the future.
Ritual is under 1,000 lines of Python code. For comparison:
- NeoFuuka has 2,600 lines of Python code
- Hayden has over 11,400 lines of C# code
The requests library is Ritual's only dependency.
The second advantage of Ritual is its flexible configuration. You choose whether to download text, thumbnails, and/or full media at the thread level, not the global level.
Ritual is simple. It has no threading and no complex database, and its main loop can be read in under a minute.
Ritual will create schemas for you.
- Create a file called configs.py using rename_to_configs.py, and configure it.
- Create a virtualenv and install dependencies:
  - python3.12 -m venv venv
  - source venv/bin/activate
  - python3.12 -m pip install -r requirements.txt
- screen -S ritual (you might need to sudo apt install screen)
- python3.12 main.py to run the scraper.
- ctrl-A, d to leave the screen
- screen -r ritual to reattach to the screen
The code was recently updated to make it fully Asagi schema compliant. Please make a backup of your database, then run the script ./migrations/migrate_asagi.py. After the migration, you can run VACUUM; on your database. The result will be a .db file the same size as before the migration.
sqlite3 /path/to/db "VACUUM INTO '/path/to/backup'"
sqlite3 /path/to/backup 'PRAGMA integrity_check' # optional
gzip /path/to/backup # optional
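The same backup steps can also be scripted with Python's standard library. The following is a minimal sketch, not part of Ritual: the function name backup_sqlite and its parameters are hypothetical, and it assumes an SQLite build of version 3.27 or newer, which VACUUM INTO requires.

```python
import gzip
import os
import shutil
import sqlite3


def backup_sqlite(db_path: str, backup_path: str, compress: bool = False) -> str:
    """Copy db_path to backup_path with VACUUM INTO, verify it, optionally gzip it."""
    # VACUUM INTO refuses to overwrite an existing non-empty file, so fail early.
    if os.path.exists(backup_path):
        raise FileExistsError(backup_path)

    # isolation_level=None means autocommit; VACUUM cannot run inside a transaction.
    src = sqlite3.connect(db_path, isolation_level=None)
    try:
        src.execute("VACUUM INTO ?", (backup_path,))
    finally:
        src.close()

    # Equivalent of: sqlite3 /path/to/backup 'PRAGMA integrity_check'
    dst = sqlite3.connect(backup_path)
    try:
        result = dst.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dst.close()
    if result != "ok":
        raise RuntimeError(f"integrity_check failed: {result}")

    if compress:
        # Equivalent of: gzip /path/to/backup (gzip also removes the uncompressed file)
        gz_path = backup_path + ".gz"
        with open(backup_path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        os.remove(backup_path)
        return gz_path
    return backup_path
```

Calling backup_sqlite('/path/to/db', '/path/to/backup', compress=True) would mirror the three shell commands above, including the two optional ones.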
Here is how the flexible archive configurations work.
- op_comment_min_chars and op_comment_min_chars_unique filter everything first.
- If a post is blacklisted and whitelisted, it will not be archived - blacklisted filters take precedence over whitelisted filters.
- If only a blacklist is specified, skip blacklisted posts, and archive everything else.
- If only a whitelist is specified, archive whitelisted posts, and skip everything else.
- If no white/black lists are specified, archive everything.
- If a thread is marked as "should archive" by the above rules, media downloads can be further filtered with dl_thumbs and dl_full_media.
- To download all/no media, specify True/False. To filter media, assign a regex pattern.
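The precedence rules above can be sketched as two small functions. This is an illustration only, not Ritual's actual code: the names should_archive and should_download are hypothetical, and whole-text, case-insensitive matching with re.fullmatch is an assumption about how the patterns are applied. Counting unique characters with set() is likewise an assumption.

```python
import re


def _matches(pattern: str, text: str) -> bool:
    # Assumption: a pattern must match the whole text, ignoring case.
    return re.fullmatch(pattern, text, flags=re.IGNORECASE | re.DOTALL) is not None


def should_archive(board_conf: dict, op_subject: str, op_comment: str) -> bool:
    """Apply the length filters first, then the blacklist, then the whitelist."""
    # 1. Length filters run before anything else.
    if len(op_comment) < board_conf.get("op_comment_min_chars", 0):
        return False
    if len(set(op_comment)) < board_conf.get("op_comment_min_chars_unique", 0):
        return False

    text = f"{op_subject}\n{op_comment}"
    blacklist = board_conf.get("blacklist")
    whitelist = board_conf.get("whitelist")

    # 2. The blacklist takes precedence over the whitelist.
    if blacklist and _matches(blacklist, text):
        return False
    # 3. If a whitelist exists, only matching threads are kept.
    if whitelist:
        return _matches(whitelist, text)
    # 4. No whitelist: archive everything that was not blacklisted.
    return True


def should_download(setting, post_text: str) -> bool:
    """Interpret a dl_thumbs / dl_full_media value: True, False, unset, or a regex."""
    if isinstance(setting, bool):
        return setting
    if setting is None:
        return False
    return _matches(setting, post_text)
```

Keeping the archive decision separate from the media decision mirrors the two-stage description: a thread first has to qualify for archiving, and only then are the media settings consulted.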
Here is an example from rename_to_configs.py:
boards = {
    'g': {
        'blacklist': '.*(local models).*', # if an OP contains "local models" in the subject or comment - skip thread
        'whitelist': '.*(home server|linux).*', # if not, then for OPs with "home server" or "linux" in the subject or comment...
        'dl_thumbs': '.*(home server general).*', # download thumbnails, but ONLY if it's a "home server general"
        'dl_full_media': '.*(wireguard).*', # if anyone mentions "wireguard", get the full media if applicable
        'thread_text': True, # archive the text if we pass the black/white lists.
    },
    'gif': {
        'thread_text': True, # only gather thread text from /gif/ - no files
    },
    'ck': {
        'whitelist': '.*Coffee Time General.*', # only gather thread text and thumbnails from "Coffee Time General" threads on /ck/
        'dl_thumbs': True,
        'dl_full_media': False,
        'thread_text': True,
    },
    't': {
        'dl_full_media_op': True, # download all thread text, but only thumbnails and full media for the OP posts on /t/
        'dl_thumbs_op': True,
        'thread_text': True,
    },
    'biz': {
        'thread_text': True,
        'op_comment_min_chars': 4, # OP comment must be at least 4 characters long (does not archive: "omg", ".", "lol", etc.)
        'op_comment_min_chars_unique': 3, # OP comment must have at least 3 unique characters (does not archive: ".", "lol", "hahaha", "aaaaa", etc.)
    },
}
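Because each board is a plain dict, a misspelled key or a broken regex is easy to miss. Below is a minimal sketch of a sanity check that could be run before starting the scraper. It is not part of Ritual: validate_boards and KNOWN_KEYS are hypothetical names, and the key list is taken only from the example above, so it may be incomplete.

```python
import re

# Keys that appear in the example config; Ritual may accept others.
KNOWN_KEYS = {
    "blacklist", "whitelist", "thread_text",
    "dl_thumbs", "dl_full_media", "dl_thumbs_op", "dl_full_media_op",
    "op_comment_min_chars", "op_comment_min_chars_unique",
}


def validate_boards(boards: dict) -> list[str]:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for board, conf in boards.items():
        for key, value in conf.items():
            if key not in KNOWN_KEYS:
                problems.append(f"{board}: unknown key {key!r}")
            elif isinstance(value, str):
                # String values are treated as regex patterns, so they must compile.
                try:
                    re.compile(value)
                except re.error as exc:
                    problems.append(f"{board}.{key}: invalid regex ({exc})")
    return problems
```

For example, validate_boards(boards) on the config above should return an empty list, while a typo such as 'db_full_media' would be reported as an unknown key.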