Right now I am actively working on:
- Aozora Corpus Builder for AWS Lambda, which does event-triggered processing of Aozora Bunko HTML files (intended for ad-hoc S3 uploads, not bulk processing of the whole archive)
- Contributing to Narwhals, a compatibility layer for Python dataframe libraries
These are older tutorials (with Python code) for pre-processing Japanese text datasets for use with common analysis software, shared as-is. Please freely reuse, fork, adapt, and/or steal them for your own purposes -- that's why they're here!
- Aozora Corpus Builder for Aozora Bunko HTML files
- Taiyō Corpus Tools for NINJAL's early-1900s Taiyō magazine XML corpus
The writeups are much longer than the code itself! I created them as a resource for getting started with the niche technical issues you'll often encounter when trying to use Japanese data sources. (Non-Unicode encodings and the lack of word boundaries are the main challenges.)
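To make those two challenges concrete, here is a minimal Python sketch. It assumes the common case of a source file distributed in Shift-JIS (as Aozora Bunko files are); the strings are illustrative, and a real project would use a morphological analyzer such as MeCab/fugashi for segmentation, which is not shown here.

```python
# Challenge 1: legacy encodings. Many Japanese sources are Shift-JIS, not UTF-8,
# so the bytes must be decoded explicitly -- a naive UTF-8 read will fail.
raw = "青空文庫".encode("shift_jis")   # simulate bytes read from a downloaded file
text = raw.decode("shift_jis")         # explicit decode recovers the text
assert text == "青空文庫"
assert raw != "青空文庫".encode("utf-8")  # the byte sequences genuinely differ

# Challenge 2: no word boundaries. Japanese is written without spaces, so
# whitespace splitting (the default in most analysis tools) finds one "word":
tokens = "吾輩は猫である".split()
print(len(tokens))  # → 1: the whole sentence comes back as a single token
```

This is why pre-processing pipelines for Japanese text typically pair an encoding-conversion step with a dedicated tokenization step before handing data to general-purpose analysis software.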
Each of my projects above has its own dataset-specific resource section, but you might be interested in a more extensive guide at the East Asian Digital Humanities page (external link). It includes a semester-long course syllabus, weekend workshop materials, and previous blog posts about the Aozora project. It is no longer actively updated, so be aware that nothing there is more recent than late 2019.
East Asian Digital Humanities has been taught since 2021 as part of UPenn's annual Dream Lab digital humanities workshop series, by Paula Curtis and Paul Vierthaler. Paula has taught extensively about Japanese text mining and digital methods, and you can find more information on her website.
Check out Digital Humanities Japan for a wiki and mailing list promoting resource-sharing and collaboration on Japanese-language digital projects and tech issues.
