Text corpora in languages other than English, curated with an eye towards digital humanities use.
This is not a directory but a moderately opinionated, potentially one-time list of resources that might be useful to digital humanities folks working with languages other than English. That said, if you have suggestions, you can make a pull request or fill out this form.
- China Historical GIS: "comprehensive series of datasets related to the administrative geography of Chinese History. The data layers include nationwide coverages (for the years 1820 and 1911), and time series (for the Dynastic period from 221 BCE to 1911 CE). The administrative features include Provinces, Circuits, Prefectures, and Counties as they changed over time."
- Chinese Biographical Database Project (CBDB): Harvard project, freely accessible relational database with biographical information about approximately 491,000 individuals as of May 2021, primarily from the 7th through 19th centuries.
- CNKI - 中国知网: well-supported (and funded), easily accessible, some censorship and missing articles.
- ctext.org: online open-access digital library, with the full text of various Chinese texts of philosophical, historical, or linguistic interest from the pre-Qin era through to the Han dynasty and beyond.
- Kanseki Repository: downloadable, CC-licensed collection of premodern Chinese texts. Corpus is also available for download on GitHub.
- Scripta Sinica - 漢籍全文資料庫: 1,349 new titles and 754,200,198 characters of materials pertaining to the traditional Chinese classics.
- The Bookshelf: images + text for rare and ancient books.
- Epistemological Letters: correspondence in English, German, and French about the field of physics between November 1973 and October 1984.
- EpiDat: database of Jewish tombstones (includes Jewish tombstones in Hebrew as well). As a database for Jewish gravestone epigraphy, EpiDat is used to inventory, document, edit, and present epigraphic holdings. Currently, inscriptions from Jewish cemeteries spanning nine centuries and six countries are made available via chronological, spatial, and thematic approaches.
- EpiDat: database of Jewish tombstones (includes Jewish tombstones in German as well). As a database for Jewish gravestone epigraphy, EpiDat is used to inventory, document, edit, and present epigraphic holdings. Currently, inscriptions from Jewish cemeteries spanning nine centuries and six countries are made available via chronological, spatial, and thematic approaches.
- Aozora Search: digitized text with PhiloLogic text mining tools
- SAT Daizōkyō Text Database: full text of 85 volumes of Taishō Shinshū Daizōkyō (大正新脩大藏經). The digitization and encoding project is also encoding new characters.
- Digital Tale of Genji
- Organization: East Asia TEI Special Interest Group run by Kiyonori Nagasaki & A. Charles Muller with a wiki and GitHub.
- Digital Scholarly Edition of Habsburg-Ottoman Diplomatic Sources 1500–1918 (Arno Strohmeyer et al., 2022 onward): TEI digital scholarly edition of sources from Ottoman (Turkish) and Austrian (Early Modern German) archival holdings, with English translations for the Ottoman sources. Currently focused on 18th-century grand embassies.
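
Several of the resources above are encoded in TEI. As a rough starting point for working with such files, here is a minimal sketch that pulls the title and paragraph text out of a TEI document using only the Python standard library. It assumes a TEI P5 document in the standard TEI namespace (`http://www.tei-c.org/ns/1.0`). The sample document and the `extract_tei` helper are invented for illustration and are not taken from any edition listed here; real editions will have richer markup and may need different handling.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# Hypothetical sample document, invented for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph with a <persName>named person</persName>.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def extract_tei(xml_string):
    """Return (title, paragraphs) from a TEI document given as a string."""
    root = ET.fromstring(xml_string)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    body = root.find(".//tei:body", TEI_NS)
    paragraphs = []
    if body is not None:
        # Walk every <p> under <body>, at any depth.
        for p in body.iter(f"{{{TEI_URI}}}p"):
            # itertext() flattens inline markup such as <persName>;
            # split/join collapses runs of whitespace.
            paragraphs.append(" ".join("".join(p.itertext()).split()))
    return title, paragraphs


title, paragraphs = extract_tei(SAMPLE_TEI)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```

Flattening inline elements with `itertext()` discards the markup, which is fine for plain-text mining but loses the annotations (names, places, dates) that make TEI editions valuable; keep the element tree around if you need those.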