An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Updated Oct 30, 2025 - Python
Crawler for linguistic corpora
Data for the quantitative study of (Vedic) Sanskrit
Large silver-standard Russian corpus with NER, morphology and syntax markup
A set of workflows for corpus building through OCR, post-correction and normalisation
Amharic-English Machine Translation Corpus prepared through website crawling and custom preprocessing.
CoNLL-U to pandas DataFrame
Yet another search platform for linguistic corpora.
Vietnamese Wikipedia Corpus
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Preprocessing and analysis for training SNOMED-CT concept embeddings from CORD-19 corpus
Measure the similarity of text corpora for 74 languages
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
Simple bs4-based web crawler for a corpus in need of statistical machine translation
Scraper
Word-level Filipino wordlist
The great textmining tool that obviates all others
TextDirectory allows you to filter, transform, and combine multiple text files into one aggregated file.
Scripts for building a geo-located web corpus using Common Crawl data
FastAPI backend for Chinese dialect geolinguistics research: phonological queries, Praat acoustic analysis, ML-based spatial clustering, and village dialect mapping APIs.