corpus-linguistics

Here are 131 public repositories matching this topic...

BLKSerene / Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

translation tokenizer corpus linguistics tagger literature dependency-parser corpus-linguistics lemmatizer corpus-tools corpus-processing corpus-search corpus-statistics stopword corpus-analysis

Updated Oct 30, 2025
Python

google / corpuscrawler

Crawler for linguistic corpora

crawling linguistics corpus-linguistics corpus-builder minority-language

Updated Aug 18, 2025
Python

OliverHellwig / sanskrit

Data for the quantitative study of (Vedic) Sanskrit

corpus-linguistics sanskrit historical-linguistics ancient-languages

Updated Mar 5, 2026
Python

natasha / nerus

Large silver-standard Russian corpus with NER, morphology and syntax markup

python nlp syntax morphology russian corpus-linguistics ner

Updated Apr 13, 2026
Python

LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

nlp workflow ocr computational-linguistics corpus-linguistics folia corpus-tools

Updated Sep 7, 2022
Python

MarsPanther / Amharic-English-Machine-Translation-Corpus

Amharic-English machine translation corpus prepared through website crawling and custom preprocessing.

nlp natural-language-processing machine-translation corpus ethiopia amharic corpus-linguistics nlp-machine-learning amharic-corpus ethiopian english-date amaharic-english

Updated Aug 2, 2018
Python

interrogator / conll-df

CoNLL-U to pandas DataFrame

nlp grammar pandas linguistics corpus-linguistics universal-dependencies conll-u

Updated Nov 21, 2017
Python

timarkh / tsakorpus

Yet another search platform for linguistic corpora.

flask elasticsearch corpus linguistics corpus-linguistics corpus-tools linguistic-corpora language-documentation parallel-corpora media-aligned-corpora

Updated May 6, 2026
Python

undertheseanlp / corpus.viwiki

Vietnamese Wikipedia Corpus

vietnamese corpus-linguistics corpus-data vietnamese-nlp

Updated May 18, 2017
Python

EdwardSeley / lyrics-corpora

An unofficial Python API that lets users create a corpus of lyrics from their favorite artists and the Billboard charts

python music artists lyrics corpus songs python-api corpora corpus-linguistics scrapper scraping-websites corpus-tools billboard-charts

Updated Jul 2, 2018
Python

drgriffis / text-essence

Preprocessing and analysis for training SNOMED-CT concept embeddings from the CORD-19 corpus

natural-language-processing embeddings corpus-linguistics representation-learning

Updated Aug 4, 2023
Python

ilinguistics / corpus_similarity

Measure the similarity of text corpora for 74 languages

nlp language natural-language-processing text corpus corpora corpus-linguistics corpus-tools corpus-processing

Updated Jan 26, 2024
Python

CompLin / nheengatu

Tools and resources for the computational processing of Nheengatu (Modern Tupi)

natural-language-processing dictionary tokenizer computational-linguistics corpus-linguistics pos-tagger tokenization nheengatu modern-tupi

Updated Mar 17, 2026
Python

MarsPanther / crawl-for-parallel-corpora

Simple bs4-based web crawler for collecting a corpus for statistical machine translation

nlp natural-language-processing translation machine-translation amharic corpus-linguistics corpus-data amharic-corpus ethiopian-languages

Updated Aug 17, 2021
Python

magizbox / scraper

Scraper

crawler vietnamese corpus-linguistics corpus-data vietnamese-nlp

Updated Dec 21, 2018
Python

AustinZuniga / Filipino-wordlist

Word-level Filipino wordlist

corpus wordlist corpus-linguistics corpus-data tagalog tagalog-dictionary filipino filipino-language dictionary-data filipino-wordlist filipino-corpus tagalog-words

Updated Dec 20, 2018
Python

CentreForDigitalHumanities / Textcavator

The great textmining tool that obviates all others

text-analysis digital-humanities corpus-linguistics literary-studies digital-history corpus-search

Updated May 12, 2026
Python

IngoKl / textdirectory

TextDirectory allows you to filter, transform, and combine multiple text files into one aggregated file.

python nlp data-science database text corpus corpus-linguistics plaintext

Updated Oct 2, 2022
Python

ilinguistics / common_crawl_corpus

Scripts for building a geo-located web corpus using Common Crawl data

corpora corpus-linguistics web-crawling corpus-tools corpus-processing

Updated Jan 18, 2026
Python

jengzang / dialects-backend

FastAPI backend for Chinese dialect geolinguistics research: phonological queries, Praat acoustic analysis, ML-based spatial clustering, and village dialect mapping APIs.

python redis machine-learning rest-api sqlite phonology corpus-linguistics praat fastapi

Updated May 12, 2026
Python
