Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Mar 17, 2025 - Python
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
A very simple news crawler with a funny name
Bitextor generates translation memories from multilingual websites
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
Utilities for Processing the Switchboard Dialogue Act Corpus
An open source reimplementation of Benny Brodda's BETA in Python
A set of workflows for corpus building through OCR, post-correction and normalisation
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
A parser for annotated MuseScore 3 files.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Yet another search platform for linguistic corpora.
Searching in-memory corpus with Corpus Query Language (CQL)
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Measure the similarity of text corpora for 74 languages
Scripts for building a geo-located web corpus using Common Crawl data
Library for Python to use Korp API