Crazy Awesome Python

A selection of 52 curated NLP Python libraries and frameworks, ordered by stars.

Check out the interactive version that you can filter and sort:

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
339 stars per week over 168 weeks
57,243 stars, 13,520 forks, 792 watches
created 2018-10-29, last commit 2022-01-21, main language Python
bert, deep-learning, flax, hacktoberfest, jax, language-model, language-models, machine-learning, model-hub, natural-language-processing, nlp, nlp-library, pretrained-models, python, pytorch, pytorch-transformers, seq2seq, speech-recognition, tensorflow, transformer

💫 Industrial-strength Natural Language Processing (NLP) in Python
56 stars per week over 394 weeks
22,253 stars, 3,691 forks, 564 watches
created 2014-07-03, last commit 2022-01-21, main language Python
ai, artificial-intelligence, cython, data-science, deep-learning, entity-linking, machine-learning, named-entity-recognition, natural-language-processing, neural-network, neural-networks, nlp, nlp-library, python, spacy, text-classification, tokenization

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
67 stars per week over 229 weeks
15,465 stars, 4,016 forks, 327 watches
created 2017-08-29, last commit 2022-01-21, main language Python
artificial-intelligence, python, pytorch

Code for the paper "Language Models are Unsupervised Multitask Learners"
98 stars per week over 153 weeks
15,196 stars, 3,884 forks, 585 watches
created 2019-02-11, last commit 2020-12-02, main language Python

Topic Modelling for Humans
22 stars per week over 571 weeks
12,855 stars, 4,151 forks, 431 watches
created 2011-02-10, last commit 2021-12-24, main language Python
data-mining, data-science, document-similarity, fasttext, gensim, information-retrieval, machine-learning, natural-language-processing, neural-network, nlp, python, topic-modeling, word-embeddings, word-similarity, word2vec

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
124 stars per week over 95 weeks
11,867 stars, 1,447 forks, 241 watches
created 2020-03-26, last commit 2022-01-21, main language Python
computer-vision, datasets, deep-learning, evaluation, machine-learning, metrics, natural-language-processing, nlp, numpy, pandas, pytorch, tensorflow

A very simple framework for state-of-the-art Natural Language Processing (NLP)
59 stars per week over 188 weeks
11,167 stars, 1,804 forks, 202 watches
created 2018-06-11, last commit 2022-01-18, main language Python
machine-learning, named-entity-recognition, natural-language-processing, nlp, pytorch, semantic-role-labeling, sequence-labeling, word-embeddings

An open-source NLP research library, built on PyTorch.
43 stars per week over 244 weeks
10,763 stars, 2,154 forks, 282 watches
created 2017-05-15, last commit 2022-01-14, main language Python
data-science, deep-learning, natural-language-processing, nlp, python, pytorch

NLTK Source
16 stars per week over 645 weeks
10,389 stars, 2,541 forks, 476 watches
created 2009-09-07, last commit 2022-01-11, main language Python
machine-learning, natural-language-processing, nlp, nltk, python

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
34 stars per week over 247 weeks
8,578 stars, 1,742 forks, 294 watches
created 2017-04-24, last commit 2022-01-21, main language Python

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
14 stars per week over 559 weeks
8,138 stars, 1,583 forks, 558 watches
created 2011-05-03, last commit 2020-04-25, main language Python
machine-learning, natural-language-processing, network-analysis, python, sentiment-analysis, web-mining, wordnet

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
17 stars per week over 447 weeks
8,034 stars, 1,062 forks, 272 watches
created 2013-06-30, last commit 2021-10-22, main language Python
natural-language-processing, nlp, nltk, pattern, python, python-2, python-3

Multilingual Sentence & Image Embeddings with BERT
52 stars per week over 130 weeks
6,894 stars, 1,363 forks, 103 watches
created 2019-07-24, last commit 2022-01-19, main language Python

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
74 stars per week over 81 weeks
6,034 stars, 484 forks, 149 watches
created 2020-07-05, last commit 2022-01-17, main language Python
gpt, gpt-2, gpt-3, language-model, transformers

Open source annotation tool for machine learning practitioners.
29 stars per week over 193 weeks
5,712 stars, 1,231 forks, 116 watches
created 2018-05-09, last commit 2022-01-22, main language Python
annotation-tool, data-labeling, dataset, datasets, machine-learning, natural-language-processing, nuxt, nuxtjs, python, text-annotation, vue, vuejs

Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
19 stars per week over 232 weeks
4,617 stars, 736 forks, 142 watches
created 2017-08-07, last commit 2020-07-14, main language Python
deep-learning, keras, python, tensorflow, text-generation

Reading Wikipedia to Answer Open-Domain Questions
17 stars per week over 237 weeks
4,190 stars, 884 forks, 171 watches
created 2017-07-07, last commit 2021-05-18, main language Python

Model parallel transformers in JAX and Haiku
83 stars per week over 45 weeks
3,766 stars, 455 forks, 66 watches
created 2021-03-13, last commit 2022-01-19, main language Python

A PyTorch-based Speech Toolkit
40 stars per week over 90 weeks
3,633 stars, 657 forks, 104 watches
created 2020-04-28, last commit 2022-01-18, main language Python
asr, audio, audio-processing, deep-learning, huggingface, language-model, pytorch, speaker-diarization, speaker-recognition, speaker-verification, speech-enhancement, speech-processing, speech-recognition, speech-separation, speech-to-text, speech-toolkit, speechrecognition, spoken-language-understanding, transformers, voice-recognition

Data augmentation for NLP
19 stars per week over 148 weeks
2,874 stars, 329 forks, 35 watches
created 2019-03-21, last commit 2022-01-04, main language Jupyter Notebook
adversarial-attacks, adversarial-example, ai, artificial-intelligence, augmentation, data-science, machine-learning, ml, natural-language-processing, nlp

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts
19 stars per week over 145 weeks
2,853 stars, 608 forks, 75 watches
created 2019-04-13, last commit 2021-10-18, main language Python
openai, tensorflow, text-generation, textgenrnn

A Unified Toolkit for Deep Learning Based Document Image Analysis
32 stars per week over 84 weeks
2,781 stars, 266 forks, 54 watches
created 2020-06-10, last commit 2022-01-12, main language Python
computer-vision, deep-learning, detectron2, document-image-processing, document-layout-analysis, layout-analysis, layout-detection, layout-parser, object-detection, ocr

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
10 stars per week over 246 weeks
2,591 stars, 208 forks, 59 watches
created 2017-05-05, last commit 2021-11-29, main language Python
algorithm, algorithms, damerau-levenshtein, damerau-levenshtein-distance, diff, distance, distance-calculation, hamming-distance, jellyfish, levenshtein, levenshtein-distance, python, textdistance

✨Fast Coreference Resolution in spaCy with Neural Networks
10 stars per week over 237 weeks
2,470 stars, 431 forks, 90 watches
created 2017-07-03, last commit 2021-06-22, main language C
coreference, coreference-resolution, machine-learning, neural-networks, nlp, python, pytorch, spacy, spacy-extension, spacy-pipeline

Text preprocessing, representation and visualization from zero to hero.
25 stars per week over 93 weeks
2,423 stars, 210 forks, 44 watches
created 2020-04-06, last commit 2021-07-19, main language Python
machine-learning, nlp, nlp-pipeline, text-clustering, text-mining, text-preprocessing, text-representation, text-visualization, texthero, word-embeddings

LightSeq: A High Performance Library for Sequence Processing and Generation
17 stars per week over 111 weeks
1,958 stars, 205 forks, 43 watches
created 2019-12-06, last commit 2022-01-10, main language Cuda
accelerate, bart, beam-search, bert, cuda, diverse-decoding, gpt, inference, multilingual-nmt, sampling, training, transformer

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
26 stars per week over 69 weeks
1,836 stars, 266 forks, 29 watches
created 2020-09-22, last commit 2022-01-03, main language Python
bert, ldavis, machine-learning, nlp, sentence-embeddings, topic, topic-modeling, topic-modelling, topic-models, transformers

🎐 a python library for doing approximate and phonetic matching of strings.
2.67 stars per week over 602 weeks
1,607 stars, 150 forks, 42 watches
created 2010-07-09, last commit 2022-01-07, main language Python
fuzzy-search, hacktoberfest, hamming, jaro-winkler, levenshtein, metaphone, python, soundex

💡 Build AI-powered semantic search applications
21 stars per week over 76 weeks
1,602 stars, 158 forks, 39 watches
created 2020-08-09, last commit 2022-01-22, main language Python
api, audio-search, cloud-native, contextual-search, deep-learning, document-search, image-search, machine-learning, machine-learning-pipelines, machine-learning-workflows, microservice, neural-search, nlp, python, search, semantic-search, similarity-search, txtai, vector-search, video-search

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
6.23 stars per week over 254 weeks
1,586 stars, 474 forks, 84 watches
created 2017-03-07, last commit 2019-10-02, main language Python
deep-learning, machine-learning, named-entity-recognition, neural-networks, nlp, tensorflow

Top2Vec learns jointly embedded topic, document and word vectors.
15 stars per week over 96 weeks
1,509 stars, 222 forks, 34 watches
created 2020-03-20, last commit 2022-01-12, main language Python
bert, document-embedding, pre-trained-language-models, semantic-search, sentence-encoder, sentence-transformers, text-search, text-semantic-similarity, top2vec, topic-modeling, topic-modelling, topic-search, topic-vector, word-embeddings

A fast, efficient universal vector embedding utility package.
7.35 stars per week over 204 weeks
1,500 stars, 105 forks, 37 watches
created 2018-02-24, last commit 2020-07-17, main language Python
embeddings, fast, fasttext, gensim, glove, machine-learning, machine-learning-library, memory-efficient, natural-language-processing, nlp, python, vectors, word-embeddings, word2vec

A robust Python tool for text-based AI training and generation using GPT-2.
12 stars per week over 108 weeks
1,324 stars, 143 forks, 39 watches
created 2019-12-29, last commit 2021-05-17, main language Python

Renders papers from arXiv as responsive web pages so you don't have to squint at a PDF.
5.54 stars per week over 232 weeks
1,287 stars, 84 forks, 22 watches
created 2017-08-12, last commit 2022-01-18, main language Python
academic-publishing, arxiv, latex, science

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
2.86 stars per week over 395 weeks
1,131 stars, 208 forks, 35 watches
created 2014-06-26, last commit 2021-06-07, main language Python
buffer, covid-19, detection, extraction, memex, mime, nlp, nlp-library, nlp-machine-learning, parse, parser-interface, python, recognition, text-extraction, text-recognition, tika-python, tika-server, tika-server-jar, translation-interface, usc

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy
8.24 stars per week over 130 weeks
1,074 stars, 137 forks, 27 watches
created 2019-07-26, last commit 2022-01-13, main language Python
bert, google, gpt-2, huggingface, language-model, machine-learning, natural-language-processing, natural-language-understanding, nlp, openai, pytorch, pytorch-model, spacy, spacy-extension, spacy-pipeline, transfer-learning, xlnet

💫 Models for the spaCy Natural Language Processing (NLP) library
3.92 stars per week over 253 weeks
995 stars, 233 forks, 41 watches
created 2017-03-14, last commit 2021-11-05, main language Python
machine-learning, machine-learning-models, models, natural-language-processing, nlp, spacy, spacy-models, statistical-models

📝 python package to calculate readability statistics of a text object - paragraphs, sentences, articles.
1.94 stars per week over 396 weeks
768 stars, 137 forks, 18 watches
created 2014-06-18, last commit 2021-10-24, main language Python
flesch-kincaid-grade, flesch-reading-ease, python, readability, smog, textstat

Full text geoparsing as a Python library
2.18 stars per week over 291 weeks
634 stars, 84 forks, 34 watches
created 2016-06-23, last commit 2021-02-01, main language Python
geocoding, geonames, geoparsing, nlp, spacy, toponym-resolution

skweak: A software toolkit for weak supervision applied to NLP tasks
13 stars per week over 44 weeks
620 stars, 39 forks, 19 watches
created 2021-03-16, last commit 2022-01-18, main language Python
data-science, distant-supervision, natural-language-processing, nlp-library, nlp-machine-learning, python, spacy, training-data, weak-supervision

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
3.92 stars per week over 155 weeks
609 stars, 50 forks, 21 watches
created 2019-01-31, last commit 2021-12-13, main language Python
corenlp, data-science, machine-learning, natural-language-processing, nlp, spacy, spacy-pipeline, stanford-corenlp, stanford-machine-learning, stanford-nlp, stanza

A vector database for machine learning embeddings.
8.87 stars per week over 66 weeks
588 stars, 15 forks, 7 watches
created 2020-10-16, last commit 2022-01-20, main language JavaScript
data-science, embeddings, embeddings-similarity, hacktoberfest, machine-learning, vector-database

⚫ A spaCy pipeline and model for NLP on unstructured legal text.
3.54 stars per week over 147 weeks
524 stars, 76 forks, 35 watches
created 2019-03-25, last commit 2021-01-31, main language Python
caselaw, law, legaltech, nlp, spacy-models

👑 spaCy building blocks and visualizers for Streamlit apps
6.04 stars per week over 82 weeks
500 stars, 76 forks, 14 watches
created 2020-06-23, last commit 2021-12-30, main language Python
dependency-parsing, machine-learning, named-entity-recognition, natural-language-processing, ner, nlp, part-of-speech-tagging, spacy, streamlit, text-classification, tokenization, visualizer, visualizers, word-vectors

LexNLP by LexPredict
2.22 stars per week over 225 weeks
499 stars, 140 forks, 49 watches
created 2017-09-30, last commit 2021-09-21, main language HTML
analytics, contracts, data, law, legal, legaltech, linguistics, ml, nlp

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
1.68 stars per week over 241 weeks
405 stars, 47 forks, 11 watches
created 2017-06-11, last commit 2021-02-11, main language Python
python, rule-based, segmentation, sentence, sentence-boundary-detection, sentence-tokenizer

Python client for Dialogflow: Design and integrate a conversational user interface into your applications and devices.
1.54 stars per week over 221 weeks
342 stars, 137 forks, 50 watches
created 2017-10-24, last commit 2022-01-22, main language Python
dialogflow, machine-learning, python

Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
12 stars per week over 22 weeks
290 stars, 47 forks, 11 watches
created 2021-08-16, last commit 2022-01-07, main language Python
code-intelligence, language-model, nlp, programming-language, representation-learning

Fixes contractions such as "you're" to "you are"
0.72 stars per week over 265 weeks
190 stars, 27 forks, 8 watches
created 2016-12-25, last commit 2021-11-15, main language Python

11 stars per week over 13 weeks
155 stars, 19 forks, 3 watches
created 2021-10-20, last commit 2021-11-19, main language Python

Parsers for scientific papers (PDF2JSON and TEX2JSON)
2.22 stars per week over 58 weeks
130 stars, 16 forks, 8 watches
created 2020-12-10, last commit 2021-09-17, main language Python

A small seq2seq punctuator tool based on DistilBERT
0.23 stars per week over 61 weeks
14 stars, 0 forks, 1 watches
created 2020-11-19, last commit 2021-12-14, main language Python
bert, bert-ner, chinese-nlp, deep-learning, nlp, punctuation, pytorch, seq2seq

This file was automatically generated on 2022-01-23.

To curate your own GitHub list, simply clone the repository and change the input CSV file.
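
The "stars per week over N weeks" figures in each entry appear to be the star count divided by the project's age in weeks as of the generation date. The sketch below reproduces the first entry's numbers under that assumption; the function name `stars_per_week` and the rounding choices are illustrative and not taken from the generator itself.

```python
from datetime import date
from typing import Tuple


def stars_per_week(stars: int, created: date, as_of: date) -> Tuple[float, int]:
    """Return (average stars per week, whole weeks since creation).

    Assumption: the rate uses the fractional age in weeks, while the
    displayed week count is truncated to a whole number.
    """
    age_in_weeks = (as_of - created).days / 7
    return stars / age_in_weeks, int(age_in_weeks)


# First entry in the list: 57,243 stars, created 2018-10-29,
# file generated on 2022-01-23.
rate, weeks = stars_per_week(57_243, date(2018, 10, 29), date(2022, 1, 23))
print(f"{rate:.0f} stars per week over {weeks} weeks")
# prints "339 stars per week over 168 weeks"
```

Because the rate is averaged over a project's whole lifetime, older repositories show lower figures than their recent growth alone might suggest.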

Inspired by: