Crazy Awesome Python

A selection of 52 curated NLP Python libraries and frameworks, ordered by stars.

Check out the interactive version, which you can filter and sort: https://www.awesomepython.org/

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
https://github.com/huggingface/transformers
339 stars per week over 168 weeks
57,243 stars, 13,520 forks, 792 watches
created 2018-10-29, last commit 2022-01-21, main language Python
bert, deep-learning, flax, hacktoberfest, jax, language-model, language-models, machine-learning, model-hub, natural-language-processing, nlp, nlp-library, pretrained-models, python, pytorch, pytorch-transformers, seq2seq, speech-recognition, tensorflow, transformer

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
https://github.com/explosion/spaCy
56 stars per week over 394 weeks
22,253 stars, 3,691 forks, 564 watches
created 2014-07-03, last commit 2022-01-21, main language Python
ai, artificial-intelligence, cython, data-science, deep-learning, entity-linking, machine-learning, named-entity-recognition, natural-language-processing, neural-network, neural-networks, nlp, nlp-library, python, spacy, text-classification, tokenization

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
https://github.com/pytorch/fairseq
67 stars per week over 229 weeks
15,465 stars, 4,016 forks, 327 watches
created 2017-08-29, last commit 2022-01-21, main language Python
artificial-intelligence, python, pytorch

Code for the paper "Language Models are Unsupervised Multitask Learners"
https://openai.com/blog/better-language-models/
https://github.com/openai/gpt-2
98 stars per week over 153 weeks
15,196 stars, 3,884 forks, 585 watches
created 2019-02-11, last commit 2020-12-02, main language Python
paper

Topic Modelling for Humans
https://radimrehurek.com/gensim
https://github.com/RaRe-Technologies/gensim
22 stars per week over 571 weeks
12,855 stars, 4,151 forks, 431 watches
created 2011-02-10, last commit 2021-12-24, main language Python
data-mining, data-science, document-similarity, fasttext, gensim, information-retrieval, machine-learning, natural-language-processing, neural-network, nlp, python, topic-modeling, word-embeddings, word-similarity, word2vec

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
https://github.com/huggingface/datasets
124 stars per week over 95 weeks
11,867 stars, 1,447 forks, 241 watches
created 2020-03-26, last commit 2022-01-21, main language Python
computer-vision, datasets, deep-learning, evaluation, machine-learning, metrics, natural-language-processing, nlp, numpy, pandas, pytorch, tensorflow

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://github.com/flairNLP/flair
59 stars per week over 188 weeks
11,167 stars, 1,804 forks, 202 watches
created 2018-06-11, last commit 2022-01-18, main language Python
machine-learning, named-entity-recognition, natural-language-processing, nlp, pytorch, semantic-role-labeling, sequence-labeling, word-embeddings

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
https://github.com/allenai/allennlp
43 stars per week over 244 weeks
10,763 stars, 2,154 forks, 282 watches
created 2017-05-15, last commit 2022-01-14, main language Python
data-science, deep-learning, natural-language-processing, nlp, python, pytorch

NLTK Source
https://www.nltk.org
https://github.com/nltk/nltk
16 stars per week over 645 weeks
10,389 stars, 2,541 forks, 476 watches
created 2009-09-07, last commit 2022-01-11, main language Python
machine-learning, natural-language-processing, nlp, nltk, python

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
https://github.com/facebookresearch/ParlAI
34 stars per week over 247 weeks
8,578 stars, 1,742 forks, 294 watches
created 2017-04-24, last commit 2022-01-21, main language Python

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
https://github.com/clips/pattern
14 stars per week over 559 weeks
8,138 stars, 1,583 forks, 558 watches
created 2011-05-03, last commit 2020-04-25, main language Python
machine-learning, natural-language-processing, network-analysis, python, sentiment-analysis, web-mining, wordnet

Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
https://github.com/sloria/TextBlob
17 stars per week over 447 weeks
8,034 stars, 1,062 forks, 272 watches
created 2013-06-30, last commit 2021-10-22, main language Python
natural-language-processing, nlp, nltk, pattern, python, python-2, python-3

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
https://github.com/UKPLab/sentence-transformers
52 stars per week over 130 weeks
6,894 stars, 1,363 forks, 103 watches
created 2019-07-24, last commit 2022-01-19, main language Python

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
https://github.com/EleutherAI/gpt-neo
74 stars per week over 81 weeks
6,034 stars, 484 forks, 149 watches
created 2020-07-05, last commit 2022-01-17, main language Python
gpt, gpt-2, gpt-3, language-model, transformers

Open source annotation tool for machine learning practitioners.
https://doccano.herokuapp.com
https://github.com/doccano/doccano
29 stars per week over 193 weeks
5,712 stars, 1,231 forks, 116 watches
created 2018-05-09, last commit 2022-01-22, main language Python
annotation-tool, data-labeling, dataset, datasets, machine-learning, natural-language-processing, nuxt, nuxtjs, python, text-annotation, vue, vuejs

Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
https://github.com/minimaxir/textgenrnn
19 stars per week over 232 weeks
4,617 stars, 736 forks, 142 watches
created 2017-08-07, last commit 2020-07-14, main language Python
deep-learning, keras, python, tensorflow, text-generation

Reading Wikipedia to Answer Open-Domain Questions
https://github.com/facebookresearch/DrQA
17 stars per week over 237 weeks
4,190 stars, 884 forks, 171 watches
created 2017-07-07, last commit 2021-05-18, main language Python

Model parallel transformers in JAX and Haiku
https://github.com/kingoflolz/mesh-transformer-jax
83 stars per week over 45 weeks
3,766 stars, 455 forks, 66 watches
created 2021-03-13, last commit 2022-01-19, main language Python

A PyTorch-based Speech Toolkit
http://speechbrain.github.io
https://github.com/speechbrain/speechbrain
40 stars per week over 90 weeks
3,633 stars, 657 forks, 104 watches
created 2020-04-28, last commit 2022-01-18, main language Python
asr, audio, audio-processing, deep-learning, huggingface, language-model, pytorch, speaker-diarization, speaker-recognition, speaker-verification, speech-enhancement, speech-processing, speech-recognition, speech-separation, speech-to-text, speech-toolkit, speechrecognition, spoken-language-understanding, transformers, voice-recognition

Data augmentation for NLP
https://makcedward.github.io/
https://github.com/makcedward/nlpaug
19 stars per week over 148 weeks
2,874 stars, 329 forks, 35 watches
created 2019-03-21, last commit 2022-01-04, main language Jupyter Notebook
adversarial-attacks, adversarial-example, ai, artificial-intelligence, augmentation, data-science, machine-learning, ml, natural-language-processing, nlp

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts
https://github.com/minimaxir/gpt-2-simple
19 stars per week over 145 weeks
2,853 stars, 608 forks, 75 watches
created 2019-04-13, last commit 2021-10-18, main language Python
openai, tensorflow, text-generation, textgenrnn

A Unified Toolkit for Deep Learning Based Document Image Analysis
https://layout-parser.github.io/
https://github.com/Layout-Parser/layout-parser
32 stars per week over 84 weeks
2,781 stars, 266 forks, 54 watches
created 2020-06-10, last commit 2022-01-12, main language Python
computer-vision, deep-learning, detectron2, document-image-processing, document-layout-analysis, layout-analysis, layout-detection, layout-parser, object-detection, ocr

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
https://github.com/life4/textdistance
10 stars per week over 246 weeks
2,591 stars, 208 forks, 59 watches
created 2017-05-05, last commit 2021-11-29, main language Python
algorithm, algorithms, damerau-levenshtein, damerau-levenshtein-distance, diff, distance, distance-calculation, hamming-distance, jellyfish, levenshtein, levenshtein-distance, python, textdistance

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
https://github.com/huggingface/neuralcoref
10 stars per week over 237 weeks
2,470 stars, 431 forks, 90 watches
created 2017-07-03, last commit 2021-06-22, main language C
coreference, coreference-resolution, machine-learning, neural-networks, nlp, python, pytorch, spacy, spacy-extension, spacy-pipeline

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
https://github.com/jbesomi/texthero
25 stars per week over 93 weeks
2,423 stars, 210 forks, 44 watches
created 2020-04-06, last commit 2021-07-19, main language Python
machine-learning, nlp, nlp-pipeline, text-clustering, text-mining, text-preprocessing, text-representation, text-visualization, texthero, word-embeddings

LightSeq: A High Performance Library for Sequence Processing and Generation
https://github.com/bytedance/lightseq
17 stars per week over 111 weeks
1,958 stars, 205 forks, 43 watches
created 2019-12-06, last commit 2022-01-10, main language Cuda
accelerate, bart, beam-search, bert, cuda, diverse-decoding, gpt, inference, multilingual-nmt, sampling, training, transformer

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
https://github.com/MaartenGr/BERTopic
26 stars per week over 69 weeks
1,836 stars, 266 forks, 29 watches
created 2020-09-22, last commit 2022-01-03, main language Python
bert, ldavis, machine-learning, nlp, sentence-embeddings, topic, topic-modeling, topic-modelling, topic-models, transformers

🎐 A Python library for approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
https://github.com/jamesturk/jellyfish
2.67 stars per week over 602 weeks
1,607 stars, 150 forks, 42 watches
created 2010-07-09, last commit 2022-01-07, main language Python
fuzzy-search, hacktoberfest, hamming, jaro-winkler, levenshtein, metaphone, python, soundex

💡 Build AI-powered semantic search applications
https://neuml.github.io/txtai
https://github.com/neuml/txtai
21 stars per week over 76 weeks
1,602 stars, 158 forks, 39 watches
created 2020-08-09, last commit 2022-01-22, main language Python
api, audio-search, cloud-native, contextual-search, deep-learning, document-search, image-search, machine-learning, machine-learning-pipelines, machine-learning-workflows, microservice, neural-search, nlp, python, search, semantic-search, similarity-search, txtai, vector-search, video-search

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
https://github.com/Franck-Dernoncourt/NeuroNER
6.23 stars per week over 254 weeks
1,586 stars, 474 forks, 84 watches
created 2017-03-07, last commit 2019-10-02, main language Python
deep-learning, machine-learning, named-entity-recognition, neural-networks, nlp, tensorflow

Top2Vec learns jointly embedded topic, document and word vectors.
https://github.com/ddangelov/Top2Vec
15 stars per week over 96 weeks
1,509 stars, 222 forks, 34 watches
created 2020-03-20, last commit 2022-01-12, main language Python
bert, document-embedding, pre-trained-language-models, semantic-search, sentence-encoder, sentence-transformers, text-search, text-semantic-similarity, top2vec, topic-modeling, topic-modelling, topic-search, topic-vector, word-embeddings

A fast, efficient universal vector embedding utility package.
https://github.com/plasticityai/magnitude
7.35 stars per week over 204 weeks
1,500 stars, 105 forks, 37 watches
created 2018-02-24, last commit 2020-07-17, main language Python
embeddings, fast, fasttext, gensim, glove, machine-learning, machine-learning-library, memory-efficient, natural-language-processing, nlp, python, vectors, word-embeddings, word2vec

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
https://github.com/minimaxir/aitextgen
12 stars per week over 108 weeks
1,324 stars, 143 forks, 39 watches
created 2019-12-29, last commit 2021-05-17, main language Python

Renders papers from arXiv as responsive web pages so you don't have to squint at a PDF.
https://www.arxiv-vanity.com
https://github.com/arxiv-vanity/arxiv-vanity
5.54 stars per week over 232 weeks
1,287 stars, 84 forks, 22 watches
created 2017-08-12, last commit 2022-01-18, main language Python
academic-publishing, arxiv, latex, science

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
https://github.com/chrismattmann/tika-python
2.86 stars per week over 395 weeks
1,131 stars, 208 forks, 35 watches
created 2014-06-26, last commit 2021-06-07, main language Python
buffer, covid-19, detection, extraction, memex, mime, nlp, nlp-library, nlp-machine-learning, parse, parser-interface, python, recognition, text-extraction, text-recognition, tika-python, tika-server, tika-server-jar, translation-interface, usc

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy
https://spacy.io/usage/embeddings-transformers
https://github.com/explosion/spacy-transformers
8.24 stars per week over 130 weeks
1,074 stars, 137 forks, 27 watches
created 2019-07-26, last commit 2022-01-13, main language Python
bert, google, gpt-2, huggingface, language-model, machine-learning, natural-language-processing, natural-language-understanding, nlp, openai, pytorch, pytorch-model, spacy, spacy-extension, spacy-pipeline, transfer-learning, xlnet

💫 Models for the spaCy Natural Language Processing (NLP) library
https://spacy.io
https://github.com/explosion/spacy-models
3.92 stars per week over 253 weeks
995 stars, 233 forks, 41 watches
created 2017-03-14, last commit 2021-11-05, main language Python
machine-learning, machine-learning-models, models, natural-language-processing, nlp, spacy, spacy-models, statistical-models

📝 Python package to calculate readability statistics of a text object (paragraphs, sentences, articles).
https://github.com/shivam5992/textstat
1.94 stars per week over 396 weeks
768 stars, 137 forks, 18 watches
created 2014-06-18, last commit 2021-10-24, main language Python
flesch-kincaid-grade, flesch-reading-ease, python, readability, smog, textstat

Full text geoparsing as a Python library
https://github.com/openeventdata/mordecai
2.18 stars per week over 291 weeks
634 stars, 84 forks, 34 watches
created 2016-06-23, last commit 2021-02-01, main language Python
geocoding, geonames, geoparsing, nlp, spacy, toponym-resolution

skweak: A software toolkit for weak supervision applied to NLP tasks
https://github.com/NorskRegnesentral/skweak
13 stars per week over 44 weeks
620 stars, 39 forks, 19 watches
created 2021-03-16, last commit 2022-01-18, main language Python
data-science, distant-supervision, natural-language-processing, nlp-library, nlp-machine-learning, python, spacy, training-data, weak-supervision

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
https://github.com/explosion/spacy-stanza
3.92 stars per week over 155 weeks
609 stars, 50 forks, 21 watches
created 2019-01-31, last commit 2021-12-13, main language Python
corenlp, data-science, machine-learning, natural-language-processing, nlp, spacy, spacy-pipeline, stanford-corenlp, stanford-machine-learning, stanford-nlp, stanza

A vector database for machine learning embeddings.
https://www.featureform.com
https://github.com/featureform/embeddinghub
8.87 stars per week over 66 weeks
588 stars, 15 forks, 7 watches
created 2020-10-16, last commit 2022-01-20, main language JavaScript
data-science, embeddings, embeddings-similarity, hacktoberfest, machine-learning, vector-database

⚫ A spaCy pipeline and model for NLP on unstructured legal text.
https://research.iclr.co.uk
https://github.com/ICLRandD/Blackstone
3.54 stars per week over 147 weeks
524 stars, 76 forks, 35 watches
created 2019-03-25, last commit 2021-01-31, main language Python
caselaw, law, legaltech, nlp, spacy-models

👑 spaCy building blocks and visualizers for Streamlit apps
https://share.streamlit.io/ines/spacy-streamlit-demo/master/app.py
https://github.com/explosion/spacy-streamlit
6.04 stars per week over 82 weeks
500 stars, 76 forks, 14 watches
created 2020-06-23, last commit 2021-12-30, main language Python
dependency-parsing, machine-learning, named-entity-recognition, natural-language-processing, ner, nlp, part-of-speech-tagging, spacy, streamlit, text-classification, tokenization, visualizer, visualizers, word-vectors

LexNLP by LexPredict
https://github.com/LexPredict/lexpredict-lexnlp
2.22 stars per week over 225 weeks
499 stars, 140 forks, 49 watches
created 2017-09-30, last commit 2021-09-21, main language HTML
analytics, contracts, data, law, legal, legaltech, linguistics, ml, nlp

🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detector that works out-of-the-box.
https://github.com/nipunsadvilkar/pysbd
1.68 stars per week over 241 weeks
405 stars, 47 forks, 11 watches
created 2017-06-11, last commit 2021-02-11, main language Python
python, rule-based, segmentation, sentence, sentence-boundary-detection, sentence-tokenizer

Python client for Dialogflow: Design and integrate a conversational user interface into your applications and devices.
https://dialogflow.com/
https://github.com/dialogflow/dialogflow-python-client-v2
1.54 stars per week over 221 weeks
342 stars, 137 forks, 50 watches
created 2017-10-24, last commit 2022-01-22, main language Python
dialogflow, machine-learning, python

Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
https://arxiv.org/abs/2109.00859
https://github.com/salesforce/CodeT5
12 stars per week over 22 weeks
290 stars, 47 forks, 11 watches
created 2021-08-16, last commit 2022-01-07, main language Python
code-intelligence, language-model, nlp, programming-language, representation-learning

Expands contractions, such as "you're" to "you are"
https://github.com/kootenpv/contractions
0.72 stars per week over 265 weeks
190 stars, 27 forks, 8 watches
created 2016-12-25, last commit 2021-11-15, main language Python

https://github.com/openai/grade-school-math
11 stars per week over 13 weeks
155 stars, 19 forks, 3 watches
created 2021-10-20, last commit 2021-11-19, main language Python

Parsers for scientific papers (PDF2JSON and TEX2JSON)
https://github.com/allenai/s2orc-doc2json
2.22 stars per week over 58 weeks
130 stars, 16 forks, 8 watches
created 2020-12-10, last commit 2021-09-17, main language Python

A small seq2seq punctuator tool based on DistilBERT
https://github.com/FerdinandZhong/punctuator
0.23 stars per week over 61 weeks
14 stars, 0 forks, 1 watches
created 2020-11-19, last commit 2021-12-14, main language Python
bert, bert-ner, chinese-nlp, deep-learning, nlp, punctuation, pytorch, seq2seq

This file was automatically generated on 2022-01-23.

To curate your own GitHub list, simply clone the repository and change the input CSV file.
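
The "stars per week over N weeks" line on each entry appears to be the total star count divided by the repository's age at generation time. The sketch below shows one way to reproduce it. It is an assumption inferred from the published numbers, not the generator's actual code, and the function name `weekly_star_rate` is hypothetical.

```python
from datetime import date


def weekly_star_rate(stars, created, generated):
    """Return (whole weeks of age, stars per week).

    Assumption: age is measured in fractional weeks between the
    creation date and the date the list was generated, and the
    week count shown in the list is that age rounded down.
    """
    age_in_weeks = (generated - created).days / 7
    return int(age_in_weeks), stars / age_in_weeks


# Figures taken from the Transformers entry and the generation date above.
weeks, rate = weekly_star_rate(57_243, date(2018, 10, 29), date(2022, 1, 23))
print(f"{rate:.0f} stars per week over {weeks} weeks")
# prints "339 stars per week over 168 weeks"
```

Under this assumption the result matches the Transformers and spaCy entries, which suggests the rate is computed from the fractional age rather than from the rounded week count.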

Inspired by:
https://github.com/vinta/awesome-python
https://github.com/trananhkma/fucking-awesome-python