This repository compiles resources on Data Science and Machine Learning: curated videos, blog posts, textbooks, GitHub repositories, and more. If you find it helpful, please consider giving it a star! Your support helps me improve and maintain this project. Thank you for your time!
If you want to contribute to this list (please do), send me a pull request or contact me.
- Advanced NLP
- ❤️ Foundations of Deep RL
- Stanford CS234: Reinforcement Learning | Winter 2019 - YouTube
- CS324 - Large Language Models
- Alpaca: A Strong, Replicable Instruction-Following Model
- LLaMA: A foundational, 65-billion-parameter large language model
- Comments from Reddit
- Cabrita: A Portuguese instruction-following LLaMA fine-tuned on Portuguese data
- ❤️ Exploring transformer models
- A visual introduction to machine learning
- How GPT3 Works - Visualizations and Animations
- The Illustrated Transformer
- ❤️ The Illustrated Retrieval Transformer
- Seeing Theory - A visual introduction to probability and statistics. It also includes a textbook of the same name
- wevi: Word Embedding Visual Inspector
- AI/ML Cheatsheets for Stanford/MIT Courses
- CS 229 - Machine Learning: A set of illustrated Machine Learning cheatsheets covering the content of Stanford's CS 229 class
- B2W Reviews - A very large dataset of customer reviews released by Americanas S.A., available on the Hugging Face Hub
- IMDb PT-BR - A version of IMDb translated to Brazilian Portuguese
- Yelp 2018 in CSV
- Amazon Reviews
- Distill.pub - A modern medium for presenting research that showcases AI/ML concepts in clear, dynamic and vivid form
- Christopher Olah's Blog - A machine learning researcher who likes to understand things clearly, and explain them well
- Jay Alammar - Visualizing machine learning one concept at a time
- Multiple Classifier Systems - a brief introduction
- Automatic Text Summarization with Machine Learning - An overview
- Automatic Text Summarization Made Simple with Python
- Autoregressive Models in Deep Learning - a Brief Survey by George Ho
- A Brief Timeline of NLP from Bag of Words to the Transformer Family
- The Annotated Transformer
- HuggingFace - Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models
A growing curated list of machine learning books.
- Pattern Recognition and Machine Learning - Christopher M. Bishop
- Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems
- Deep Learning - Ian Goodfellow (print)
- Mathematics for Machine Learning
- ❤️ Redes Neurais: Princípios e Prática
- Graph-Powered Machine Learning (English Edition)
- ❤️ Python for Data Analysis, 3E
- PlotNeuralNet - generates LaTeX code for drawing neural networks for publications and presentations
- NN-SVG - generates SVGs for neural net architecture schematics
- State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
- BERTopic - A topic modeling technique that uses transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions
- LSA-Text-Summarization - Implements the summarization of text documents using Latent Semantic Analysis
- extractive-text-summarization - Extractive text summarization based on word frequencies and spaCy
- SimCSE - Simple Contrastive Learning of Sentence Embeddings
- Koan - A word2vec negative sampling implementation with correct CBOW update
- Apache OpenNLP - a machine learning based toolkit for the processing of natural language text
- sense2vec - Contextually-keyed word vectors
- Mega - Moving Average Equipped Gated Attention. Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism.
Graph AI, which applies machine learning methods to learn patterns in graph-structured data, is an active research topic. A graph is a data structure that models a set of objects (nodes) and their relationships (edges). The power of the graph formalism lies both in its focus on relationships between points and in its generality. Research on machine learning over graphs has received growing attention because of the expressive power of graphs. As a non-Euclidean data structure, a graph lends itself to analyses that focus on node classification, link prediction, and clustering.
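To make the node/edge vocabulary concrete, here is a minimal sketch in plain Python. It is not taken from any library listed here: the helper names `build_graph` and `common_neighbors` and the toy edge list are hypothetical. It stores an undirected graph as adjacency sets and scores a candidate link by counting shared neighbors, one of the simplest link-prediction heuristics.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency-set representation from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    """Score a candidate link (u, v) by the number of neighbors they share."""
    # .get avoids creating empty entries for unknown nodes in the defaultdict.
    return len(adj.get(u, set()) & adj.get(v, set()))


# Toy graph (illustrative data only): a triangle a-b-c plus a pendant node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
graph = build_graph(edges)

# "a" and "d" are not connected, but both are neighbors of "c".
score = common_neighbors(graph, "a", "d")
```

The libraries listed below replace this kind of hand-written heuristic with learned node representations, but the underlying data model (nodes plus edges) is the same.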
- ❤️ PyTorch Geometric - A library built upon PyTorch to easily write and train Graph Neural Networks (GNNs)
- ❤️ DGL library - An easy-to-use, high performance and scalable Python package for deep learning on graphs
- ❤️ PyGOD - a Python library for graph outlier detection (anomaly detection)
- ❤️ Graph-MLPMixer - A Generalization of ViT/MLP-Mixer to Graphs
- ❤️ StellarGraph - A Python library for machine learning on graphs and networks which offers state-of-the-art algorithms for graph machine learning, making it easy to discover patterns and answer questions about graph-structured data
- cuGraph - A collection of packages focused on GPU-accelerated graph analytics
- igraph - a fast and open-source C library to manipulate and analyze graphs with interfaces in Python, R and C++
- Karate Club - an unsupervised machine learning extension library for NetworkX. Karate Club consists of state-of-the-art methods to do unsupervised learning on graph-structured data. According to the authors, it is a Swiss Army knife for small-scale graph mining research.
The Rust ML landscape is still young and is best described as experimental. Nevertheless, Rust's performance, flexibility, and unique approach to abstractions make it a promising language for building Machine Learning backends, a space currently dominated by C/C++.
- huggingface/tokenizers - The core of tokenizers, written in Rust with a focus on performance and versatility
- DimaKudosh/word2vec - Rust interface to word2vec
- Linfa - A comprehensive toolkit for building Machine Learning applications in Rust, in the spirit of Python's scikit-learn
- 🔥 Burn - This library aims to be a comprehensive deep-learning framework in Rust that offers exceptional flexibility for both researchers and practitioners
Fast run time is essential in machine learning, which is why C++ is well suited to machine learning and large-scale AI applications. Many machine learning engines are implemented in C++.
- ❤️ mlpack - an intuitive, fast, and flexible header-only C++ machine learning library with bindings to other languages
- ensmallen - a high-quality C++ library for non-linear numerical optimization, which provides many types of optimizers that can be used for virtually any numerical optimization task
- Armadillo - C++ library for linear algebra & scientific computing. Useful for algorithm development directly in C++ or quick conversion of research code into production environments
- NumCpp - A Templatized Header Only C++ Implementation of the Python NumPy Library
- DLib - C++ toolkit containing machine learning algorithms used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments
- Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind
- DyNet - A dynamic neural network library that works well with networks whose structure changes for every training instance. Written in C++ with bindings in Python
- cuDF - A GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It is built on the Apache Arrow columnar memory format and offers a pandas-like API that will be familiar to data engineers.
- Polars: Blazingly fast DataFrames in Rust - A DataFrame library implemented in Rust using the Apache Arrow columnar format as its memory model.
- statannotations - Python package to optionally compute statistical tests and add statistical annotations on plots generated with seaborn
- ❤️ statsmodels - Python package that complements scipy for statistical computations, allowing users to explore data, estimate statistical models, perform statistical tests, and compute descriptive statistics
- sktime - a library for time series analysis in Python. It provides a unified interface for multiple time series learning tasks
- Darts - a Python library for user-friendly forecasting and anomaly detection on time series
- Prophet - A procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects.
- CausalML - a Python package that provides a suite of uplift modeling and causal inference methods using machine learning algorithms.
- BiomedSciAI/causallib - Enables estimating the causal effect of an intervention on some outcome from real-world non-experimental observational data
- SAINT - Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training.
- ARM-Net - Adaptive Relation Modeling Network for Structured Data.
- TabTransformer - An implementation in Keras of TabTransformer, an attention network for tabular data.