This repo contains code snippets, each with a little theory and explanation, that can be handy for beginning data scientists. I created it while attending the Galvanize Data Science Immersive program in Seattle. The code is in Python, specifically in IPython notebooks, as they are easy to view on GitHub.
Pull requests with updates are welcome!
- Hypothesis testing - t-test, z-test, A/B testing. hypothesis_testing.ipynb
- Probability and statistics - combinatorics, conditional probability, statistical distributions. probability_and_statistics.ipynb
- Gradient descent - gradient_descent.ipynb
- ML algorithms - linear regression, logistic regression, decision trees, random forests, gradient boosting, PCA, SVD, NMF and more. ml_algorithms.ipynb
- Recommenders - different GraphLab recommenders. recommenders.ipynb
NLP
- Doc2Vec - document similarity search using gensim. doc2vec.ipynb
- Text summarization - text summarization and keyword extraction using gensim. text_summarization.ipynb
- NearPy - locality-sensitive hashing (LSH) for approximate nearest neighbor search. nearpy.ipynb
- Pipeline - pipeline, feature union, grid search. pipeline.ipynb
- MapReduce - Hadoop, Spark. map_reduce.ipynb
- Scraping and MongoDB - requests, BeautifulSoup, pymongo. scraping_mongo.ipynb
- AWS Deployment - setting up an EC2 instance with PostgreSQL on it, running a Flask app. aws.md