Welcome to Zipfian Academy's Machine Learning workshop. Thank you for attending; we hope you enjoyed the lecture (we sure had fun presenting). This exercise gives you hands-on experience with the concepts covered and will help solidify your understanding of the data science process.
As always, feel free to email us about anything at all (questions, issues, concerns, feedback) at class@zipfianacademy.com. We would love to hear how you liked the class, whether the content was technical enough (or too technical), and which other topics you wish had been covered.
We hope you have fun with this exercise! If you want to learn more or dive deeper into any of these subjects, we are always happy to discuss them (and can talk about them for days). And if you just can't get enough of this stuff (and want a completely immersive environment), you can apply for our intensive data science bootcamp starting January 20th.
This assignment assumes a basic familiarity with Python and is intended to teach you how to leverage it for data science. If you do not feel comfortable enough with Python (and programming in general) I recommend these (freely available) resources:
- Think Python
- MIT Open Courseware: A Gentle Introduction to Programming Using Python
- Learn Python the Hard Way
- Python Koans
This exercise is written in an IPython notebook and uses many of the wonderful libraries from the scientific Python community. While you do not need IPython locally to complete the exercise (there are PDF and .ipynb versions of these instructions), I recommend setting it up on your computer if you plan to continue learning and playing with data. IPython notebooks provide an interface not only to run (and debug) code interactively in a web browser, but also to document your work as you go along. Below are the steps to set up a scientific Python environment on your computer to complete this assignment (and those of all future classes). If you have tips or suggestions to make this process easier, please reach out either on Piazza or via email.
- Git: Distributed Version Control to keep track of changes and updates to files/data.
- virtualenv: Python environment isolation to help manage dependencies with packages and versions.
- pythonbrew: Manage and install multiple versions of Python. Can be handy if you want to experiment with Python 3.x.
- Enthought Python Distribution: A freely available packaged environment for scientific Python.
- Scipy Superpack: Mac OS X only, but a one-line shell script that installs all the fundamental scientific computing packages.
- pandas: Data analysis and statistical library providing functionality in Python similar to R.
If you are on OS X, you may need to install Xcode (with the command line utilities) or install gcc directly.
Tutorial walking you through the installation of these tools, with tests to make sure it all works.
In this tutorial we will use the Grockit question logs dataset to predict the probability that a student answers their next question correctly. We will also cluster the data to find similar students. Once the classifier tells us which students are likely to perform worse than the others, clustering lets us recommend similar students who performed well as study partners.
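To make the recommendation idea concrete, here is a minimal sketch under stated assumptions. The student names, feature vectors, and predicted probabilities are made up for illustration (they are not taken from the Grockit data), the helper `recommend_partners` is hypothetical, and a simple nearest-neighbour lookup stands in for a full clustering step.

```python
import math

# Hypothetical data: student -> (feature vector, predicted P(next answer correct)).
# The features stand in for whatever gets engineered from the question logs.
students = {
    "ana":  ([0.9, 0.8], 0.85),
    "ben":  ([0.8, 0.9], 0.35),
    "chen": ([0.1, 0.2], 0.30),
    "dee":  ([0.2, 0.1], 0.75),
}

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend_partners(students, threshold=0.5):
    """Pair each student predicted below `threshold` with the most
    similar student predicted at or above it."""
    strong = {name: feats for name, (feats, prob) in students.items()
              if prob >= threshold}
    pairs = {}
    for name, (feats, prob) in students.items():
        if prob < threshold and strong:
            pairs[name] = min(strong, key=lambda s: euclidean(feats, strong[s]))
    return pairs

pairs = recommend_partners(students)
print(pairs)  # {'ben': 'ana', 'chen': 'dee'}
```

In the exercise itself, the probabilities would come from the trained classifier and the notion of "similar" from the clustering step; the 0.5 threshold here is an arbitrary choice.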
- scikit-learn tutorial
- Kaggle: Getting started with Python for Data Science
- AMPlab: Machine Learning crash course (part 1)
- AMPlab: Machine Learning crash course (part 2)
- How Khan Academy is using Machine Learning to Assess Student Mastery
- Machine Learning with Python -- Logistic Regression
- Evaluation: Cross Validation
- Get the Data
- Preparation -- vectorization and feature preparation (engineering)
- Train -- fit/build model from known labeled data
- Test -- evaluate model with cross validation
- Predict -- run model on data with unknown labels
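The five stages above can be sketched end to end in plain Python. This is an illustrative stand-in written from scratch, not the scikit-learn workflow used in the exercise: the data is synthetic (a single made-up "hours studied" feature), and the function names `obtain`, `prepare`, `train`, `cross_validate`, and `predict_proba` are hypothetical helpers chosen to mirror the stage names.

```python
import math
import random

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def obtain(n=200, seed=0):
    """Get the data: synthetic (hours, correct) records, not real logs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hours = rng.uniform(0, 10)
        # Assumption: more study time makes a correct answer more likely.
        correct = 1 if rng.random() < sigmoid(hours - 5) else 0
        rows.append({"hours": hours, "correct": correct})
    return rows

def prepare(rows):
    """Vectorize: each record becomes [bias, scaled feature] plus a label."""
    X = [[1.0, r["hours"] / 10.0] for r in rows]
    y = [r["correct"] for r in rows]
    return X, y

def predict_proba(w, x):
    """Predict: probability of a correct answer for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def cross_validate(X, y, k=5):
    """Test: mean accuracy over k folds, each held out once."""
    scores = []
    for fold in range(k):
        held = set(range(fold, len(X), k))
        X_train = [x for i, x in enumerate(X) if i not in held]
        y_train = [t for i, t in enumerate(y) if i not in held]
        w = train(X_train, y_train)
        hits = sum((predict_proba(w, X[i]) >= 0.5) == (y[i] == 1) for i in held)
        scores.append(hits / len(held))
    return sum(scores) / k

X, y = prepare(obtain())
accuracy = cross_validate(X, y)
weights = train(X, y)
p_low = predict_proba(weights, [1.0, 0.1])   # a student with 1 hour of study
p_high = predict_proba(weights, [1.0, 0.9])  # a student with 9 hours of study
print(accuracy, p_low, p_high)
```

Note that the model used for prediction is retrained on all of the labeled data after cross validation; the folds are only used to estimate how well the approach generalizes.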
- Understand the various stages of the ML pipeline
  - Obtain
  - Prepare
  - Train
  - Test
  - Predict
- Get experience building models with scikit-learn
- Decision Boundaries
- Cost Function
- Logistic Regression and the sigmoid function
- Cross Validation
  - K-fold
  - Hold out
- Optimization functions
- Classification vs. Regression
- Supervised vs. Unsupervised learning
- K-means clustering
- Distance functions (similarity)
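As a small illustration of the last two concepts, the sketch below implements k-means (Lloyd's algorithm) with Euclidean distance as the similarity measure. It assumes made-up 2-D points and seeds the centroids with the first `k` points so the run is reproducible; real implementations usually pick random or k-means++ starting centroids.

```python
import math

def euclidean(a, b):
    """Distance function: smaller values mean more similar points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical points forming two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Swapping `euclidean` for another distance function (for example Manhattan distance) changes which points count as similar, and so can change the resulting clusters.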