
Data Science Preparation

P.S. Use Ctrl+F to search for relevant keywords.

Preliminaries

If you are just beginning with ML & Data Science, a good first place to start is Andrew Ng's Machine Learning course.

If you have already done the Andrew Ng course, you might want to brush up on the concepts through these notes.

If you want to make a list of important interview topics, head over to this article.

Courses & Resources

Data Science Practice Questions

If you are clueless about which topic to start from in data science, but have some basic idea about ML, then simply give these questions a go. If you get a bunch of them wrong, you'll know where to start your preparation :)

SQL

Quickly go through the tutorial pages; you need not cram anything. Soon after, solve all the Hackerrank questions (in sequence, without skipping). Refer back to the tutorials or look up the discussion forum when stuck. You will learn more effectively this way, and applying the various clauses will boost your recall.

Probability

Statistics

Why divide by n-1 in sample standard deviation
  • Let f(v) = sum( (x_i - v)^2 ) / n. Setting f'(v) = 0, the minimum occurs at v = sum(x_i)/n = sample mean
  • Thus f(sample mean) <= f(population mean), since the minimum occurs at the sample mean
  • Thus sample std < population std (when using n in the denominator)
  • But our goal was to estimate the population std from the sample data.
  • So we bump up the sample std a bit by decreasing its denominator to n-1, bringing the sample std closer to the population std (see the sketch below).
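
A minimal NumPy sketch of this effect (assuming normally distributed data with known variance): draw many small samples and compare the n and n-1 estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population variance (population std = 2)

biased, unbiased = [], []
for _ in range(100_000):
    sample = rng.normal(loc=0.0, scale=true_var ** 0.5, size=10)
    biased.append(np.var(sample, ddof=0))    # divide by n
    unbiased.append(np.var(sample, ddof=1))  # divide by n-1 (Bessel's correction)

print(np.mean(biased))    # ~3.6, systematically below 4.0
print(np.mean(unbiased))  # ~4.0, close to the true variance
```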
Generative vs Discriminative models, Prior vs Posterior probability
  • Prior: Pr(x), the assumed distribution for the parameter to be estimated, without accounting for the observed (sample) data
  • Posterior: Pr(x | observed data), accounting for the observed data
  • Likelihood: Pr(observed data | x)
  • Bayes' rule: P(x | observed data) ∝ P(observed data | x) * P(x), i.e. posterior ∝ likelihood * prior (see the example below)
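
A minimal sketch of the prior-to-posterior update, using a hypothetical coin-bias example with a conjugate Beta prior (SciPy assumed).

```python
from scipy import stats

# Prior belief about the coin's heads probability: Beta(2, 2), centred at 0.5
a_prior, b_prior = 2, 2

# Observed data: 8 heads out of 10 flips
heads, flips = 8, 10

# Conjugate update: posterior is Beta(a + heads, b + tails)
a_post, b_post = a_prior + heads, b_prior + (flips - heads)

print(stats.beta.mean(a_prior, b_prior))  # prior mean = 0.5
print(stats.beta.mean(a_post, b_post))    # posterior mean ~0.71, pulled toward the data
```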

Linear Algebra

Distributions

Inferential Statistics

Notes on p-values, statistical significance
  • p-values

    • 0 <= p-value <= 1

    • The closer the p-value is to 0, the greater the confidence that the null hypothesis (that there is no difference between the two things) is false.

    • Threshold for making the decision: 0.05. This means that if there is no difference between the two things and the same experiment is repeated many times, only 5% of those experiments would yield a wrong decision.

    • In essence, 5% of the experiments, where the observed differences come from random chance alone, will generate a p-value less than 0.05 (the simulation sketch after these notes illustrates this).

    • Thus, we should obtain large p-values if the two things being compared are identical.

    • Getting a small p-value when there is actually no difference is known as a False Positive.

    • If it is extremely important to avoid a false alarm when claiming that the two things are different, we use a smaller threshold like 0.1%.

    • A small p-value does not imply that the difference between the two things is large.

  • Error Types

    • Type-1 error: Incorrectly reject null (False positive)

    • Alpha: Prob(type-1 error) (aka level of significance)

    • Type-2 error: Fail to reject when you should have rejected null hypothesis (False negative)

    • Beta: Prob(type-2 error)

    • Power: Prob(finding a difference when it truly exists) = 1 - beta

    • Having power > 80% for a study is good. Power is calculated before the study is conducted, based on projections.

    • P-value: Prob(obtaining a result as extreme as the current one, assuming null is true)

    • Low p-value -> reject the null hypothesis; high p-value -> fail to reject the null hypothesis

    • If p-value < alpha -> study was statistically significant. Alpha = 0.05 usually
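
A minimal simulation sketch of the type-1 error rate (assuming two identical normal populations and SciPy's two-sample t-test): when the null hypothesis is true, roughly 5% of experiments still give p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
false_positives = 0
n_experiments = 10_000

for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null hypothesis is true
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(false_positives / n_experiments)  # ~0.05, the type-1 error rate (alpha)
```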

Maximum Likelihood Notes
  • Goal of maximum likelihood is to find the optimal way to fit a distribution to the data.
  • Probability: Pr(x | mu, std): the area under a fixed distribution for a range of data values
  • Likelihood: L(mu, std | x): the y-axis value of the distribution curve (which varies as the parameters change) for a fixed data point x (see the sketch below)
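
A minimal sketch of maximum likelihood fitting for a normal distribution (NumPy/SciPy assumed): the closed-form MLE is the sample mean and the n-denominator standard deviation, and scipy.stats.norm.fit returns the same estimates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

# Closed-form MLE for a normal: sample mean and the n-denominator std
mu_hat, std_hat = data.mean(), data.std(ddof=0)

# scipy.stats.norm.fit returns the maximum-likelihood estimates of (mu, std)
mu_fit, std_fit = stats.norm.fit(data)

print(mu_hat, std_hat)  # ~5.0, ~2.0
print(mu_fit, std_fit)  # same values as the closed-form MLE
```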

Statistical Tests

t-Test
- compares 2 means. Works well when the sample size is small. We estimate the population std with the sample std.

- With a small sample, we are less confident that the sampling distribution resembles the normal distribution. As the sample size increases, it approaches the normal distribution (at about n ≈ 30).

- t-value = signal/noise = (absolute diff bet two means)/(variability of groups) = | x1 - x2 | / sqrt(s1^2/n1  +  s2^2/n2)

- Thus, increasing the variance gives you more noise, while increasing the number of samples decreases the noise.

- Degrees of freedom (DOF) = n1 + n2 - 2

- if t-value > critical value (from the table) => reject the null hypothesis (we found a statistically significant difference between the two means)

- Independent (unpaired) samples means the samples are taken from two separate populations. Paired samples are taken from the same population (e.g. measurements before and after a treatment), and we compare the two means.

- In a two-tailed test, we are not sure in which direction the difference will be. With alpha = 0.05, the 0.05 is split into 0.025 on each tail, leaving 0.95 in the middle. Run a one-tailed test if you are sure about the directionality.

- (mu, sigma) are population statistics. (x_bar, s) are sample statistics.
- Calculating the t-statistic when comparing a sample mean with an already known mean: t-statistic = (x_bar - mu) / sqrt(s^2/n). The two-sample t-value above is computed in the sketch below.
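
A minimal sketch of the two-sample t-value computation (SciPy assumed), comparing the hand-computed signal/noise ratio with Welch's two-sample t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(loc=10.0, scale=2.0, size=20)
group2 = rng.normal(loc=12.0, scale=2.0, size=20)

# Manual t-value: signal / noise
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)
n1, n2 = len(group1), len(group2)
t_manual = abs(group1.mean() - group2.mean()) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch's t-test from SciPy (equal_var=False matches the formula above)
t_scipy, p_value = stats.ttest_ind(group1, group2, equal_var=False)

print(t_manual, abs(t_scipy))  # same magnitude
print(p_value)                 # small p-value -> reject the null hypothesis
```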
Z-test
- Z-test uses a normal distribution

- (mu, sigma) are population statistics. (x_bar, s) are sample statistics. 

- z-score = (x-mu)/sigma  // no. of std dev a particular sample (x) is away from population mean
- z-statistic = (x_bar - mu)/ sqrt(sigma^2/n) // no. of std dev sample mean is away from population mean
- t-statistic = (x_bar - mu)/ sqrt(s^2/n) // when population std dev (sigma) is unavailable we substitute with sample std dev (s)

- Use z-stat when pop_std (sigma) is known and n>=30. Otherwise use t-stat.
Z-test example
  • Z-score table
  • Question: Find z-critical score for two tailed test at alpha=0.03
    • This means the rejection area on each tail = 0.03/2 = 0.015
    • So the cumulative area up to the critical point on the right = 1 - 0.015 = 0.985
    • Now find 0.985 in the body of the z-score table and read off the corresponding z value from the table's row and column headers
    • That value ≈ 2.17 (the z-critical score); the snippet below verifies it
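
The same critical value can be read off with SciPy's inverse normal CDF (a minimal sketch).

```python
from scipy import stats

alpha = 0.03
# Two-tailed test: half of alpha goes into each tail
z_critical = stats.norm.ppf(1 - alpha / 2)
print(z_critical)  # ~2.17
```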
Chi-squared test
- chi^2 = sum( (observed-expected)^2 / (expected) )
- The larger the chi^2 value, the more likely the variables are related
- Tests the correlation relationship between two categorical attributes, A and B, where A has c distinct values and B has r
- Contingency table: the c values of A are the columns and the r values of B are the rows
- (A_i, B_j): joint event that attribute A takes on value a_i and attribute B takes on value b_j
- o_ij = observed frequency, e_ij = expected frequency
- Test is based on a significance level, with (r -1)x(c-1) degrees of freedom
- Slides link: https://imgur.com/a/U4uJhHc
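
A minimal sketch of the test on a small hypothetical contingency table, using scipy.stats.chi2_contingency.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = values of B, columns = values of A
observed = np.array([
    [30, 10],
    [20, 40],
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p)   # large chi2 / small p -> the two attributes are likely related
print(dof)       # (r-1) * (c-1) = 1
print(expected)  # e_ij under the independence assumption
```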
Statistical Tests notes
  • ANOVA test: compares >2 means
  • Chi-squared test: compares categorical variables
  • Shapiro Wilk test: test if a random sample comes from a normal distribution
  • Kolmogorov-Smirnov Goodness of Fit test: compares data with a known distribution to check if they have the same distribution
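
Minimal SciPy calls for these tests on synthetic data (a sketch; the groups and sample sizes are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.0, 1.0, 50)
c = rng.normal(0.5, 1.0, 50)

# ANOVA: compares the means of more than two groups
print(stats.f_oneway(a, b, c))

# Shapiro-Wilk: tests whether a sample comes from a normal distribution
print(stats.shapiro(a))

# Kolmogorov-Smirnov: compares a sample against a known distribution
print(stats.kstest(a, "norm"))
```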

Linear Regression & Logistic Regression

Precision, Recall

Important Formulae
  • Sensitivity = True Positive Rate = TP/(TP+FN) = how sensitive is the model, same as recall
  • Specificity = 1 - False Positive Rate = 1 - FP/(FP+TN) = TN/(FP+TN)
  • 'P'recision = TP/(TP+FP) = TP / 'P'redicted Positive = how less often does the model raise a false alarm
  • 'R'ecall = TP/(TP+FN) = TP / 'R'eal Positive = of all the true cases, how many did we catch
  • F1-score = 2 * Precision * Recall / (Precision + Recall) = harmonic mean of precision & recall (see the snippet below)
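
A minimal sketch computing these metrics from a confusion matrix (scikit-learn assumed; the labels are hypothetical).

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                          # TP / Predicted positive
recall = tp / (tp + fn)                             # TP / Real positive (sensitivity)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)  # 0.8 0.8 0.8
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```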

Gradient Descent

Decision Trees & Random Forests

Information Gain
  • Information gain measures the reduction in uncertainty (entropy) after splitting the dataset on a particular feature; the higher the information gain, the more useful that feature is for classification.
  • IG = entropy before splitting - entropy after splitting
  • Entropy = - sum_over_n ( p_i * log2(p_i) )
Gini Index
  • The higher the Gini index, the more randomness (impurity). The attribute/feature with the lowest Gini index is preferred as the root node when building a decision tree.
  • 0: all elements belong to a single class (pure node)
  • close to 1: elements randomly distributed across many classes
  • 0.5: elements uniformly distributed across two classes
  • GI(P) = 1 - sum_over_n( p_i^2 ), where
  • P = (p1, p2, ..., pn), and p_i is the probability of an object being classified to a particular class. (Entropy, information gain and Gini index are computed in the snippet below.)
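
A minimal NumPy sketch computing entropy, Gini index, and the information gain of a hypothetical split.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini index of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical binary split on a feature
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

n = len(parent)
ig = entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
print(entropy(parent), gini(parent))  # 1.0 and 0.5 for a 50/50 split
print(ig)                             # information gain of this split
```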

Loss functions

Cross entropy loss
 - Cross entropy loss for class X = -p(X) * log q(X), where p(X) = prob(class X in target), q(X) = prob(class X in prediction)
 - E.g. labels: [cat, dog, panda], target: [1,0,0], prediction: [0.9, 0.05, 0.05]
 - Total CE loss for multi-class classification is the summation of CE loss of all classes
 - Binary CE loss = -p(X) * log q(X) - (1-p(X)) * log (1-q(X))
 - Cross entropy loss works even for target like [0.5, 0.1, 0.4] as we are taking the sums of CE loss of all classes
 - In multi-label classification target can be [1, 0, 1] (not one-hot encoded). Given prediction: [0.6, 0.7, 0.4]. Then CE loss is evaluated as
   - CE loss A = Binary CE loss with p(X) = 1, q(X) = 0.6
   - CE loss B = Binary CE loss with p(X) = 0, q(X) = 0.7
   - CE loss C = Binary CE loss with p(X) = 1, q(X) = 0.4
   - Total CE loss = CE loss A + CE loss B + CE loss C (computed in the snippet below)
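
A minimal NumPy sketch of the multi-class and multi-label cases described above (the probabilities are taken from the examples).

```python
import numpy as np

def binary_ce(p, q, eps=1e-12):
    """Binary cross entropy for one label: p = target prob, q = predicted prob."""
    q = np.clip(q, eps, 1 - eps)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

# Multi-class example: one-hot target, so the loss reduces to -log q(true class)
target = np.array([1, 0, 0])
pred = np.array([0.9, 0.05, 0.05])
print(-np.sum(target * np.log(pred)))  # = -log(0.9) ~0.105

# Multi-label example: sum of per-label binary CE losses
target_ml = np.array([1, 0, 1])
pred_ml = np.array([0.6, 0.7, 0.4])
print(np.sum(binary_ce(target_ml, pred_ml)))
```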

L1, L2 Regression

PCA, SVM, LDA

PCA
- Create a covariance matrix of the variables. Its eigenvalues and eigenvectors describe the full multi-dimensional dataset.
- Eigenvectors describe the directions of spread; eigenvalues describe how important each direction is in explaining the spread.
- PCA sequentially determines the axes along which the data varies the most.
- All selected axes are eigenvectors of the symmetric covariance matrix, so they are mutually perpendicular.
- The data is then reframed using a subset of the most influential axes, by plotting the projections of the original points onto these axes, thus achieving dimensionality reduction.
- Singular Value Decomposition is a way to find those vectors (see the sketch below).
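
A minimal sketch of PCA via the covariance matrix's eigendecomposition, checked against scikit-learn's SVD-based PCA (NumPy and scikit-learn assumed; the data is synthetic).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 3 correlated features
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.3]])

# Manual PCA: eigendecomposition of the covariance matrix of centred data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]          # sort axes by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the 2 most influential axes and project the data onto them
X_reduced = Xc @ eigvecs[:, :2]

# sklearn finds the same components (up to sign) via SVD
X_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_reduced), np.abs(X_sklearn)))  # True (signs may differ)
```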
SVM
- Margin is the smallest distance between decision boundary and data point.

- Maximum margin classifiers place the decision boundary such that the margin is maximized. This makes them super sensitive to outliers, since a single extreme observation can drag the boundary.

- Thus, when we allow some misclassifications to accommodate outliers, it is known as a Soft Margin Classifier, aka Support Vector Classifier (SVC).

- Soft margin is determined through cross-validation. Support Vectors are those observations on the edge of Soft Margin.

- For 3D data, the Support Vector Classifier forms a plane. For 2D it forms a line.

- Support Vector Machines (SVM) moves the data into a higher dimension (new dimensions added by applying transformation on original dimensions)

- Then, a support vector classifier is found that separates the higher dimensional data into two groups.

- SVMs use Kernels that systematically find the SVCs in higher dimensions.

- Say 2D data transformed to 3D. Then Polynomial Kernels find 3D relationships between each pair of those 3D points. Then use them to find an SVC.

- The Radial Basis Function (RBF) Kernel finds SVCs in infinite dimensions. It behaves like a weighted nearest-neighbour model (the closest observations have the most impact on classification).

- Kernel functions do not actually need to transform points to the higher dimension. They compute the pair-wise relationships between points as if they were in the higher dimension, which is known as the Kernel Trick.

- Polynomial relationship between two points a & b: (a*b + r)^d, where r and d are the coefficient and the degree of the polynomial respectively, both found using cross-validation.

- RBF relationship between two points a & b: exp( -r * (a-b)^2 ), where r, determined using cross-validation, scales the influence (in the weighted nearest-neighbour sense). A comparison of kernels is sketched below.
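
A minimal scikit-learn sketch comparing linear, polynomial, and RBF kernels on data that is not linearly separable (the dataset and hyperparameters are illustrative; C, gamma and degree are normally tuned with cross-validation).

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    # C controls the softness of the margin; gamma plays the role of r in the RBF formula
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", degree=2)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, round(score, 3))  # rbf and poly should clearly beat the linear kernel here
```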

Boosting

Adaboost
- Combines a lot of "weak learners" to make decisions.

- Single level decision trees (one root, two leaves), known as stumps.

- Each stump has a weighted say in voting (as opposed to random forests where each tree has an equal vote).

- Errors that the first stump makes influence how the second stump is made. 
- Thus, order is important (as opposed to random forests, where each tree is made independently of the others, so the order in which trees are made doesn't matter).

- First all samples are given a weight (equal weights initially).
- Then first stump is made based on which feature classifies the best (feature with lowest Gini index chosen).
- Now to decide stump's weight in final classification, we calculate the following. 

- total_error = sum(weights of samples incorrectly classified)
- amount_of_say = 0.5 * log( (1-total_error)/total_error )

- When the stump does a good job (total_error close to 0), amount_of_say is large and positive. When total_error = 0.5, amount_of_say is 0, and when the stump does worse than random guessing, it is negative.

- Now modify the weights so that the next stump learns from the mistakes.
- We want to emphasize correctly classifying the samples that were misclassified earlier.

- new_sample_weight = sample_weight * exp(amount_of_say) => increased sample weight (applied to the misclassified samples)
- new_sample_weight = sample_weight * exp(-amount_of_say) => decreased sample weight (applied to the correctly classified samples)

- Then normalize new_sample_weights.
- Then create a new collection by sampling records, but with a greater probability of picking those which were wrongly classified earlier.
- This is where you can use the new_sample_weights (normalized). After re-sampling is done, assign equal weights to all samples and repeat to find the second stump. (A sketch of one weight-update round follows below.)
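
A minimal NumPy sketch of one round of the weight update described above (the sample weights and misclassifications are hypothetical).

```python
import numpy as np

# Hypothetical first stump: 10 samples, equal initial weights, 3 misclassified
n = 10
weights = np.full(n, 1 / n)
misclassified = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1], dtype=bool)

total_error = weights[misclassified].sum()                    # 0.3
amount_of_say = 0.5 * np.log((1 - total_error) / total_error)

# Increase the weights of misclassified samples, decrease the rest, then normalise
new_weights = np.where(misclassified,
                       weights * np.exp(amount_of_say),
                       weights * np.exp(-amount_of_say))
new_weights /= new_weights.sum()

print(round(amount_of_say, 3))  # ~0.424
print(new_weights)              # misclassified samples now carry more weight
```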
Gradient Boost
- Starts by making a single leaf instead of a stump. For regression, the leaf contains the average of the target variable as the initial prediction.
    
- Then build a tree (usually with 8 to 32 leaves). All trees are scaled equally (unlike AdaBoost, where each stump gets a different weight in the final prediction).

- The successive trees are also based on previous errors like AdaBoost.

- Using the initial prediction, calculate the differences from the actual target values, call them residuals, and store them.

- Now use the features to predict the residuals. 
- The average of the values that finally end up in the same leaf is used as the predicted regression value for that leaf
- (this is true when the underlying loss function to be minimized is the squared residual fn.)

- Then 
- new_prediction = initial_prediction + learning_rate*result_from_tree1
- new_residual = target_value - new_prediction

- new_residual will be smaller than old_residual, thus we are taking small steps towards learning to predict target_value accurately

- Train a new tree on the new_residual, add result_from_tree2 * learning_rate to new_prediction to update it, and repeat (a sketch of this loop follows below).
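
A minimal sketch of this loop using scikit-learn regression trees as the weak learners (the data, learning rate and tree size are illustrative).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # initial leaf: average of the target
trees = []

for _ in range(100):
    residual = y - prediction                      # what the current model gets wrong
    tree = DecisionTreeRegressor(max_leaf_nodes=8)
    tree.fit(X, residual)                          # the next tree predicts the residuals
    prediction += learning_rate * tree.predict(X)  # small step toward the target
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees are added
```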

Quantiles

Clustering

Neural Networks

CNN notes
  • for data with grid like topology (1D audio, 2D image)
  • reduces params in NN through
    • sparse interactions
    • parameter sharing
      • CNN creates spatial features.
      • An image passed through a CNN gives rise to a volume. A section of this volume taken through the depth represents features of the same part of the image
      • Each feature in the same depth layer is generated by the same filter that convolves the image (same kernel, shared parameters)
    • equivariant representation
      • f(g(x)) = g(f(x))
  • Types of layers
    • Convolution layer - the image is convolved using kernels, applied through a sliding window. Depth of the kernel = 3 for an RGB image, 1 for grey-scale
    • Activation layer - applies a non-linear activation (e.g. ReLU) element-wise to the convolution output

Notes V.2

  • Problems with NN and why CNN?

    • The number of weights rapidly becomes unmanageable for large images. For a 224 x 224 pixel image with 3 color channels, a single fully connected neuron already needs around 150,000 weights.
    • MLPs (multi-layer perceptrons) react differently to an input image and its shifted version; they are not translation invariant.
    • Spatial information is lost when the image is flattened for an MLP. Pixels that are close together are important because they help to define the features of an image.
    • CNNs leverage the fact that nearby pixels are more strongly related than distant ones. The influence of nearby pixels is analyzed using filters.
  • Filters

    • reduces the number of weights
    • when the location of these features changes it does not throw the neural network off

    The convolution layers extract features from the input. The fully connected (dense) layers use the features from the convolution layers to generate the output.

  • Why do CNN work efficiently?

    • Parameter sharing: a feature detector in the convolutional layer which is useful in one part of the image, might be useful in other ones
    • Sparsity of connections: in each layer, each output value depends only on a small number of inputs
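
A minimal PyTorch sketch (the framework choice is an assumption) comparing the parameter counts of a fully connected layer and a convolutional layer on a 224 x 224 x 3 input, illustrating why parameter sharing keeps the weight count manageable.

```python
import torch.nn as nn

# 224 x 224 RGB input, flattened: 150,528 values
flat_inputs = 224 * 224 * 3

# A single fully connected layer with 64 hidden units
dense = nn.Linear(flat_inputs, 64)

# A convolutional layer with 64 filters of size 3x3: the same small kernels
# slide over the whole image (parameter sharing)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))  # 9,633,856 weights + biases
print(count(conv))   # 1,792 weights + biases
```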

Activation Function

Time-series Analysis

Feature Transformation

Python Pandas

About

Curated collection of resources for Data Science topics. Complete for interview preparation.