Conditional Probability:
P(a|b) = P(a ^ b) / P(b)
P(a ^ b) = P(a|b)P(b)
P(a ^ b) = P(b|a)P(a)
P(a|b)P(b) = P(b|a)P(a)
P(a|b) = P(b|a)P(a) / P(b) <= Bayes' Rule
P(b|a) = P(a|b)P(b) / P(a) <= Bayes' Rule
Bayes' Rule:
Knowing - P(visible effect|unknown cause)
We can conclude - P(unknown cause | visible effect)
Example:
Knowing - P(medical test result | disease)
We can conclude - P(disease | medical test result)
If Probability of 'a' and 'b' are independent,
P(a ^ b) = P(b)P(a)
P(a|b) = P(b|a)P(a) / P(b)
Bayes rule/formula in odds form:
Prior Odds: 5:95
Likelihood Ratio: P(evidence|H1) / P(evidence|H2) = (80/100) / (10/100) = (80/100) * (100/10) = 80/10 = 8
Posterior Odds = Prior Odds * Likelihood Ratio = (5 * 8):95 = 40:95
25% of the positive tweets contain the word "happy". 13% of the tweets contain the word "happy". 40% of the total #tweets are positive. What's the probability of "happy to learn NLP" being positive?
Knowing - P(visible effect|unknown cause) : P(happy|positive) = .25
P(happy) = .13
P(positive) = .4
P(unknown cause | visible effect) : P(positive|happy) = P(happy|positive) * P(positive) / P(happy) = .25 * .4 / .13 = .769 = 77%
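The worked example above can be checked with a few lines of Python (a quick sketch; the probabilities are the ones stated in the text):

```python
# Bayes' rule: P(positive | happy) = P(happy | positive) * P(positive) / P(happy)
p_happy_given_pos = 0.25
p_happy = 0.13
p_pos = 0.40

p_pos_given_happy = p_happy_given_pos * p_pos / p_happy
print(round(p_pos_given_happy, 3))  # -> 0.769
```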
Prior Ratio = P(pos) / P(neg)
- Important only for unbalanced datasets, since for balanced data the ratio is 1.
Naive Bayes:
- Supervised Learning.
- Shares many similarities with Logistic Regression.
- "Naive" because it makes assumptions about:
(1) The features being independent.
(2) Relative frequencies in the corpora.
- The training dataset is artificially balanced. In reality, the data are not balanced and are much noisier.
- Naive Bayes inference condition rule for binary classification: P(X(i)|pos)/P(X(i)|neg) multiplied for all i.
- Positive if > 1. Negative if < 1. Neutral if 1.
- P(pos) / P(neg) * P(X(i)|pos)/P(X(i)|neg) is the full Naive Bayes formula for binary classification.
- A simple, fast and powerful method to establish a baseline quickly.
- A probabilistic model used for classification.
- To avoid numerical underflow due to multiplication of floating point values < 1, use Log Likelihood.
- lambda = log(P(Xi|pos)/P(Xi|neg))
- log(P(pos)/P(neg)) + sum(log(P(Xi|pos)/P(Xi|neg)))
- Positive if > 0. Negative if < 0. Neutral if 0.
- Example use cases:
- Sentiment analysis
- Author identification.
- Information retrieval.
- P(document[i] | query)
- Word disambiguation.
- P(context1|text) / P(context2|text)
Laplacian Smoothing:
- Used to avoid 0 conditional probability.
- P(X(i)|class) = (freq(X(i), class) + 1) / (N_class + V); V = number of unique words in the vocabulary, N_class = total count of all words in the class.
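A minimal sketch of add-one smoothing, using a hypothetical toy word-count table (the corpus and vocabulary size here are made up for illustration):

```python
from collections import Counter

def smoothed_prob(word, class_counts, vocab_size):
    """Laplacian (add-one) smoothing: unseen words get a small non-zero probability."""
    n_class = sum(class_counts.values())          # total word count in the class
    return (class_counts[word] + 1) / (n_class + vocab_size)

pos_counts = Counter({"happy": 3, "learn": 1})    # hypothetical tiny corpus
vocab_size = 5                                    # unique words across all classes

# "sad" was never seen in the positive class, yet its probability is > 0
print(smoothed_prob("sad", pos_counts, vocab_size))   # 1 / (4 + 5)
```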
Joint Probability:
P(A|b) = P(A, b) / P(b) : Comma signifies ^
= alpha * P(A, b), where alpha = 1/P(b), which is just some constant.
= alpha * <an unnormalized distribution over the values of A; alpha rescales it to sum to 1>
=> Conditional probability is proportional to Joint probability
P(A|b) = P(A, b) / P(b)
=> P(A, b) = P(b)P(A|b)
P(a,b,c,d) = P(d|a,b,c) * P(c|a,b) * P(b|a) * P(a)
By the chain rule, the probability of a sequence of tokens is the product of each token's probability conditioned on all the preceding tokens.
Inclusion-Exclusion:
P(a v b) = P(a) + P(b) - P(a^b)
Marginalization: Calculate the marginal probability of a single variable from given/known joint probabilities.
P(a) = P(a,b) + P(a, Not(b))
P(X = xi) = Summation over all the values (j) y could take on of P(X = xi, Y = yj)
Conditioning: Calculate the marginal probability of a single variable from given/known conditional probabilities.
P(a) = P(a|b)P(b) + P(a|Not(b))P(Not(b))
P(X = xi) = Summation over all the values (j) y could take on of P(X = xi|Y = yj)P(Y = yj)
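Both identities can be verified on a toy joint distribution (the numbers are arbitrary, chosen only so the table sums to 1):

```python
# Toy joint distribution over (A, B); verifies marginalization and conditioning agree.
p_joint = {
    (True, True): 0.20, (True, False): 0.30,
    (False, True): 0.10, (False, False): 0.40,
}

# Marginalization: P(a) = P(a, b) + P(a, not b)
p_a = sum(p for (a, b), p in p_joint.items() if a)

# Conditioning: P(a) = P(a|b)P(b) + P(a|not b)P(not b)
p_b = p_joint[(True, True)] + p_joint[(False, True)]
p_a_given_b = p_joint[(True, True)] / p_b
p_a_given_not_b = p_joint[(True, False)] / (1 - p_b)
p_a_again = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

print(p_a, p_a_again)  # both routes give the same marginal, 0.5
```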
Bayesian Network: Data structure which represents the dependencies among random variables.
- Directed graph
- Each node represents a random variable.
- Arrow from X to Y means X is a parent of Y
- Each node X has a probability distribution P(X | Parents(X))
Inference:
- Query X: variable for which to compute distribution
- Evidence variables E: observed variables for event e
- Hidden variables Y: non-evidence and non-query variable
- Goal: Calculate P(X | e), where e may comprise multiple evidence variables.
Inference by Enumeration:
P(X | e) = alpha * P(X,e) = alpha * Summation over all the values (j) y could take on of P(X, e, y)
X: query variable
e: evidence
y: ranges over values of hidden variables
alpha: normalizes the result.
Cons: Time complexity grows with complexity of the network
Approximate Inference:
- Sampling:
- Rejection Sampling: Filter out samples which do not match event specification
- Likelihood Weighting:
- Start by fixing the values of evidence variables.
- Sample the non-evidence variables using conditional probabilities in the Bayesian Network
- Weight each sample by its likelihood: the probability of all of the evidence.
Markov Assumption: The assumption that the current state depends on only a finite fixed number of previous states.
Markov Chain: A sequence of random variables where the distribution of each variable follows the Markov assumption.
- A type of stochastic model that describes a sequence of possible events.
- The probability of each event depends on the states of previous events.
- Can be depicted as a directed graph.
- Transition probabilities and matrix.
Markov Property: The probability of the next event only depends on the current event.
Hidden Markov Model: a Markov model for a system with hidden states which generate/emit some observed events.
- Markov model with states which are not directly observable.
- Emission probabilities and matrix.
Sensor Markov Assumption: the assumption that the evidence variable depends only on corresponding state.
Use cases:
(1) Part-of-speech (POS) tagging.
Supervised Learning > Classification:
Perceptron Learning Rule: Given a data point (x,y), update each weight according to: w = w + alpha*(y - h(x)) * x
Start with random weights, then learn from the data by updating toward a weight vector that reduces loss.
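The update rule above can be sketched as follows (the step-function hypothesis and the sample values are assumptions for illustration):

```python
import numpy as np

def perceptron_update(w, x, y, alpha=0.1):
    """One step of the perceptron learning rule: w <- w + alpha * (y - h(x)) * x."""
    h = 1.0 if w @ x >= 0 else 0.0   # step-function hypothesis
    return w + alpha * (y - h) * x

w = np.zeros(3)                      # weights, incl. a bias weight
x = np.array([1.0, 2.0, -1.0])       # x[0] = 1 acts as the bias input
w = perceptron_update(w, x, y=0)     # misclassified: h(x) = 1, y = 0 -> weights move
print(w)
```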
Support Vector Machines:
- Maximum Margin Separator: Boundary that maximizes the distance to the nearest data points.
- If the data is not linearly separable, SVMs work in higher dimensions to find the separating boundary, e.g., with shapes other than straight lines.
Optimization Problem Formulation:
- Local Search:
- Algorithms: Hill-Climbing, Simulated Annealing
- Linear Programming
- Algorithms:
- Constraint Satisfaction
- Algorithms: AC3, Backtracking
Supervised Learning > Regression:
Learning a function mapping an input point to a continuous value.
Types of loss functions:
0-1 Loss: Used in discrete classification
L1 Loss: |actual - prediction| -> Used in continuous number prediction. Used when we don't care about outliers.
L2 Loss: (actual - prediction)^2 -> Penalizes worse / bigger loss more harshly. Used when we care about outliers.
Underfitting (High bias):
- J(train) AND J(cv) are high
- The model has a strong preconception about the shape of the data, e.g., assuming a linear function.
- More data will NOT help.
- Example: Training dataset performance: 64%, Test dataset performance: 47%
Variance = a measure of how much the learned model changes with small changes in the training data; high variance means poor generalization to unseen data.
Overfitting (High variance):
- J(train) is low; J(cv) is high
- A model that fits too closely to a particular dataset and therefore may fail to generalize to future data. High variance because, if the model overfits, a tiny change in the training dataset will yield a completely different model.
- This is a side effect of minimizing loss of a model. If loss = 0, the model may only work on the specific data set.
- One way to counter this problem is add other parameters to optimization. For example, consider complexity:
- More data is likely to help.
- One possible cause is data leakage - some of the test data leaks into the training data.
- Example (leakage symptom): Training dataset performance: 93%, Test dataset performance: 99%. The test set outperforming the training set is a red flag for leakage.
Note: Overfitting is sometimes a good thing to begin with. It means the model is learning.
High bias AND high variance:
- J(train) is high; J(cv) higher than J(train)
- Sometimes happens in NNs; less so in linear regression.
- Doesn't really happen for linear models applied to 1-D.
- Underfit and overfit on parts of the data at the same time.
Balanced (Goldilocks zone):
- Example: Training dataset performance: 98%, Test dataset performance: 96%
Cost(h) = Loss(h) + w*Complexity(h) : 'w' gives weight to the complexity
Adding the term w*Complexity(h) is called "Regularization": Penalizing hypotheses that are more complex to favour simpler, more general hypotheses
To overcome overfitting:
(1) Collect more training samples / observations.
(2) Use fewer features.
Many features + insufficient data -> overfitting
(3) Regularization - Gently reduces the impact of some of the features without eliminating them outright. It encourages the model to shrink the values of the parameters without necessarily setting them to 0 which eliminates the features.
Cost(w,b) = (sum((f_w_b(x) - y) ** 2) + lambda * sum(w[j] ** 2)) / 2m
f_w_b <= Linear regression model
sum((f_w_b(x) - y) ** 2) <= Mean squared error. Fit data
sum(w[j] ** 2) <= Regularization term. Keep w[j] small
lambda balances both goals
Example: f_w_b = w1x + w2x^2 + w3x^3 + w4x^4 + b
lambda = 0: Overfitting. Wiggly curve/graph which tries to fit every single data point in the training data set.
lambda = 10^10: f_w_b = b Underfitting
Gradient Descent:
w = w - alpha * (dJ(w) / dw)
Gradient Descent with regularization:
w = w - alpha/m * (sum((f_w_b(x) - y) * x) + lambda * w[j])
= w * (1 - alpha * lambda / m) - (alpha / m ) * sum((f_w_b(x) - y) * x)
alpha: 0.01, lambda: 10, m: 100 => (1 - alpha * lambda / m) = (1 - 0.001) = 0.999
This shows that every iteration of gradient descent, regularization reduces the value of w, thus reduces the impact of the feature.
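The regularized update can be written out directly (a sketch for single-feature linear regression; the toy data is made up):

```python
import numpy as np

def gd_step_l2(w, b, X, y, alpha=0.01, lam=10.0):
    """One gradient-descent step for linear regression with L2 regularization."""
    m = X.shape[0]
    err = X @ w + b - y                       # f_w_b(x) - y for every sample
    # Weight decay form: w * (1 - alpha*lambda/m) - (alpha/m) * sum(err * x)
    w_new = w * (1 - alpha * lam / m) - (alpha / m) * (X.T @ err)
    b_new = b - (alpha / m) * err.sum()       # b is not regularized
    return w_new, b_new

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w, b = gd_step_l2(np.array([0.0]), 0.0, X, y)
```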
- Each iteration/epoch of batch gradient descent processes the entire X(m,n). Choose a batch size so that:
(1) It leverages vectorized processing.
(2) Iterations are fast - too large a batch size takes too long per iteration/epoch.
(3) The batch size is a power of 2 and fits in RAM.
Exponentially Weighted Averages
- V(t) = beta * V(t-1) + (1 - beta) * Theta(t)
- Computes data point with average over 1 / (1 - beta) data points.
- beta: 0.9 -> averages over the last ~10 data points.
- beta: 0.98 -> averages over the last ~50 data points.
- beta: 0.5 -> averages over the last ~2 data points.
- Initial warm-up data will be skewed. Use bias correction to mitigate that: V(t) / (1 - beta^t).
Gradient Descent with Momentum:
- Smooth out the oscillations of gradients towards the global minimum.
- Vdw and Vdb initialized to 0, beta usually 0.9 (Average over last 10 gradients). Bias correction can be skipped/ignored.
- Vdw = beta*Vdw + (1-beta)*dW; W -= alpha*Vdw
- Vdb = beta*Vdb + (1-beta)*db; b -= alpha*Vdb
Analogy to a ball rolling down a bowl:
- beta : friction
- Vdw, Vdb : velocity
- Dw, db : acceleration.
RMSProp (Root Mean Squared Prop):
- Sdw and Sdb initialized to 0, beta usually 0.9 (Average over last 10 gradients). Bias correction can be skipped/ignored.
- Use the beta2 hyperparameter notation to distinguish it from the beta used in Momentum.
- Sdw = beta2*Sdw + (1-beta2)*dW^2; W -= alpha*dw/(sqrt(Sdw) + epsilon)
- Sdb = beta2*Sdb + (1-beta2)*db^2; b -= alpha*db/(sqrt(Sdb) + epsilon)
- Either dW or db could have big oscillations; dividing by the square root of its running average smooths out the W and b updates.
Adam (Adaptive Moment Estimation):
- Combines the best of both worlds - Momentum and RMSProp
- Uses bias corrections for all Vdw, Vdb, Sdw and Sdb, i.e., V / (1 - beta^t)
- W -= alpha*Vdw/(sqrt(Sdw) + epsilon)
- b -= alpha*Vdb/(sqrt(Sdb) + epsilon)
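The combined update can be sketched in a few lines (the hyperparameter defaults here, e.g. beta2 = 0.999, follow common practice and are assumptions, not from the text):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum term (v) + RMSProp term (s), both bias-corrected."""
    v = beta1 * v + (1 - beta1) * dw          # exponentially weighted gradient
    s = beta2 * s + (1 - beta2) * dw ** 2     # exponentially weighted squared gradient
    v_hat = v / (1 - beta1 ** t)              # bias correction for warm-up
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = np.array([1.0]), np.zeros(1), np.zeros(1)
w, v, s = adam_step(w, dw=np.array([0.5]), v=v, s=s, t=1)
```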
Learning Rate Decay:
- Big alpha values in early epochs and slowly decreasing with increasing epoch#
- This enables fast learning during the start of the training and better convergence to global minimum with smaller area of oscillations around it.
Strategy to debugging a learning algorithm:
(1) Get more data - Fixes high variance.
(2) Try smaller sets of features (Simplify polynomial equation by removing some of the higher order components.) - Fixes high variance.
(3) Add more features - Fixes high bias
(4) Add polynomial features - Fixes high bias
(5) Decrease lambda - Fixes high bias
(6) Increase lambda - Fixes high variance.
Cross-validation / development dataset is used to evaluate trained models to see which one is the best choice.
Test dataset is used to evaluate the unbiased performance of the final system before deployed to production.
These 2 must use the same distribution of data.
Feature Scaling: When a feature has a large range of values, the weight/parameter learned by the model will have a small range. In that case, a small change in the weight causes a big change in the cost value.
This makes gradient descent converge slower to its global minimum. To overcome this, scale the input data. Common approaches:
(1) Divide by the maximum value of the data range. This will produce scaled input data of [0, 1]
(2) Mean normalization. This will produce scaled input data of [-1, 1] which centers around 0.
- Find the average value: miu = numpy.mean(X)
- X -= miu
(3) Variance normalization. This will produce features with variance = 1.
- sigma^2 = numpy.sum(X ** 2) / m
- OR: sigma = numpy.std(X)
- X /= sigma
(4) Z-score normalization: (x - mean) / standard deviation.
Mean: sum(data) / len(data)
Standard deviation: sqrt(sum((data[i] - mean) ** 2) / len(data))
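A z-score sketch with numpy (the temperature numbers reuse the Fahrenheit example mentioned below):

```python
import numpy as np

def z_score(X):
    """Z-score normalization: zero mean, unit standard deviation per feature."""
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    return (X - mu) / sigma

X = np.array([[98.6], [100.0], [105.0]])     # e.g. Fahrenheit body temperatures
X_scaled = z_score(X)
```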
Note:
- Sigma = standard deviation
- Sigma^2 = Variance.
- Variance quantifies the spread of a normal distribution.
- Covariance measures the variance between 2 dimensions.
- 0: independent
- -ve: lopsided spread along the y=-x direction
- +ve: lopsided spread along the y=x direction.
Batch Normalization:
- For image data (H x W), normalize each batch x as (x - miu_x) / sigma_x. Do it for every channel.
- NN hidden units' outputs, Z[l], normalized in the same way as the input data X.
- Z_normalized = (Z - miu) / sqrt(sigma^2 + epsilon)
- Z~ = gamma * Z_normalized + beta
- parameter b is cancelled out by mean and "- miu" calculation and is replaced with beta[l] in Z~ = gamma[l] * Z_normalized + beta[l]. So it is OK to drop it from forward propagation.
- If gamma = sqrt(sigma^2 + epsilon) and beta = miu, then Z~[i] = Z[i]. So, in a way, this is learning an identity function.
- Gamma and beta are learnable parameters which help scale the mean and variance away from 0 and 1 to benefit from activation functions.
- Covariate shifts:
- Different batches of data, especially between training/validation/test datasets, could have different distributions.
- For example, colour vs. grey-scale images.
- Changes in the distribution of one variable affect the distribution of other related variables.
- Helps make weights in later layers more resilient to changes in earlier layers due to covariate shifts.
- For example, training on greyscale images and predict on color pictures.
- This decoupling effect helps stabilize the learning process to have more firm ground on the truth functions, f(Y|X).
- Smooths the cost function and speeds up learning.
- Provides a regularization side effect since it usually works on batches of data.
- The mean and standard deviation/variance calculated on batches of data introduce noise similar to dropout.
- For test and prediction, which do NOT use batches of input, use an exponentially weighted average of the batch means and variances (miu and sigma^2) computed during training.
- Diff between training and testing time:
- The batch mean and standard deviation are used for training while the running average statistics from all batches (computed over the entire training dataset) are used for testing.
- The running values are fixed after training.
Rules of thumb: Aim for [-1, 1] input data range. These need to rescale:
(1) [-100 , 100] : Too large
(2) [-0.001, 0.001] : Too small
(3) [98.6, 105]: Too large. Example: Fahrenheit body temperature.
Notes:
Feature scaling usually isn't required for your target variable (labels).
Feature scaling is usually not required with tree-based models (e.g. Random Forest) since they are insensitive to feature scale.
Amount of data used must measure up to the #parameters to fit or train on.
Reinforcement Learning:
Given a set of rewards or penalties, learn what actions to take in the future.
Markov Decision Process:
Model for decision-making, representing states, actions, and their rewards.
Q-Learning:
Method for learning a function Q(s,a), estimate of the value (reward) of performing action 'a' in state 's'. Start with Q(s,a) = 0 for all s,a.
Q(s,a) <- Q(s,a) + w*(new estimate - old estimate): w = 0 keeps the old estimate unchanged; w = 1 replaces it with the new estimate.
Q(s,a) <- Q(s,a) + w*((r + future reward estimate) - Q(s,a))
Q(s,a) <- Q(s,a) + w*((r + w1*max(Q(s',a') for all a')) - Q(s,a)): w1 provides weights for future vs current reward
Balance between Exploration and Exploitation.
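The update rule can be sketched as a tabular method (the two-state world, actions, and reward are hypothetical):

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*((r + gamma*max_a' Q(s',a')) - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * ((r + gamma * best_next) - Q[s][a])

# Hypothetical 2-state world; all Q-values start at 0 as the notes suggest.
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 0.0}}
q_update(Q, "s0", "right", r=1.0, s_next="s1")
print(Q["s0"]["right"])  # 0.5
```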
Function Approximation:
Approximating Q(s,a), often by a function combining various features, rather than storing one value for every state-action pair.
Similar to the depth-limiting approach in MiniMax. Used when it is not feasible to explore all the possible values of Q(s,a) in a bigger state space.
Unsupervised Learning:
Given input data without any additional feedback (label), learn patterns, find structure in the data.
Tasks:
(1) Clustering: Organize a set of objects into groups in such a way that similar objects tend to be in the same group. Example: Genetic research, Image segmentation, Market research, Medical imaging, Social network analysis.
One technique is k-means clustering: Algorithm for clustering data based on repeatedly assigning points to clusters (k number of clusters) and updating those clusters' centers
(2) Anomaly Detection: Find unusual events in the data.
Use case: Fraud detection.
(3) Dimensionality Reduction: Compress large corpora of data to a much smaller dataset without losing key information.
k-means clustering:
(1) Choose a number for 'k' cluster centroids, u[k]. k < len(data)
(2) Randomly initialize the k centroids to values with the same dimension as the data points.
(3) Assign samples to the nearest centroid (k)
for i in range(len(data)):
c[i] = index (from 1 to k) of cluster centroids closest to x[i]
The distance can be calculated with min(|x[i] - u[k]|), or more generally, min(|x[i] - u[k]| ** 2 )
(4) Move the cluster centroids to the average of the points
for j in range(k):
u[j] = average (mean) of points assigned to cluster j
The means can be calculated with sum(x[]) / len(x[])
There will be situations where one of the centroids u[k] has zero data points assigned to it. In this case, either
(1) Remove the empty cluster: k = k - 1
(2) Reinitialize the cluster centroids and restart the algorithm.
Optimization objective:
c[i] = index of cluster (1,2,...,k) to which sample x[i] is currently assigned to.
uk = cluster centroid k
u_c[i] = location of the centroid to which x[i] is currently assigned.
cost function: J(c, u) = sum(|x[i] - u_c[i]| ** 2) / m
Every step along the way reduces J(c, uk).
To avoid poor local minima and choose the best cluster centroids, use multiple random initializations and choose the clustering with the lowest J(c,u), i.e., run the algorithm multiple times, maybe 50 to 1000.
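The loop above can be sketched end to end (a minimal version; the four 2-D points are made up to form two obvious clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then move centroids to the mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = d.argmin(axis=1)                   # nearest-centroid index per point
        for j in range(k):
            if np.any(c == j):                 # skip empty clusters
                centroids[j] = X[c == j].mean(axis=0)
    return centroids, c

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids, assignments = kmeans(X, k=2)
```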
Locality Sensitive Hashing:
- Find neighbours in high-dimensional spaces.
- Use multiple planes and a single hash value.
- Vector perpendicular to a plane is a Normal Vector, P.
- P @ V.T:
- +ve: in the same side as P, -ve: Opposite side of P. 0: On the plane.
- >= 0: h = 1; -ve: h = 0. Use numpy.sign().
- hash value = sum(2^i * h[i]) for all i.
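The multi-plane hashing can be sketched as (the two planes and the test vector are hypothetical):

```python
import numpy as np

def hash_value(planes, v):
    """Multi-plane LSH: one bit per plane (which side of the plane), combined into an integer."""
    dots = planes @ v                 # dot product with each plane's normal vector
    h = (dots >= 0).astype(int)       # >= 0 -> bit 1, otherwise bit 0
    return int(sum(2 ** i * h[i] for i in range(len(h))))

planes = np.array([[1.0, 0.0], [0.0, 1.0]])        # normal vectors of 2 hypothetical planes
print(hash_value(planes, np.array([3.0, -2.0])))   # bits [1, 0] -> 1
```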
Anomaly Detection:
Density Estimation: Model p(x) from data. Probability of x[i] being seen in the dataset.
p(x[i]) < epsilon : Anomaly
p(x[i]) >= epsilon: OK
p = Gaussian distribution - needs the mean and sigma (standard deviation); sigma ** 2 is the variance
To model p(x) with multiple features (N), treat every feature as a Gaussian distribution.
(1) Calculate the mean and sigma of every feature.
(2) p(x) = p(x1; u1,s1) * p(x2; u2,s2) * p(x3; u3,s3) * ... * p(xN; uN,sN)
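Density estimation with per-feature Gaussians can be sketched as (the single-feature data and the epsilon threshold are made up):

```python
import numpy as np

def fit_gaussian(X):
    """Per-feature mean and variance estimated from the training data."""
    return X.mean(axis=0), X.var(axis=0)

def p_x(x, mu, var):
    """p(x) = product over features of the univariate Gaussian density."""
    density = np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(density)

X = np.array([[9.8], [10.0], [10.2], [9.9], [10.1]])   # normal operating data
mu, var = fit_gaussian(X)
epsilon = 1e-3

print(p_x(np.array([10.0]), mu, var) >= epsilon)   # typical sample: not flagged
print(p_x(np.array([50.0]), mu, var) < epsilon)    # far-out sample: flagged as anomaly
```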
Anomaly Detection vs Supervised Learning:
Anomaly Detection is mostly used when:
(1) Very small number of positive examples (y=1). 0-20 is common. Large number of negative (y=0).
(2) Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous samples seen so far (training set)
Examples: Security related applications like fraud, defects detection of manufacturing very complex products like the aircraft engines, monitoring complex systems like in a data center.
Supervised Learning:
(1) Large number of positive and negative examples
(2) Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive samples are likely to be similar to ones in the training set.
Examples: Email spam classification, defects detection of mobile phone manufacturing, weather prediction, disease classification.
Anomaly Detection feature selection:
(1) Choose feature with gaussian distribution property (plt.hist(x)). Otherwise, try to transform it:
(i) log(x + C) : tune the constant C for the most Gaussian-looking histogram
(ii) sqrt(x)
(iii) x ^ 1/3
Anomaly Detection error analysis:
Most common problem: p(x) is comparably large for both normal and anomalous samples. In this case, the algorithm will fail to flag an anomalous test sample. Solutions:
(1) Check the anomalous test data. Create new feature which helps detect the anomaly.
(2) Create new feature based on existing ones using simple equations like x1 / x2 or (x1^2) / x2, etc.
Principal Component Analysis:
(1) Commonly used for visualization. Example: Take data with many features, say 50, 1000 or even more, reduce the #features to a few so as to enable visualization / plotting.
- Commonly used by data scientists to visualize data for analysis.
- Choose feature(s) with meaningful degree of variation.
- Create new feature(s) based on existing ones. Example: length x height
(2) Features should first be normalized to have zero mean before applying PCA.
(3) Features should first be scaled before applying PCA.
(4) It will choose an axis (the Principal Component) through the origin which results in the largest variance when projecting the data onto it (at a 90-degree angle).
- All new axes created will be perpendicular to one another (90-degree)
(5) Less common applications:
(i) Data compression - Reduce storage and/or transmission cost
(ii) Speed up training of a supervised learning model.
(6) Matrices have eigenvectors and eigenvalues.
Eigenvector: Provide direction of uncorrelated features of the data.
Eigenvalue: The variance of the dataset on those features. It tells the amount of information retained by each feature.
Dot-product between embeddings and eigenvectors gives projection on uncorrelated features.
=========================================================================================================================================================================================================================================
Neural Network (multilayer perceptron):
- Scale drives DL progress.
(1) Data - especially labelled data (X,Y).
(2) Computation - Speeds up Idea -> Code -> Experiment iterations.
(3) Algorithms - Algorithmic innovation which makes computation run faster. For example, switching from Sigmoid to ReLU enables gradient descent to run faster by eliminating the long tails of Sigmoid.
- Every neuron in every NN layer computes 2 things:
(1) Z[l](i)
(2) A[l](i) = g(Z)
Non-Linear Activation functions:
- To approximate complex functions / features.
(1) Step function: g(x) = 1 if x >=0, else 0
(2) Logistic Sigmoid:
- g(x) = e^x / (1 + e^x)
- The derivative is minuscule when z is a large negative or positive number. Not good for gradient descent.
- Never use this in the hidden layers. Only in the output layer for binary classification.
- Derivative = a * (1 - a)
(3) tanh:
- (e^z - e^(-z)) / (e^z + e^(-z))
- Similar to sigmoid graph but the output is [-1, 1].
- Keeps the sign of the input, i.e., if z > 0, g(z) > 0; if z < 0, g(z) < 0.
- Center at (0,0) means easier for the NN to learn.
- The derivative is minuscule when z is a large negative or positive number. Not good for gradient descent.
- Derivative = 1 - a ** 2
(4) Rectified Linear Unit (ReLU):
- g(z) = max(0, z)
- Derivative = 1 if z > 0; 0 otherwise.
- Preferred default choice for hidden layers.
(5) Leaky Rectified Linear Unit (Leaky ReLU):
- g(z) = max(0.01 * z, z)
- Derivative = 0.01 if z <= 0; 1 otherwise.
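The activation functions and derivatives listed above, as short numpy functions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # equivalent to e^z / (1 + e^z)

def d_sigmoid(z):
    a = sigmoid(z)
    return a * (1 - a)                   # derivative = a * (1 - a)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2           # derivative = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))        # negative inputs clipped to 0
print(leaky_relu(z))  # negative inputs scaled by 0.01 instead of zeroed
```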
Choices of Activation Function:
(1) Output layer: Depends on the nature of the label:
(i) Binary classification: Sigmoid
(ii) -/+: Linear. For example, stock price change
(iii) Only positive numbers: ReLU
(2) Hidden layers: Mostly use ReLU. Reasons:
(i) Simpler calculation compared to Sigmoid. This makes the NN run faster.
(ii) Sigmoid has 2 horizontal asymptotes, where derivatives are close to 0, making gradient descent slower to reach the global minimum.
Do NOT use a linear/identity activation function, g(z) = z, in hidden layers. This collapses the NN into a linear function and defeats its purpose: the composition of multiple linear functions is itself a linear function.
The "off" or disable feature of the ReLU activation enables models to stitch together linear segments to model complex non-linear functions.
(3) Sigmoid/Softmax [0,1] and tanh [-1, 1] are seldom used in hidden layers due to the asymptotic tails of 0 and 1 and therefore, causing vanishing gradients problems.
(4) ReLU: Derivative = 0 for z <= 0
- Dying ReLU problem - nodes get stuck outputting 0, their weights stop updating, and they stop learning.
Types of Layers:
(1) Dense: Every neuron in a layer computes its output/activation using ALL inputs/activations from ALL neurons in the previous layer.
(2) Convolution: Neurons only look at regions of the data. Benefits:
(i) Faster computation.
(ii) Need less training data - less prone to overfitting.
Dimensions:
- n[l] = #units at layer l
- W[l] = dW[l] = (n[l], n[l-1])
- b[l] = db[l] = (n[l], 1)
- A[l] = (n[l], 1)
(1) Stack the inputs as rows (m = #samples):
Z = X @ W.T + b
A = g(Z)
Z: (m, n[l])
A: (m, n[l])
W: (n[l], n[l-1])
X: (m, n[l-1])
b: (1, n[l]) - broadcast across the m rows
Shape check: Z = X @ W.T + b
(m, n[l]) = (m, n[l-1]) @ (n[l], n[l-1]).T + (1, n[l])
= (m, n[l-1]) @ (n[l-1], n[l]) + (1, n[l])
= (m, n[l]) + (1, n[l]) <- broadcasts to (m, n[l])
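The shape bookkeeping above can be verified directly (the dimensions are arbitrary):

```python
import numpy as np

m, n_prev, n_l = 4, 3, 2          # 4 samples, 3 inputs, 2 units in layer l
X = np.random.randn(m, n_prev)    # inputs stacked as rows
W = np.random.randn(n_l, n_prev)  # one row of weights per unit
b = np.zeros((1, n_l))            # broadcasts across the m rows

Z = X @ W.T + b
print(Z.shape)                    # (4, 2)
```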
Deep vs Shallow NN:
- One comparison is using the Circuit theory to calculate some mathematical function like XOR.
- Deep NN requires O(ln(N)) complexity.
- 1-layer requires exponentially more hidden units to compute. O(2^N) complexity.
h(x1,x2) = g(w0 + w1x1 + w2x2 + ...)
Example:
(1) To model the OR function:
h(x1,x2) = g(-1 + x1 + x2); w0 = -1, w1=w2=1
(2) To model the AND function:
h(x1,x2) = g(-2 + x1 + x2); w0 = -2, w1=w2=1
Training - to calculate the parameters:
- Trade-offs between speed and accuracy
(1) Gradient Descent: Algorithm for minimizing loss when training NN
Start with a random choice of weights, repeat:
- Calculate the gradient based on ALL data points (training data set): direction that will lead to decreasing loss
- Update weights according to the gradient
(2) Stochastic Gradient Descent: Same as Gradient Descent but "... based on ONE data point"
(3) Mini-Batch Gradient Descent: Same as Gradient Descent but "... based on ONE SMALL BATCH of data points"
Perceptron: Only capable of learning linearly separable decision boundary
Multilayer NN: Artificial NN with an input layer, an output layer, and at least one hidden layer. Capable of modelling more complex problems compared to perceptron.
To train multi-layer NN, we have to propagate the loss/error from the output back to the hidden layers:
Back Propagation: Algorithm for training NN with hidden layers
- Start with a random choice of weights, repeat:
- Calculate error for output layer
- For each layer, starting with output layer, and moving inwards towards the earliest hidden layer
- Propagate error back one layer
- One step of back propagation will provide a derivative of the output variable with respect to the input variable at that layer.
- Using calculus chain rule, i.e., dJ/da == dJ/dv * dv/da, Backpropagation helps calculate derivative of final output, J, with respect to the input, a.
To prevent underfitting (high bias):
(1) Use a larger network.
To prevent Overfitting (high variance):
(1) Collect more data
(2) Choose lambda appropriately.
Dropout:
- Temporarily removing units - selected at random - from a NN to prevent over-reliance on certain units.
- It spreads the weights across all activations instead of relying on any specific feature, and this shrinks the weights. Provides an effect similar to L2 regularization.
- Prominent implementation is Inverted Dropout:
- Layers: 3
- keep_prob = 0.8
- dropout = numpy.random.rand(A[l].shape[0], A[l].shape[1]) < keep_prob
- A[l] *= dropout <- This shuts down a fraction (1 - keep_prob) of the activations.
- A[l] /= keep_prob <- This rescales the kept values so the expected value of A[l] is unchanged. This is the "inverted" part of inverted dropout.
- J is not well defined with dropouts. This affects debugging of J over epochs.
(3) Data augmentation.
- Apply library functions on real photos
- Use GAN.
- Combination of both - RandAugment
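The inverted dropout recipe above can be sketched in NumPy (keep_prob = 0.8; the activation matrix here is made up):

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
A = np.random.rand(4, 5)                      # activations A[l] of some layer

mask = np.random.rand(*A.shape) < keep_prob   # True for units that are kept
A_drop = A * mask                             # shuts down a fraction (1 - keep_prob) of units
A_drop = A_drop / keep_prob                   # "inverted" step: keeps E[A[l]] unchanged
```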
Random Weights Initialization:
- If initialized to zeros, all the hidden units will compute the same function. dW will have identical rows after every update, so the rows of W remain identical (the symmetry is never broken).
- b is ok to initialize to zeros.
- W = numpy.random.randn(m, n) * 0.01
- * 0.01 so as NOT to generate big numbers which will make NN slow to learn when using sigmoid or tanh activation functions.
Vanishing / Exploding Gradients:
- Deep NN are very prone to this. If the weights are slightly < 1, activations/gradients shrink (vanish) going deeper; if slightly > 1, they grow (explode), due to the repeated W @ X.
- To help with this, use random initialization:
- n: #inputs of the neuron
(1) ReLU activation:
- W[l] = numpy.random.randn(n[l], n[l-1]) * numpy.sqrt(2 / n[l-1]) <- "He Initialization" (He et al. 2015)
(2) tanh activation:
- W[l] = numpy.random.randn(n[l], n[l-1]) * numpy.sqrt(1 / n[l-1]) <- Xavier initialization.
- Vanishing/exploding gradients manifest as a model with low accuracy and worsening performance over time.
playground.tensorflow.org
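He and Xavier initialization wrapped in a small helper (shape convention W[l]: (n[l], n[l-1]); the layer sizes below are made up):

```python
import numpy as np

def init_weights(n_out, n_in, activation="relu", rng=None):
    """He init (sqrt(2/n_in)) for ReLU, Xavier (sqrt(1/n_in)) for tanh."""
    if rng is None:
        rng = np.random.default_rng(0)
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    return rng.standard_normal((n_out, n_in)) * scale

W1 = init_weights(64, 128, "relu")   # sample variance close to 2/128
W2 = init_weights(64, 128, "tanh")   # sample variance close to 1/128
```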
Image Convolution: Applying a filter that adds each pixel value of an image to its neighbours, weighted according to a kernel matrix.
- Extracts features from input. Output is a feature map
Convolutional NN: NN that uses convolution, usually for analyzing images (i.e., Image Convolution + Pooling in a NN)
- Training is done to figure out the best filters for the input image
- Models how humans look at images
Input image -> [Convolution (calculates filters) -> Pooling (Summarize and reduces the size of inputs)] -> Flattening (Feeds into the inputs of NN)
[Convolution -> Pooling] can be applied multiple times before Flattening. Early layers of this to discover low-level/primitive features (Edges, Curves, Shapes, low-level/primitive audio waveform); later layers to discover high-level features (objects, person, face, basic units of sounds like phonemes, etc)
Note: In a convnet, the higher up a layer is, the more specialized it is. The first few layers in a convnet learn very simple and generic features, which generalize to almost all types of images. But as you go higher up, the features are increasingly specific to the dataset that the model is trained on.
Feed-forward NN: Good for classification. 1:1 mapping from input to output -Single-valued output.
Recurrent NN: Output from the NN is fed back to its input for future calculations. It maintains some state. Good for dealing with sequences of data, both input and/or output. N:N mapping from input to output - Output sequence of values.
Example applications: Youtube to analyze videos, Google translator, etc
One example is Long Short Memory Network
=========================================================================================================================================================================================================================================
File formats for training and evaluation.
For large data sets, load the data from the data warehouse into SSD storage (cloud or local). This prevents underutilization of GPU/CPU resources during the training/evaluation process.
JSONL: JSON Lines is a simple text-based format with rows. It is human readable and an ideal choice for small to medium-sized datasets.
TFRecord: Binary format and easier to read for computers, ideal for efficient training.
Parquet: Good for large and complex datasets.
=========================================================================================================================================================================================================================================
Data Augmentation:
(1) Image: replicate and distort/transform original data sets
(2) Voice: Add noisy background, audio on bad connection, etc.
Data Synthesis:
(1) Image OCR
Transfer Learning:
(1) Feature-based:
- Use representations/weights from a pre-trained model as features in a completely new model.
- Example: word embeddings.
- The new model still needs to be trained from scratch.
(2) Fine-tuning:
- Fine tune final layers of pre-trained model on similar tasks.
Example: A NN trained on 1 million images of diverse object types (animal, fruits, plants, people, etc). You want to come up with a NN for character OCR.
Make a copy of the NN, replace the output layer with a 10-unit layer. Then, there are 2 options to train the "new" NN:
(1) Fine Tuning. Only train output layer parameters - This works if having a small data set.
(2) Train all parameters - This works with a huge data set. The initial phase of training is called pre-training.
The NN trained on 1 million images is called "supervised pretrained model".
The "new" copy of the NN is called "fine-tuning".
This works because layers of neurons have learnt different parts of the images. This could generalize to other types of images not found in the original dataset.
Only use this when:
(1) The same input types: text, images, audio.
(2) Original task has a lot more data than the new task.
(3) Low level features from original task could be helpful to learn the new task.
Use-case examples: GPT-3, BERT, ImageNet, Speech recognition -> wake/trigger word detection, Image classification -> radiology classification.
Multi-task Learning:
- One NN does several things at the same time. Every task, hopefully, helps all of the other tasks.
- Use the same NN when all the tasks share similar or the same low-level/primitive features / early layers of the NN. The output layer will be y^ with shape (n,m). n: #tasks, m: #samples.
- Loss function: J = sum over i in [1:m] ( sum over j in [1:n] ( L(y^(j)(i), y(j)(i)) ) )
- L is the usual logistic loss: -y(j)(i) * log(y^(j)(i)) - (1 - y(j)(i)) * log(1 - y^(j)(i))
- Benefit: Works well for partially labelled data. The inner sum over j only includes entries with a valid label (0 or 1) and omits missing/unlabelled entries.
Only use this when:
- Tasks having similar low-level/primitive features.
- Amount of data for each task is quite similar. The aggregate of the other N-1 tasks' data should be a lot more than any single task's data. Symmetrically, the data of N-1 tasks augments the training of every other task.
- Can train a big enough NN to do well on ALL of the tasks. This NN should perform better than having a separate NN for each of the tasks.
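A sketch of the partially labelled multi-task loss, with missing labels marked as NaN and masked out of the sum (the array values here are made up):

```python
import numpy as np

def multitask_log_loss(Y_hat, Y):
    """Sum of logistic losses over (task, sample) pairs that have a label;
    NaN entries in Y mark unlabelled pairs and are skipped."""
    mask = ~np.isnan(Y)
    y, y_hat = Y[mask], Y_hat[mask]
    eps = 1e-12  # numeric guard against log(0); not part of the notes' formula
    return float(np.sum(-y * np.log(y_hat + eps) - (1 - y) * np.log(1 - y_hat + eps)))

Y     = np.array([[1.0, np.nan],   # 2 tasks x 2 samples; one label is missing
                  [0.0, 1.0]])
Y_hat = np.array([[0.9, 0.5],
                  [0.2, 0.8]])
loss = multitask_log_loss(Y_hat, Y)   # only the 3 labelled entries contribute
```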
End-to-End Deep Learning:
- A NN that replaces a traditional multi-staged learning pipeline and removes intermediate steps.
- Only use this when there is a large enough end-to-end dataset (X,Y).
- Carefully choose what types of X->Y mappings to learn depending on what tasks data could be obtained for.
Pros:
(1) Let the data speak for itself.
(2) Reduce hand-crafted / manual components. Don't force the ML to learn the archaic ways.
Cons:
(1) Excludes potentially useful hand-crafted / manual components. This is especially true when there is a lack of end-to-end data; the NN doesn't have sufficient data to learn the features to map X to Y in that case.
(2) Sometimes divide-and-conquer works better. Need to weigh the relative complexity vs data availability.
Skewed Datasets:
Precision = #True positives / (#True positives + #False positives) = #True positives / (#Total predicted positives)
Recall = #True positives / (#True positives + #False negatives) = #True positives / (#Total actual positives) - Helps detect if the learning algorithm is predicting negatives all the time because the value will be zero.
0 Precision and/or Recall is bad.
For use cases with skewed classes or a rare class, decently high precision and recall values help reassure the usefulness of the learning algorithm.
Trade-off between Precision and Recall:
Logistic regression: 0 <= F(w,b) <= 1
Predict 1 if F(w,b) >= threshold
Predict 0 if F(w,b) < threshold
To predict 1 only if very confident, use high value of threshold. This results in high precision, low recall
To predict 1 even when in doubt, use low value of threshold. This results in low precision, high recall
Picking the threshold is not something that can be done with cross-validation. It's up to the business needs / use-case.
To automatically decide on the best learning algorithm without having to manually select between Precision and Recall, choose the algorithm with the highest F1 score.
F1 score = 2 * PR / (P + R) <= Harmonic mean. A mean calculation which pays attention to the lower value.
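As a worked example of these metrics on a skewed dataset (the counts below are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)          # of all predicted positives, how many are right
    recall    = tp / (tp + fn)          # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

# a classifier that rarely predicts positive: decent precision, terrible recall
p, r, f1 = precision_recall_f1(tp=10, fp=5, fn=90)
# p = 0.667, r = 0.1, f1 = 0.174 - the harmonic mean is dragged toward the lower value
```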
Classification model evaluation metrics:
(1) Accuracy
(2) Area under ROC (Receiver Operating Characteristics) curve - binary classification models.
- Comparison of a model's true-positive rate to false-positive rate at different classification thresholds.
- The AUC metric tells you how well your model is at choosing between classes (for example, how well it is at deciding whether someone has heart disease or not). A perfect model will get an AUC score of 1.
(3) Confusion matrix
(4) Classification Report
Regression model evaluation metrics:
(1) R^2
(2) Mean absolute error (MAE)
(3) Mean squared error (MSE)
Tune model hyperparameters on Evaluation dataset.
3 ways to adjust hyperparameters:
(1) Manually
(2) Randomly with RandomizedSearchCV
(3) Exhaustively with GridSearchCV - Brute force, trying every single combination.
Use appropriate scale for fair random search.
- Some parameters, like beta in the Exponentially Weighted Average algorithm, are sensitive when very close to 1. A linear scale will not search fairly within the tiny range of valid values of concern.
- Use log scale. Example for learning_rate [0.0001, 1]:
- r = -4 * numpy.random.rand() <= r in [-4, 0]
- alpha = 10 ** r <= alpha in [0.0001, 1]
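The log-scale sampling above, checked over many draws:

```python
import numpy as np

rng = np.random.default_rng(0)
r = -4 * rng.random(1000)      # r uniform in [-4, 0]
alpha = 10.0 ** r              # alpha uniform on a log scale over [1e-4, 1]

# each decade (e.g. [1e-4, 1e-3)) now receives roughly a quarter of the samples,
# instead of the tiny sliver a linear scale over [1e-4, 1] would give it
frac_smallest_decade = np.mean(alpha < 1e-3)
```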
2 ways to save and load models:
(1) With Python pickle module.
(2) With joblib module
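A minimal pickle round trip (joblib.dump / joblib.load have the same call shape; the model dict here is a stand-in for a real trained model object):

```python
import os
import pickle
import tempfile

model = {"weights": [0.1, 0.2], "bias": 0.5}     # stand-in for a trained model

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)                        # save
with open(path, "rb") as f:
    loaded = pickle.load(f)                      # load
```

joblib is generally preferred over pickle for models that hold large NumPy arrays.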
=========================================================================================================================================================================================================================================
Decision Tree Learning:
- Can model non-linear associations / capture non-linear relationships in data.
- Boundaries are always vertical or horizontal
Decision 1: How to choose what feature to split on at each node?
Goal: Maximize purity / Minimize impurity - the decision which gives the fastest path to leaf nodes.
Decision 2: When do you stop splitting?
(1) When a node is 100% one class
(2) Tree depth. Deep tree could result in overfitting / high variance.
(3) When improvement on purity score are below a threshold.
(4) When the #samples in a node is below a threshold.
Entropy as a measure of impurity
p1 = Fraction of samples that are positive
p0 = 1 - p1
H(p1) = -p1log2(p1) - (1 - p1)log2(1 - p1)
= -p1log2(p1) - p0log2(p0)
Note:
(1) Use log2 so that the symmetric, bell-shaped curve ranges from 0 to 1
(2) 0log2(0) is taken as 0 by convention <- log(0) alone is negative infinity
Decision 1: Goal: Reduce entropy, H(p1). The reduction in entropy is the Information Gain.
Information Gain = H(p1root) - (wl * H(p1l) + wr * H(p1r))
wl = weight of left branch = #left samples / #root samples
wr = weight of right branch = #right samples / #root samples
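Entropy and information gain as defined above, on a made-up split (root 5/10 positive, left branch 4/5 positive, right branch 1/5 positive):

```python
import numpy as np

def entropy(p1):
    if p1 in (0.0, 1.0):       # 0 * log2(0) taken as 0 by convention
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

def information_gain(p1_root, p1_left, p1_right, w_left):
    w_right = 1 - w_left
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

gain = information_gain(p1_root=0.5, p1_left=0.8, p1_right=0.2, w_left=0.5)
```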
Regression Tree: Decision Tree which predicts a number.
- Use variance of the sample labels to decide which features to split on.
- Split on feature which gives highest reduction in variance (highest information gain)
Information Gain = V(p1root) - (wl * V(p1l) + wr * V(p1r))
Decision Tree Classifier:
- Partitioning the input feature space into regions.
Any single decision tree is highly sensitive to small change in the data. Solution is to use tree ensemble to let multiple trees vote on the final prediction.
(1) Use sampling with replacement to build data for tree ensemble. This is called "bagged" decision tree (https://medium.com/data-science/understanding-sampling-with-and-without-replacement-python-7aff8f47ebe4)
(2) Random forest - Randomizing the feature selection. Select from k < n features at each node. k is usually sqrt(n). Works better for large feature set.
(3) XGBoost - Instead of picking from all examples with equal 1/m probability, make it more likely to pick misclassified samples from all the trees constructed so far.
Decision Trees vs. Neural Networks
Decision Trees and Tree Ensembles:
(1) Works well on structured data
(2) Not recommended for unstructured data (video, audio, images, text)
(3) Fast to train
(4) Small decision trees may be human interpretable.
NN:
(1) Works well on ALL types of data and the combination of them.
(2) May be slower to train compared to decision trees
(3) Works with transfer learning - Can use pretrained models to train the output layer with fewer data.
(4) Can string together multiple NN when building a system of multiple models working together.
=========================================================================================================================================================================================================================================
Collaborative Filtering: Recommend items to a user based on ratings of other users who gave similar ratings as the user.
Content-based Filtering: Recommend items to user based on features of both the user and the items to find a good match.
- R[i,j]=1 if user[j] has rated item[i]
- Y[i,j] ratings of user[j] on item[i]
User and item will have different sizes of features. Prediction can be simplified to numpy.dot(W[j], X[i]). bias doesn't affect the performance of the model much.
For the calculation to work, need to transform the feature vectors to vectors of same dimension. Use NN to achieve this. Call it User NN and Item NN with the output layers having the same #neurons.
These 2 NNs are trained as a single NN with ONE cost function J = sum((dot(Vu[j], Vm[i]) - Y[i,j]) ** 2) + regularization, over all (i,j) with R[i,j] = 1 (use a tf.keras.layers.Dot layer)
Recommendation from LARGE Catalogue - can be done with 2 steps: Retrieval and Ranking.
Retrieval:
(1) Generate a large list of plausible item candidates:
- For each item purchased by the user, find 10 most similar items. This can use overnight batch to calculate the min(|V[k] - V[i]| ** 2)
- For most viewed 3 categories/genres, find top 10 items.
- Top 20 items in the country/region.
(2) Combine the retrieved items into a list, removing duplicates and items already purchased.
Ranking:
(1) Use the NN to calculate predictions on the retrieved list. Only need to calculate the missing V[j] or V[i] for the final dot product. Can be done on a separate NN. This should be quick.
(2) Display ranked items to the user.
To speed up the response times of the system, pre-compute vectors V[i] for ALL items that it might recommend. This can be done even before a user logs in or even before Xu or Vu vector is computed.
If 2 items have vectors V[i] and V[k] which are similar to one another, i.e., |V[k] - V[i]| is small, a conclusion could be drawn that they will be liked by similar users.
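The "similar items" retrieval step sketched with squared distances between item vectors (the vectors here are made up):

```python
import numpy as np

def most_similar(V, i, k=2):
    """Indices of the k items whose vectors minimize |V[k] - V[i]| ** 2."""
    d2 = np.sum((V - V[i]) ** 2, axis=1)
    d2[i] = np.inf                 # exclude the item itself
    return np.argsort(d2)[:k]

V = np.array([[1.0, 0.0],          # item 0
              [0.9, 0.1],          # item 1
              [0.0, 1.0],          # item 2
              [1.05, 0.05]])       # item 3
neighbours = most_similar(V, i=0)  # -> items 3 and 1, in that order
```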
=========================================================================================================================================================================================================================================
Reinforcement Learning
State action value function (Q-function)
Q(s,a) = Return if it, the agent:
(1) Start at state s
(2) Take action a (ONCE)
(3) Behave optimally after that: max(Q(s', a'))
The best possible return from state s is max(Q(s,a))
The best possible action in state s is the action which gives max(Q(s,a))
Policy, pi(s) = a. Its job is to take as input any state, s, and map it to some action, a, that it wants the agent to take in that state.
Goal of Reinforcement learning: Find a policy, pi, which tells the agent what action (a = pi(s)) to take in every state (s) so as to maximize the return.
Bellman Equation: Q(s,a) = R(s) + gamma * max(Q(s', a'))
gamma is discount factor [0, 1]: smaller value means higher discount and makes the agent impatient, larger value makes it more patient to obtain reward at later steps. Usually close to 1. Example, 0.9, 0.99, etc.
In a random/stochastic environment, the same policy can produce many different sequences of rewards due to uncontrolled environment / misstep probabilities.
In this case, choose a Policy (pi) which maximizes the average/expected sums of discounted rewards.
Bellman Equation: Q(s,a) = R(s) + gamma * E(max(Q(s', a'))); E = Average
In continuous state-space, s is a vector of numbers (some continuous numbers, some binary). Encode the actions (a) using one-hot feature vector.
Use Deep Reinforcement Learning to compute Q(s,a) at the current state (s). X = [s a]. In a state, s, use the NN to compute Q(s, a) of all possible actions to take. Then, pick the action which maximizes Q(s,a).
Use the Bellman Equation to generate training datasets: x = (s, a); y = R(s) + gamma * max(Q(s', a'))
Learning Algorithm:
Input: [s, a]; Output: Q(s,a)
Initialize NN randomly as guess of Q(s,a).
Repeat {
Take actions with the agent. Get (s,a,R(s),s') tuples. <= **
Store 10,000 most recent tuples. (Replay Buffer)
Train NN:
Create training set of 10,000 samples using x = (s,a); y = R(s) + gamma * max(Q(s', a'))
Train Qnew such that Qnew(s,a) ~= y (f_w_b(x) ~= y)
Set Q = Qnew ***
}
Q will improve after every training round. This is Deep Q-Learning / Deep Q-Network (DQN) and can be trained by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation.
To be more efficient, X = [s] and Y = Q(s, a) of all possible actions. The output layer has a number of neurons equal to #actions. With this, a single inference generates Q(s,a) for all possible actions. Then, the max(Q(s',a')) term is readily available.
Using neural networks in reinforcement learning to estimate action-value functions has proven to be highly unstable. Here is why:
Bellman equation error: Y - Q(s,a;w) = R(s) + gamma * max(Q(s', a'; w)) - Q(s,a; w).
Notice that Y will keep changing in every iteration. Having a constantly moving target can lead to oscillations and instabilities. To avoid this, we can create a separate neural network for generating the Y targets. We call this separate neural network the target Q-Network and it will have the same architecture as the original Q.
Steps:
Every C time steps, use the Q-target-Network to generate the Y targets, and update the weights of the Q-target-Network using the weights of the Q-Network via a soft update.
To avoid instabilities, use a Target Network and Experience Replay.
** How to choose actions while still learning? 2 options:
(1) Pick the action a which maximizes Q(s,a) - Because of random initialization, the NN might get stuck in its own misconception that some actions (a) are *always* bad and, therefore, would never get a chance to learn them well.
(2) epsilon-greedy policy (e = 0.05)
(i) Greedy/Exploitation: With probability 0.95, pick the action a which maximizes Q(s,a)
(ii) Exploration: With probability 0.05, pick an action randomly.
Start e high (1.0) and gradually decrease it to 0.01 over the course of the training.
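A sketch of the epsilon-greedy choice (the Q-values and epsilon here are made up):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the current Q estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration: random action
    return int(np.argmax(q_values))               # exploitation: greedy action

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])
actions = [epsilon_greedy(q, 0.05, rng) for _ in range(1000)]
# the greedy action (index 1) dominates, but the others still get tried occasionally
```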
Mini-batches help in both Supervised and Reinforcement Learning when the dataset is large: each iteration of batch gradient descent computes sum((f_w_b(X) - y) ** 2) / 2m over the whole dataset, which is computationally expensive and slow to converge. Mini-batch gradient descent exhibits some unreliable ("noisy") behaviour in the descent, but it eventually converges faster than full-batch learning.
*** To make the NN converge more reliably, soft-update learning will help:
(1) Prevent abrupt changes to the NN
(2) Prevent one worse learning step overwrite the existing better NN
Q(w,b) = Qnew(w_new, b_new)
W = 0.01 * Wnew + 0.99 * W
B = 0.01 * Bnew + 0.99 * B
0.01 and 0.99 add up to 1 and are the hyperparameters which control how aggressively or gradually the training updates the NN parameters.
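The soft update applied parameter-wise (tau = 0.01, matching the 0.01/0.99 split above; the weight shapes are made up):

```python
import numpy as np

def soft_update(target_params, new_params, tau=0.01):
    """target <- tau * new + (1 - tau) * target, for each parameter array."""
    return [tau * new + (1 - tau) * old for old, new in zip(target_params, new_params)]

W_target = [np.zeros((2, 2)), np.zeros(2)]   # current target-network parameters
W_new    = [np.ones((2, 2)),  np.ones(2)]    # freshly trained Qnew parameters
W_target = soft_update(W_target, W_new)      # every entry nudged from 0 toward 1
```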
In machine learning, logits are the raw, unnormalized scores or outputs from a neural network's final layer before an activation function like softmax or sigmoid is applied.
Binary Classification uses Sigmoid activation function and BinaryCrossEntropy loss function.
Multi-class Classification uses Softmax activation function and CategoricalCrossEntropy loss function.
Input: 128x128 greyscale image
Dense: 256 neurons
#parameters = 128 * 128 * 256 + 256 = 4194560 parameters
================================================================================================================================================================================
Convolution (C4_W1.pdf):
Early/Shallow layers often see the finer details with simpler shapes and patterns in smaller regions of the input (low-level/primitive features) such as edges and simple textures.
Later/Deeper layers often see more complex shapes and patterns in larger regions of the input (high-level features) such as more complex textures and object classes.
In other words, early/shallow layers zoom in, later/deeper layers zoom out.
Vertical edge detection filter: [[1, 0, -1],
[1, 0, -1],
[1, 0, -1]]
Horizontal edge detection filter: [[1, 1, 1],
[0, 0, 0],
[-1, -1, -1]]
Sobel (vertical) filter: [[1, 0, -1],
[2, 0, -2],
[1, 0, -1]]
Scharr (vertical) filter: [[3, 0, -3],
[10, 0, -10],
[3, 0, -3]]
NN to learn the filter parameters: [[w1, w2, w3],
[w4, w5, w6],
[w7, w8, w9]] to detect various types of edges at various orientations.
Padding:
- Usually the middle of the input images is emphasized more than the edges.
- To give equal emphasis/importance to the edges, add padding around the input image.
To prevent data source shrinking, use padding.
"Valid" convolution:
- No padding.
- nxn * fxf => n-f+1 * n-f+1
"Same" convolution:
- Use padding, p, so that the output size is the same as the input size.
- n+2p-f+1 * n+2p-f+1
- for output size equals to n, n+2p-f+1 = n => p = (f-1)/2
By convention, in computer vision, f is always an odd number. Reasons being:
(1) It results in p being a whole number. This gives symmetric padding around the data source.
(2) It produces a central cell in the filter, which facilitates identifying the position of the filter.
Strided convolution:
- Instead of moving the filter by 1 position, move by the number of strides, s.
- Output size: (floor((n + 2p - f) / s) + 1) x (floor((n + 2p - f) / s) + 1). The floor denotes that the convolution is only carried out if the filter is completely inside the image (including the padding, if any); no part of the filter may hang outside the image.
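The output-size formula in one helper, covering the valid, same, and strided cases:

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1; the filter must lie fully inside the padded input."""
    return math.floor((n + 2 * p - f) / s) + 1

valid   = conv_output_size(6, 3)        # no padding: 6 - 3 + 1 = 4
same    = conv_output_size(6, 3, p=1)   # p = (f - 1) / 2 keeps the size at 6
strided = conv_output_size(7, 3, s=2)   # floor((7 - 3) / 2) + 1 = 3
```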
Convolution over volume:
For example, for images with 3 channels, RGB, the filter must also have the same number of channels. The output will NOT have any channel, i.e., 1 dimension less compared to the input.
Multiple filters:
Each filter works on different features and produces one output. All output will be stacked to form the final output.
f[l] = filter size
p[l] = padding
s[l] = stride
Nc[l] = filter#
Each filter: f[l] x f[l] x Nc[l-1]
Activations: a[l] -> Nh[l] x Nw[l] x Nc[l]
Weights: f[l] x f[l] x Nc[l-1] x Nc[l]
bias: Nc[l]; The bias vector is b, where each filter has its own (single) bias.
Input: Nh[l-1] x Nw[l-1] x Nc[l-1]
Output: Nh[l] x Nw[l] x Nc[l]
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1
Nc[l] = Filter#
One property of convolution network is that the number of parameters are independent of the size of the input. This reduces overfitting.
Example parameters# calculation:
Input: 256x256 RGB
First hidden layer: 64 neurons.
NO Convolution
Bias: Nc[l] = 64
This hidden layer has 256 * 256 * 3 * 64 + 64 = 12582976 parameters
Conv1D (C4_W4.pdf page 40):
Input: 128-length vectors with 1 timesteps
Filter: 5x1
Activations:N[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(128 - 5) + 1 = 124
Input: 14x14x3
Filter: 1 5x5x3
p: 0 ("same" padding would need p = (f-1)/2 = 2; here valid convolution is used)
Activations:
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(14 - 5) + 1 = 10
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(14 - 5) + 1 = 10
10 * 10 = 100
Weights: 5x5x3 = 75
Bias: Nc[l] = 1
This hidden layer has 75 + 1 = 76 parameters
Input: 256x256 grayscale image
Filter: 128 3x3
Each filter: 3x3x1 = 9
Activations:
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(256 - 3) + 1 = 254
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(256 - 3) + 1 = 254
254x254 = 64516
Weights: 3x3x1x128 = 1152
Bias: Nc[l] = 128
This hidden layer has 1152 + 128 = 1280 parameters
Input: 256x256 RGB
Filter: 128 7x7
Each filter: 7x7x3 = 147
Activations:
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(256 - 7) + 1 = 250
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(256 - 7) + 1 = 250
250*250*128 = 8000000
Weights: 7x7x3x 128 = 18816
Bias: Nc[l] = 128
This hidden layer has 18816 + 128 = 18944 parameters
Example output volume calculation:
Input: 121x121x16
Filter: 32 4x4 s:3 p:0
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(121 - 4)/3 + 1 = 40
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(121 - 4)/3 + 1 = 40
Output volume = Nh[l] x Nw[l] x Nc[l] = 40 * 40 * 32 = 51200
Input: 63x63x16
Filter: 32 7x7 s:2 p:0
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(63 - 7)/2 + 1 = 29
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(63 - 7)/2 + 1 = 29
Output volume = Nh[l] x Nw[l] x Nc[l] = 29 * 29 * 32 = 26912
Input: 128x128x12
Max Pooling: filter size:4, stride: 4
Nh[l] = floor((n+2p-f) / s) + 1 = floor((128 - 4)/4) + 1 = floor(124/4) + 1 = 31 + 1 = 32
Nw[l] = floor((n+2p-f) / s) + 1 = floor((128 - 4)/4) + 1 = floor(124/4) + 1 = 31 + 1 = 32
Output volume = Nh[l] x Nw[l] x Nc[l] = 32 x 32 x 12
Input: 66x66x21
Max Pooling: filter size:3, stride: 3
Nh[l] = floor((n+2p-f) / s) + 1 = floor((66 - 3)/3) + 1 = floor(63/3) + 1 = 21+1 = 22
Nw[l] = floor((n+2p-f) / s) + 1 = floor((66 - 3)/3) + 1 = floor(63/3) + 1 = 21+1 = 22
Output volume = Nh[l] x Nw[l] x Nc[l] = 22 x 22 x 21
Input: 3D data with 64x64x64x3
Filter: 4x4x4
#filters = 16
Padding: 0
Stride: 2
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(64 - 4)/2 + 1 = 31
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(64 - 4)/2 + 1 = 31
Nd[l] = floor((Nd[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(64 - 4)/2 + 1 = 31
Output volume = Nh[l] x Nw[l] x Nd[l] x Nc[l] = 31 x 31 x 31 x 16
2D:
Input: 14x14x3
Filter: 1 5x5x3 s:1 p:0
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(14 + 0 - 5)/1 + 1 = 10
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(14 + 0 - 5)/1 + 1 = 10
Output volume = Nh[l] x Nw[l] x Nc[l] = 10 x 10 x Nc[l]
Input: 10x10x16
Filter: 1 5x5x16 s:1 p:0
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(10 + 0 - 5)/1 + 1 = 6
Nw[l] = floor((Nw[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(10 + 0 - 5)/1 + 1 = 6
Output volume = Nh[l] x Nw[l] x Nc[l] = 6 x 6 x Nc[l]
1D:
Input: 14x1
Filter: 5x1 s:1 p:0
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(14 + 0 - 5)/1 + 1 = 10
Output volume = Nh[l] x Nw[l] x Nc[l] = 10 x Nc[l]
Input: 10x16
Filter: 5x16 s:1 p:0
Nh[l] = floor((Nh[l-1] + 2p[l] - f[l]) / s[l]) + 1 = floor(10 + 0 - 5)/1 + 1 = 6
Output volume = Nh[l] x Nw[l] x Nc[l] = 6 x Nc[l]
Typically in Convolutional NN, the size of the data will reduce deeper into the NN while the filter# will increase.
Types of layers in a Convolutional NN:
(1) Convolution
- Relatively few parameters.
(2) Pooling
- Reduces the size of the representation
- Speed up computation.
- Makes some of the features that it detects more robust.
(3) Fully Connected (Dense)
Pooling:
- Reducing the size of an input by sampling from regions in the input
- Get the most salient features from the input.
- Preserve detected features.
- Reduce the size of the input.
(1) Max or Average.
(2) Hyperparameters: f and s
(3) No parameters to learn. Gradient descent does not learn or change anything of it in this layer.
(4) Padding. Usually not used with Max pooling.
(5) Reduces/Shrinks Nh[l] and Nw[l]
1x1 convolution:
- To shrink channel#, use 1x1 filter. Example: (28,28,192) -> (28,28,32) can be achieved by a convolution layer of 32 x (1,1,192) filters with ReLU activation. This 1x1 conv filter is also called the bottleneck layer.
- When implemented within reason, this bottleneck layer helps reduce computational costs significantly by shrinking down the representation size significantly without affecting the performance of the NN.
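The saving can be made concrete by counting multiplications. Suppose the (28,28,192) -> (28,28,32) mapping uses 5x5 filters, either directly or through a 16-channel 1x1 bottleneck first (the 5x5 filter size and the 16 bottleneck channels are assumed for illustration):

```python
def conv_mults(h, w, c_out, f, c_in):
    # each of the h*w*c_out output values needs f*f*c_in multiplications
    return h * w * c_out * f * f * c_in

direct = conv_mults(28, 28, 32, 5, 192)          # 5x5 conv straight on 192 channels
bottleneck = (conv_mults(28, 28, 16, 1, 192)     # 1x1 conv down to 16 channels
              + conv_mults(28, 28, 32, 5, 16))   # then the 5x5 conv
# direct ~= 120.4M multiplications, bottleneck ~= 12.4M: roughly a 10x saving
```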