AI Training - Max #59
Conversation
Looks like references and links are still WIP so won't focus on that
* **Data is crucial:** Machine learning models learn from examples in training data. More high-quality, representative data leads to better model performance. Data needs to be processed and formatted for training.
* **Algorithms learn from data:** Different algorithms (neural networks, decision trees, etc.) have different approaches to finding patterns in data. Choosing the right algorithm for the task is important.
* **Training refines model parameters:** Model training adjusts internal parameters to find patterns in data. Advanced models like neural networks have many adjustable weights. Training iteratively adjusts weights to minimize a loss function.
* **Generalization is the goal:** A model that overfits to the training data will not generalize well. Regularization techniques (dropout, early stopping, etc.) reduce overfitting. Validation data is used to evaluate generalization.
* **Training takes compute resources:** Training complex models requires significant processing power and time. Hardware improvements and distributed training across GPUs/TPUs have enabled advances.
Suggestion: add links to other sections of the book or subsections within the chapter to help the reader jump around.
e.g. these points map onto the subsections below
good suggestion, I tried it but given the latest updates it doesn't seem to work well
We will walk you through these details in the rest of the sections. Understanding how to effectively leverage data, algorithms, parameter optimization, and generalization through thorough training is essential for developing capable, deployable AI systems that work robustly in the real world.
## Mathematics of Neural Networks
Looks like this section overlaps with Section 3 "Deep Learning Primer", though it goes into more detail here.
I suggest updating Section 3 to point to specific parts on this page.
Good catch, thanks @pongtr - will do!
After defining our neural network, we are given some training data, a set of points $\{(x_j, y_j)\}$ for $j = 1, \ldots, M$, and we want to evaluate how well our neural network fits this data. To do this, we introduce a **loss function**: a function that takes the output of the neural network on a particular datapoint ($N(x_j; W_1, ..., W_n)$), compares it against the "label" of that datapoint (the corresponding $y_j$), and outputs a single scalar (i.e., one real number) representing how well the neural network fits that particular datapoint. The final measure of how good the neural network is on the entire dataset is then just the average of the losses across all datapoints.
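As a hedged illustration (not from the chapter), the sketch below averages a per-example loss over a tiny dataset; the toy linear "model", squared-error loss, and data points are all made up for the example:

```python
def dataset_loss(model, data, loss_fn):
    """Average the per-example loss over the whole dataset."""
    total = sum(loss_fn(model(x), y) for x, y in data)
    return total / len(data)

# Toy "network" and squared-error loss, purely for illustration.
model = lambda x: 2.0 * x
squared_error = lambda prediction, label: (prediction - label) ** 2
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

print(dataset_loss(model, data, squared_error))  # average loss across the three points
```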
There are many different types of loss functions. For example, in image classification we might use the cross-entropy loss function, which measures how well two vectors that represent classification predictions agree (e.g., if our model predicts that an image is more likely a dog, but the label says it is a cat, it will return a high "loss", indicating a bad fit).
Could be helpful to show example with smaller "loss" to allow comparison.
E.g. predict "pug" but label was "bulldog"
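Along those lines, here is a hypothetical plain-Python sketch (not part of the chapter) that computes cross-entropy for a confidently wrong prediction and for a mostly correct one, so the high-loss and low-loss cases can be compared; the class probabilities are invented:

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Cross-entropy for one example: -log(probability assigned to the true class)."""
    return -math.log(predicted_probs[true_class])

# Classes: 0 = cat, 1 = dog (probabilities are made up for illustration).
confident_wrong = [0.1, 0.9]   # model says "dog", label is cat -> high loss
print(cross_entropy(confident_wrong, true_class=0))   # ~2.30

mostly_right = [0.8, 0.2]      # model leans "cat", label is cat -> low loss
print(cross_entropy(mostly_right, true_class=0))      # ~0.22
```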
* All gradients with respect to the weights should have the same shape as the weight matrices themselves
:::
The entire backpropagation process can be complex, especially for networks that are very deep. Fortunately, machine learning frameworks like PyTorch support automatic differentiation, which performs backpropagation for us. In these frameworks we simply specify the forward pass, and the derivatives are computed automatically. Nevertheless, it is beneficial to understand the theoretical process happening under the hood in these machine learning frameworks.
Link to Frameworks section
thanks! done!
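For concreteness, a minimal PyTorch sketch of that workflow: only the forward pass is written out, and `loss.backward()` fills in every gradient automatically (the layer sizes here are arbitrary placeholders):

```python
import torch

# Two weight matrices for a tiny two-layer network; only the forward pass is specified.
W1 = torch.randn(4, 3, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)

x = torch.randn(3)          # one input example
y = torch.tensor([1.0])     # its label

hidden = torch.relu(W1 @ x)
prediction = W2 @ hidden
loss = (prediction - y).pow(2).mean()

loss.backward()             # automatic differentiation computes all gradients

# Each gradient has the same shape as the weight matrix it corresponds to.
print(W1.grad.shape, W2.grad.shape)   # torch.Size([4, 3]) torch.Size([1, 4])
```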
| Data Split | Purpose | Typical Size |
|------------|---------|--------------|
| Training Set | Train the model parameters | 60-80% of total data |
| Validation Set | Evaluate the model during training to tune hyperparameters and prevent overfitting | ∼20% of total data |
| Test Set | Provide an unbiased evaluation of the final trained model | ∼20% of total data |
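As a rough illustration of these splits, here is a small hypothetical helper that shuffles a dataset and carves it into train/validation/test subsets; the 60/20/20 fractions mirror the table:

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then split into train / validation / test subsets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_train = int(train_frac * len(examples))
    n_val = int(val_frac * len(examples))
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    test = examples[n_train + n_val:]          # whatever remains, here ~20%
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))         # 60 20 20
```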
Representing this in the table makes it easy to read and compare 👍
training.qmd (Outdated)
Proper weight initialization helps in overcoming issues like vanishing or exploding gradients, which can hinder the learning process. Here are some commonly used neural network weight initialization techniques:
- Computational Constraints
- Data Privacy
- Ethical Considerations

## Conclusion

as orthogonal matrices. This helps in preserving the gradients during backpropagation and can be particularly useful in recurrent neural networks (RNNs).

Uniform and Normal Initialization:

Weights are initialized with random values drawn from a uniform or normal distribution. The choice between uniform and normal depends on the specific requirements of the model and the activation functions used.

Choosing the right initialization method depends on the architecture of the neural network, the activation functions, and the specific problem being solved. Experimentation with different techniques is often necessary to find the most suitable initialization for a given scenario.
Looks like something funky going on here when text was copied over
Cleaned it all up in the latest version
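To make the initialization schemes mentioned above concrete, here is a minimal sketch using PyTorch's `torch.nn.init` utilities; the layer dimensions and scale values are arbitrary placeholders, not recommendations:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Orthogonal initialization: keeps the weight matrix orthogonal, which helps
# preserve gradient norms during backpropagation (useful in RNNs).
nn.init.orthogonal_(layer.weight)

# Uniform or normal initialization with an explicit, hand-picked scale.
nn.init.uniform_(layer.weight, a=-0.05, b=0.05)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# Biases are commonly just zeroed out.
nn.init.zeros_(layer.bias)
```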
training.qmd (Outdated)
Activation functions play a critical role in neural networks by introducing non-linearities into the model. These non-linearities enable neural networks to learn complex relationships and patterns in data, making them capable of solving a wide range of problems. Here are some commonly used activation functions:
* Rectified Linear Unit (ReLU): ReLU is a popular activation function that returns zero for negative input values and passes positive input values unchanged. It is computationally efficient and has been widely used in deep learning models.

* Sigmoid Function: The sigmoid function, also known as the logistic function, squashes input values between 0 and 1. It is often used in the output layer of binary classification models, where the goal is to produce probabilities.

* Hyperbolic Tangent Function (tanh): The hyperbolic tangent function is similar to the sigmoid but squashes input values between -1 and 1. It is often used in hidden layers of neural networks, especially when zero-centered outputs are desired.

![Common activation functions](https://1394217531-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LvBP1svpACTB1R1x_U4%2F-LvNWUoWieQqaGmU_gl9%2F-LvO3qs2RImYjpBE8vln%2Factivation-functions3.jpg?alt=media&token=f96a3007-5888-43c3-a256-2dafadd5df7c){width=70%}
The diagram shows Linear as well. Could be worth discussing that here too, and when it might be desired.
Good point, done - thanks!
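For reference, a small PyTorch sketch evaluating these activations, including the linear/identity case shown in the figure, on a few sample inputs; the input range is arbitrary:

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=7)

relu = torch.relu(x)         # 0 for negative inputs, identity for positive inputs
sigmoid = torch.sigmoid(x)   # squashes values into (0, 1)
tanh = torch.tanh(x)         # squashes values into (-1, 1), zero-centered
linear = x                   # linear/identity "activation": leaves values unchanged

for name, values in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh), ("linear", linear)]:
    print(name, values)
```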
training.qmd (Outdated)
performance. Generally, a good rule of thumb for batch size is between 8 and 128.
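As a hypothetical illustration of a batch size in that range, the sketch below builds a PyTorch `DataLoader` with a batch size of 32 over placeholder random data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1000 random examples with 10 features each.
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

# A batch size of 32 sits comfortably inside the suggested 8-128 range.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for inputs, labels in loader:
    print(inputs.shape)   # torch.Size([32, 10])
    break
```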
### Optimizing Matrix Multiplication
Can link to AI Acceleration section
training.qmd (Outdated)
## Training Parallelization
Training can be computationally intensive. As outlined above, backpropagation can be expensive both in terms of computation and memory. In terms of computation, backpropagation requires many large matrix multiplications, which involve considerable arithmetic operations. In terms of memory, backpropagation requires storing the model parameters and intermediate activations, which can be significant when training large models. Handling these difficult system challenges is key to training large models. To mitigate these computational and memory challenges, parallelization is necessary. Broadly, there are two different approaches to parallelizing training: Data Parallel and Model Parallel, discussed below.
Suggested change: "... two different approaches to parallelizing training: Data Parallel and Model Parallel, discussed below." → "... two different approaches to parallelizing training: Data Parallelism and Model Parallelism, discussed below."
Do you mean "Parallelism" / "Parallelization"?
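For illustration only (not from the chapter), a rough PyTorch sketch contrasting the two approaches: data parallelism via `nn.DataParallel`, and model parallelism by placing layers on different devices. The device names are assumptions, and the sketch is only meaningful on a multi-GPU machine:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))

# Data parallelism: replicate the whole model and split each batch across GPUs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()

# Model parallelism (sketch): place different layers on different devices so a
# model too large for a single GPU can still be trained.
# layer1 = nn.Linear(512, 2048).to("cuda:0")
# layer2 = nn.Linear(2048, 10).to("cuda:1")
# output = layer2(torch.relu(layer1(x.to("cuda:0"))).to("cuda:1"))
```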