
AI Training - Max #59

Merged: 51 commits merged into harvard-edge:main on Dec 7, 2023

Conversation


@agnusmaximus commented Nov 14, 2023

Before submitting your Pull Request, please ensure that you have carefully reviewed and completed all items on this checklist.

  1. Content

    • The chapter content is complete and covers the topic in detail.
    • All technical terms are well-defined and explained.
    • Any code snippets or algorithms are well-documented and tested.
    • The chapter follows a logical flow and structure.
  2. References & Citations

    • All references are correctly listed at the end of the chapter.
    • In-text citations are used appropriately and match the references.
    • All figures, tables, and images have proper sources and are cited correctly.
  3. Quarto Website Rendering

    • The chapter has been locally built and tested using Quarto.
    • All images, figures, and tables render properly without any glitches.
    • All images have a source or they are properly linked to external sites.
    • Any interactive elements or widgets work as intended.
    • The chapter's formatting is consistent with the rest of the book.
  4. Grammar & Style

    • The chapter has been proofread for grammar and spelling errors.
    • The writing style is consistent with the rest of the book.
    • Any jargon is clearly explained or avoided where possible.
  5. Collaboration

    • All group members have reviewed and approved the chapter.
    • Any feedback from previous reviews or discussions has been addressed.
  6. Miscellaneous

    • All external links (if any) are working and lead to the intended destinations.
    • If datasets or external resources are used, they are properly credited and linked.
    • Any necessary permissions for reused content have been obtained.
  7. Final Steps

    • The chapter is pushed to the correct branch on the repository.
    • The Pull Request is made with a clear title and description.
    • The Pull Request includes any necessary labels or tags.
    • The Pull Request mentions any stakeholders or reviewers who should take a look.

@profvjreddi added the cs249r and new (new course content) labels and removed the cs249r label Nov 14, 2023
@profvjreddi marked this pull request as draft November 14, 2023 21:36

@pongtr (Contributor) left a comment:

Looks like references and links are still WIP so won't focus on that

Comment on lines +34 to +38
* **Data is crucial:** Machine learning models learn from examples in training data. More high-quality, representative data leads to better model performance. Data needs to be processed and formatted for training.
* **Algorithms learn from data:** Different algorithms (neural networks, decision trees, etc.) have different approaches to finding patterns in data. Choosing the right algorithm for the task is important.
* **Training refines model parameters:** Model training adjusts internal parameters to find patterns in data. Advanced models like neural networks have many adjustable weights. Training iteratively adjusts weights to minimize a loss function.
* **Generalization is the goal:** A model that overfits to the training data will not generalize well. Regularization techniques (dropout, early stopping, etc.) reduce overfitting. Validation data is used to evaluate generalization.
* **Training takes compute resources:** Training complex models requires significant processing power and time. Hardware improvements and distributed training across GPUs/TPUs have enabled advances.

Contributor:

Suggestion: add links to other sections of the book or subsections within chapter to help reader jump around

eg. these points map onto the subsections below


Contributor:

good suggestion, I tried it but given the latest updates it doesn't seem to work well

We will walk you through these details in the rest of the sections. Understanding how to effectively leverage data, algorithms, parameter optimization, and generalization through thorough training is essential for developing capable, deployable AI systems that work robustly in the real world.
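
To make the iterative weight-adjustment idea concrete, here is a minimal training-loop sketch in PyTorch; the model, data, and hyperparameters are synthetic placeholders rather than anything prescribed by the chapter.

```python
import torch
import torch.nn as nn

# Synthetic regression data: 100 examples with 4 features each.
X = torch.randn(100, 4)
y = torch.randn(100, 1)

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass and loss evaluation
    loss.backward()              # backpropagation computes the gradients
    optimizer.step()             # adjust weights to reduce the loss
```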


## Mathematics of Neural Networks

Contributor:

Looks like this section has overlaps with section 3 "Deep Learning Primer" though it goes into more detail here.

I suggest updating Section 3 to point to specific parts on this page.

Contributor:

Good catch, thanks @pongtr - will do!


After defining our neural network, we are given some training data, which is a set of points $\{(x_j, y_j)\}$ for $j = 1, \ldots, M$, and we want to evaluate how well our neural network fits this data. To do this, we introduce a **loss function**: a function that takes the output of the neural network on a particular datapoint, $N(x_j; W_1, \ldots, W_n)$, compares it against the "label" of that datapoint (the corresponding $y_j$), and outputs a single numerical scalar (i.e., one real number) representing how well the neural network fits that particular datapoint. The final measure of how good the neural network is on the entire dataset is then simply the average of the losses across all datapoints.

There are many different types of loss functions. For example, in image classification we might use the cross-entropy loss, which measures how well two vectors representing classification predictions agree (e.g., if our model predicts that an image is most likely a dog, but the label says it is a cat, the loss will be high, indicating a bad fit).

Contributor:

Could be helpful to show example with smaller "loss" to allow comparison.
E.g. predict "pug" but label was "bulldog"
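
Along the lines of this suggestion, here is a minimal sketch of how cross-entropy loss behaves for a badly-wrong prediction versus a nearly-correct one; the class mapping and logit values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical label mapping for illustration: 0 = cat, 1 = pug, 2 = bulldog.
label = torch.tensor([2])  # ground truth: bulldog

wrong_logits = torch.tensor([[4.0, 0.5, 0.5]])  # confidently predicts "cat"
close_logits = torch.tensor([[0.2, 2.5, 2.0]])  # predicts "pug", close to "bulldog"

print(F.cross_entropy(wrong_logits, label))  # high loss, roughly 3.6
print(F.cross_entropy(close_logits, label))  # much lower loss, roughly 1.0
```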

* All gradients with respect to the weights should have the same shape as the weight matrices themselves
:::

The entire backpropagation process can be complex, especially for networks that are very deep. Fortunately, machine learning frameworks like PyTorch support automatic differentiation, which performs backpropagation for us. In these machine learning frameworks we simply need to specify the forward pass, and the derivatives will be automatically computed for us. Nevertheless, it is beneficial to understand the theoretical process that is happening under the hood in these machine-learning frameworks.
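
As a minimal sketch of what that looks like in practice (the values here are chosen arbitrarily), PyTorch's autograd computes the gradients once the forward pass is defined:

```python
import torch

# Define only the forward computation; autograd records the operations.
w = torch.tensor([2.0, -1.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
loss = ((w * x).sum() - 1.0) ** 2

loss.backward()  # backpropagation performed automatically
print(w.grad)    # gradient of the loss with respect to w: tensor([6., 8.])
```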

Contributor:

Link to Frameworks section

Contributor:

thanks! done!

Comment on lines +257 to +261
| Data Split | Purpose | Typical Size |
|-|-|-|
| Training Set | Train the model parameters | 60-80% of total data |
| Validation Set | Evaluate model during training to tune hyperparameters and prevent overfitting | ∼20% of total data |
| Test Set | Provide unbiased evaluation of final trained model | ∼20% of total data |
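
For reference, a minimal sketch of producing the three splits with scikit-learn on synthetic data; the 60/20/20 proportions are just one choice within the ranges in the table.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)            # synthetic features
y = np.random.randint(0, 2, size=1000)  # synthetic binary labels

# Hold out 40% of the data, then split that portion half-and-half
# into validation and test sets (20% of the total each).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```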

Contributor:

Representing this in the table makes it easy to read and compare 👍

training.qmd Outdated
Comment on lines 512 to 519
Proper weight initialization helps in overcoming issues like vanishing or exploding gradients, which can hinder the learning process. Here are some commonly used neural network weight initialization techniques:

- Computational Constraints
- Data Privacy
- Ethical Considerations

## Conclusion
as orthogonal matrices. This helps in preserving the gradients during backpropagation and can be particularly useful in recurrent neural networks (RNNs).
Uniform and Normal Initialization:

Weights are initialized with random values drawn from a uniform or normal distribution. The choice between uniform and normal depends on the specific requirements of the model and the activation functions used.
Choosing the right initialization method depends on the architecture of the neural network, the activation functions, and the specific problem being solved. Experimentation with different techniques is often necessary to find the most suitable initialization for a given scenario.

Contributor:

Looks like something funky going on here when text was copied over

Contributor:

Cleaned it all up in the latest version

training.qmd Outdated
Comment on lines 531 to 554
Activation functions play a critical role in neural networks by introducing non-linearities into the model. These non-linearities enable neural networks to learn complex relationships and patterns in data, making them capable of solving a wide range of problems. Here are some commonly used activation functions:

Explanation: A summary helps to consolidate the key points of the chapter, aiding in better retention and understanding of the material.
* Rectified Linear Unit (ReLU):

ReLU is a popular activation function that returns zero for
negative input values and passes positive input values
unchanged. It is computationally efficient and has been widely
used in deep learning models.

* Sigmoid Function

The sigmoid function, also known as the logistic function,
squashes input values between 0 and 1. It is often used in the
output layer of binary classification models, where the goal is to
produce probabilities.

* Hyperbolic Tangent Function (tanh):

The hyperbolic tangent function is similar to the sigmoid but
squashes input values between -1 and 1. It is often used in hidden
layers of neural networks, especially when zero-centered outputs
are desired.

![Common activation functions](https://1394217531-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LvBP1svpACTB1R1x_U4%2F-LvNWUoWieQqaGmU_gl9%2F-LvO3qs2RImYjpBE8vln%2Factivation-functions3.jpg?alt=media&token=f96a3007-5888-43c3-a256-2dafadd5df7c){width=70%}
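
For concreteness, a minimal NumPy sketch of the three activation functions listed above (the input values are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # zero for negatives, identity for positives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes inputs into (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes inputs into (-1, 1), zero-centered

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x))
```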

Contributor:

Diagram shows Linear as well. Could be worth discussing that here as well about when it might be desired

Contributor:

Good point, done - thanks!

training.qmd Outdated
performance. Generally, a good rule of thumb for batch size is between
8-128.
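
As a minimal sketch (a batch size of 32 is an arbitrary choice within that range), the batch size is typically set when constructing the data loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset of 1,000 examples with 10 features each.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for features, labels in loader:
    pass  # each iteration yields one mini-batch of 32 examples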

### Optimizing Matrix Multiplication

Contributor:

Can link to AI Acceleration section

training.qmd Outdated

## Training Parallelization

Training can be computationally intensive. As outlined above, backpropagation can be expensive both in terms of computation and memory. In terms of computation, backpropagation requires many large matrix multiplications which require considerable arithmetic operation. In terms of memory, backpropagation requires storing the model parameters and intermediate activations in memory, which can be significant if training large models. Handling these difficult system challenges is key to training large models. To mitigate these computational and memory challenges, parallelization is necessary. Broadly, there are two different approaches to parallelizing training: Data Parallel and Model Parallel, discussed below.

Contributor:

Suggested change
Training can be computationally intensive. As outlined above, backpropagation can be expensive both in terms of computation and memory. In terms of computation, backpropagation requires many large matrix multiplications which require considerable arithmetic operation. In terms of memory, backpropagation requires storing the model parameters and intermediate activations in memory, which can be significant if training large models. Handling these difficult system challenges is key to training large models. To mitigate these computational and memory challenges, parallelization is necessary. Broadly, there are two different approaches to parallelizing training: Data Parallel and Model Parallel, discussed below.
Training can be computationally intensive. As outlined above, backpropagation can be expensive both in terms of computation and memory. In terms of computation, backpropagation requires many large matrix multiplications which require considerable arithmetic operation. In terms of memory, backpropagation requires storing the model parameters and intermediate activations in memory, which can be significant if training large models. Handling these difficult system challenges is key to training large models. To mitigate these computational and memory challenges, parallelization is necessary. Broadly, there are two different approaches to parallelizing training: Data Parallelism and Model Parallelism, discussed below.

Do you mean "Parallelism" / "Parallelization"?
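
For readers, a minimal sketch of the data-parallel idea using torch.nn.DataParallel, which replicates the model across available GPUs and splits each input batch among them (DistributedDataParallel is the usual choice at larger scale); the model here is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)          # placeholder model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model and split each batch across GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```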

training.qmd Outdated (resolved conversation hidden)
@profvjreddi marked this pull request as ready for review December 7, 2023 15:39
@profvjreddi merged commit 4b42643 into harvard-edge:main Dec 7, 2023
3 checks passed
Labels: new (new course content)

4 participants