AI Training - Max #59
Conversation
Looks like references and links are still WIP so won't focus on that
* **Data is crucial:** Machine learning models learn from examples in training data. More high-quality, representative data leads to better model performance. Data needs to be processed and formatted for training.
* **Algorithms learn from data:** Different algorithms (neural networks, decision trees, etc.) have different approaches to finding patterns in data. Choosing the right algorithm for the task is important.
* **Training refines model parameters:** Model training adjusts internal parameters to find patterns in data. Advanced models like neural networks have many adjustable weights. Training iteratively adjusts weights to minimize a loss function.
* **Generalization is the goal:** A model that overfits to the training data will not generalize well. Regularization techniques (dropout, early stopping, etc.) reduce overfitting. Validation data is used to evaluate generalization.
* **Training takes compute resources:** Training complex models requires significant processing power and time. Hardware improvements and distributed training across GPUs/TPUs have enabled advances.
Suggestion: add links to other sections of the book or subsections within the chapter to help the reader jump around.
e.g. these points map onto the subsections below
good suggestion, I tried it but given the latest updates it doesn't seem to work well
We will walk you through these details in the rest of the sections. Understanding how to effectively leverage data, algorithms, parameter optimization, and generalization through thorough training is essential for developing capable, deployable AI systems that work robustly in the real world.
## Mathematics of Neural Networks
Looks like this section overlaps with Section 3 "Deep Learning Primer", though it goes into more detail here.
I suggest updating Section 3 to point to specific parts on this page.
Good catch, thanks @pongtr - will do!
After defining our neural network, we are given some training data, a set of points $\{(x_j, y_j)\}$ for $j = 1, \ldots, M$, and we want to evaluate how well our neural network fits this data. To do this, we introduce a **loss function**: a function that takes the output of the neural network on a particular datapoint ($N(x_j; W_1, ..., W_n)$), compares it against the "label" of that datapoint (the corresponding $y_j$), and outputs a single scalar (i.e., one real number) representing how well the neural network fits that particular datapoint. The final measure of how good the neural network is on the entire dataset is then just the average of the losses across all datapoints.
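As a hedged illustration (not from the chapter), the sketch below averages a per-example loss over a tiny dataset; the toy linear "model", squared-error loss, and data points are all made up for the example:

```python
def dataset_loss(model, data, loss_fn):
    """Average the per-example loss over the whole dataset."""
    total = sum(loss_fn(model(x), y) for x, y in data)
    return total / len(data)

# Toy "network" and squared-error loss, purely for illustration.
model = lambda x: 2.0 * x
squared_error = lambda prediction, label: (prediction - label) ** 2
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

print(dataset_loss(model, data, squared_error))  # average loss across the three points
```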
There are many different types of loss functions. For example, in image classification we might use the cross-entropy loss function, which measures how well two vectors that represent classification predictions agree (e.g., if our model predicts that an image is more likely a dog, but the label says it is a cat, it will return a high "loss", indicating a bad fit).
Could be helpful to show example with smaller "loss" to allow comparison.
E.g. predict "pug" but label was "bulldog"
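Along those lines, here is a hypothetical plain-Python sketch (not part of the chapter) that computes cross-entropy for a confidently wrong prediction and for a mostly correct one, so the high-loss and low-loss cases can be compared; the class probabilities are invented:

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Cross-entropy for one example: -log(probability assigned to the true class)."""
    return -math.log(predicted_probs[true_class])

# Classes: 0 = cat, 1 = dog (probabilities are made up for illustration).
confident_wrong = [0.1, 0.9]   # model says "dog", label is cat -> high loss
print(cross_entropy(confident_wrong, true_class=0))   # ~2.30

mostly_right = [0.8, 0.2]      # model leans "cat", label is cat -> low loss
print(cross_entropy(mostly_right, true_class=0))      # ~0.22
```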
* All gradients with respect to the weights should have the same shape as the weight matrices themselves
:::
The entire backpropagation process can be complex, especially for networks that are very deep. Fortunately, machine learning frameworks like PyTorch support automatic differentiation, which performs backpropagation for us. In these frameworks we simply specify the forward pass, and the derivatives are computed automatically. Nevertheless, it is beneficial to understand the theoretical process happening under the hood in these machine learning frameworks.
Link to Frameworks section
thanks! done!
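For concreteness, a minimal PyTorch sketch of that workflow: only the forward pass is written out, and `loss.backward()` fills in every gradient automatically (the layer sizes here are arbitrary placeholders):

```python
import torch

# Two weight matrices for a tiny two-layer network; only the forward pass is specified.
W1 = torch.randn(4, 3, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)

x = torch.randn(3)          # one input example
y = torch.tensor([1.0])     # its label

hidden = torch.relu(W1 @ x)
prediction = W2 @ hidden
loss = (prediction - y).pow(2).mean()

loss.backward()             # automatic differentiation computes all gradients

# Each gradient has the same shape as the weight matrix it corresponds to.
print(W1.grad.shape, W2.grad.shape)   # torch.Size([4, 3]) torch.Size([1, 4])
```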
| Data Split | Purpose | Typical Size |
|------------|---------|--------------|
| Training Set | Train the model parameters | 60-80% of total data |
| Validation Set | Evaluate the model during training to tune hyperparameters and prevent overfitting | ∼20% of total data |
| Test Set | Provide an unbiased evaluation of the final trained model | ∼20% of total data |
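As a rough illustration of these splits, here is a small hypothetical helper that shuffles a dataset and carves it into train/validation/test subsets; the 60/20/20 fractions mirror the table:

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then split into train / validation / test subsets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_train = int(train_frac * len(examples))
    n_val = int(val_frac * len(examples))
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    test = examples[n_train + n_val:]          # whatever remains, here ~20%
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))         # 60 20 20
```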
Representing this in the table makes it easy to read and compare 👍
training.qmd (Outdated)
Proper weight initialization helps in overcoming issues like vanishing or exploding gradients, which can hinder the learning process. Here are some commonly used neural network weight initialization techniques:
- Computational Constraints
- Data Privacy
- Ethical Considerations

## Conclusion

as orthogonal matrices. This helps in preserving the gradients during backpropagation and can be particularly useful in recurrent neural networks (RNNs).

Uniform and Normal Initialization:

Weights are initialized with random values drawn from a uniform or normal distribution. The choice between uniform and normal depends on the specific requirements of the model and the activation functions used.

Choosing the right initialization method depends on the architecture of the neural network, the activation functions, and the specific problem being solved. Experimentation with different techniques is often necessary to find the most suitable initialization for a given scenario.
Looks like something funky going on here when text was copied over
Cleaned it all up in the latest version
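To make the initialization schemes mentioned above concrete, here is a minimal sketch using PyTorch's `torch.nn.init` utilities; the layer dimensions and scale values are arbitrary placeholders, not recommendations:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Orthogonal initialization: keeps the weight matrix orthogonal, which helps
# preserve gradient norms during backpropagation (useful in RNNs).
nn.init.orthogonal_(layer.weight)

# Uniform or normal initialization with an explicit, hand-picked scale.
nn.init.uniform_(layer.weight, a=-0.05, b=0.05)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# Biases are commonly just zeroed out.
nn.init.zeros_(layer.bias)
```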
training.qmd (Outdated)
Activation functions play a critical role in neural networks by introducing non-linearities into the model. These non-linearities enable neural networks to learn complex relationships and patterns in data, making them capable of solving a wide range of problems. Here are some commonly used activation functions:
* Rectified Linear Unit (ReLU): ReLU is a popular activation function that returns zero for negative input values and passes positive input values unchanged. It is computationally efficient and has been widely used in deep learning models.

* Sigmoid Function: The sigmoid function, also known as the logistic function, squashes input values between 0 and 1. It is often used in the output layer of binary classification models, where the goal is to produce probabilities.

* Hyperbolic Tangent Function (tanh): The hyperbolic tangent function is similar to the sigmoid but squashes input values between -1 and 1. It is often used in hidden layers of neural networks, especially when zero-centered outputs are desired.

![Common activation functions](https://1394217531-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LvBP1svpACTB1R1x_U4%2F-LvNWUoWieQqaGmU_gl9%2F-LvO3qs2RImYjpBE8vln%2Factivation-functions3.jpg?alt=media&token=f96a3007-5888-43c3-a256-2dafadd5df7c){width=70%}
The diagram shows Linear as well. Could be worth discussing that here too, and when it might be desired.
Good point, done - thanks!
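For reference, a small PyTorch sketch evaluating these activations, including the linear/identity case shown in the figure, on a few sample inputs; the input range is arbitrary:

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=7)

relu = torch.relu(x)         # 0 for negative inputs, identity for positive inputs
sigmoid = torch.sigmoid(x)   # squashes values into (0, 1)
tanh = torch.tanh(x)         # squashes values into (-1, 1), zero-centered
linear = x                   # linear/identity "activation": leaves values unchanged

for name, values in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh), ("linear", linear)]:
    print(name, values)
```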
training.qmd (Outdated)
performance. Generally, a good rule of thumb for batch size is between 8 and 128.
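As a hypothetical illustration of a batch size in that range, the sketch below builds a PyTorch `DataLoader` with a batch size of 32 over placeholder random data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1000 random examples with 10 features each.
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

# A batch size of 32 sits comfortably inside the suggested 8-128 range.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for inputs, labels in loader:
    print(inputs.shape)   # torch.Size([32, 10])
    break
```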
### Optimizing Matrix Multiplication
Can link to AI Acceleration section
training.qmd (Outdated)
## Training Parallelization
Training can be computationally intensive. As outlined above, backpropagation can be expensive both in terms of computation and memory. In terms of computation, backpropagation requires many large matrix multiplications, which involve considerable arithmetic operations. In terms of memory, backpropagation requires storing the model parameters and intermediate activations, which can be significant when training large models. Handling these difficult system challenges is key to training large models. To mitigate these computational and memory challenges, parallelization is necessary. Broadly, there are two different approaches to parallelizing training: Data Parallel and Model Parallel, discussed below.
Suggested change: "... two different approaches to parallelizing training: Data Parallel and Model Parallel, discussed below." → "... two different approaches to parallelizing training: Data Parallelism and Model Parallelism, discussed below."
Do you mean "Parallelism" / "Parallelization"?
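For illustration only (not from the chapter), a rough PyTorch sketch contrasting the two approaches: data parallelism via `nn.DataParallel`, and model parallelism by placing layers on different devices. The device names are assumptions, and the sketch is only meaningful on a multi-GPU machine:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))

# Data parallelism: replicate the whole model and split each batch across GPUs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()

# Model parallelism (sketch): place different layers on different devices so a
# model too large for a single GPU can still be trained.
# layer1 = nn.Linear(512, 2048).to("cuda:0")
# layer2 = nn.Linear(2048, 10).to("cuda:1")
# output = layer2(torch.relu(layer1(x.to("cuda:0"))).to("cuda:1"))
```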