Description • Features • Installation • Examples • Details • License
kittygrad is a pet project that offers a familiar toolkit for training neural networks, written entirely in Python using NumPy. Its goal is to build a minimal foundation for implementing popular deep learning techniques, and along the way it may help those passing by understand some of the details of this area of machine learning. The core concept is a dynamic computational graph over tensors, and for simplicity the API stays as close to PyTorch as possible.
The project is still unfinished and currently offers the following functionality:
- Autograd engine (with 25+ operations)
- Computational graph visualization
- Custom functions support
- `no_grad` decorator and context manager
At the moment, the best way to install kittygrad is directly from this repository:

```bash
git clone https://github.com/tejpaper/kittygrad.git
cd kittygrad
pip install -r requirements.txt
```

Unit tests are run with pytest. To check the correctness of gradients, the results are compared element-wise (within a tolerance) against the same operations in PyTorch (see the Comparison class in test/conftest.py).
As a consequence, these packages must be installed before running the tests:

```bash
pip install -r requirements/test.txt
```

Then simply:

```bash
pytest
```

For a large number of examples, see the examples folder.
```python
import kittygrad as kitty

a = kitty.tensor([1, 2, 3], requires_grad=True)
b = a.std()
b.retain_grad()

c = b * kitty.dot(a, a)
print(c)  # tensor(14., grad_fn=<MulBackward>)

c.backward()
print(a.grad)  # tensor([-5., 4., 13.])
print(b.grad)  # tensor(14.)
```

```python
import kittygrad as kitty

a, b = kitty.randn(2, 2, 3)
a.requires_grad = True
with kitty.CompGraph() as dot:
    c = a + 42
    c **= 2
    _ = c * b

dot.view(cleanup=True)  # or just dot in case of a Jupyter cell
```

The round shape of a node denotes an operation on tensors or gradient flow, while the rectangular shape denotes the tensors themselves. Colors have the following interpretation:
- yellow means no relation to auto-differentiation,
- green color denotes tensors whose gradients we are interested in,
- brown denotes the part of forward computations for which the backward graph is created,
- all backward nodes are marked in blue.
Inspired by this YouTube video.
To handle edge cases properly and provide greater flexibility, kittygrad carries out a few preliminary steps before executing an operation on a tensor or computing its gradient. Let's briefly describe the whole process.
Let's assume that we want to perform an operation on a tensor involving another tensor, a NumPy array, or a scalar. First, we call the frontend function (for example, this could be the dunder method __mul__). Next, if necessary, the two operands are cast into tensor form using the autocast decorator. If the operation is element-wise, broadcasting is also applied to align shapes. Then, within the frontend function, additional conditions are checked to determine the validity of the operation. If everything is OK, the backend function is called, receiving an additional context variable in which it stores intermediate values that may be needed in the backward pass. The actual computation then takes place and, depending on the requires_grad attribute of the resulting tensor, a backward node is created; the resulting tensor is returned holding a reference to this node.
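To make the pipeline more concrete, here is a minimal, self-contained sketch of the frontend/backend split. All names (Tensor, autocast, mul_backend, MulBackward) are illustrative and do not correspond to kittygrad's actual internals:

```python
import numpy as np


class MulBackward:
    """Toy backward node: keeps the context needed to compute gradients later."""

    def __init__(self, ctx):
        self.ctx = ctx

    def propagate(self, grad):
        a, b = self.ctx["saved"]
        return grad * b, grad * a  # d(a*b)/da and d(a*b)/db


class Tensor:
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data, dtype=np.float32)
        self.requires_grad = requires_grad
        self.grad_fn = None

    def __mul__(self, other):        # frontend: entry point of the operation
        other = autocast(other)      # cast a scalar or ndarray to a Tensor
        # validity checks (dtypes, shapes, ...) would happen here
        ctx = {}                     # context handed over to the backend
        out = mul_backend(ctx, self, other)
        if out.requires_grad:        # attach a backward node only when needed
            out.grad_fn = MulBackward(ctx)
        return out


def autocast(value):
    return value if isinstance(value, Tensor) else Tensor(value)


def mul_backend(ctx, a, b):
    ctx["saved"] = (a.data, b.data)  # save intermediates for the backward pass
    out = Tensor(a.data * b.data)
    out.requires_grad = a.requires_grad or b.requires_grad
    return out
```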
The autocast decorator uses the type method and the broadcast_tensors function, both of which are themselves differentiable. Therefore, a single operation may generate multiple backward nodes at once. However, this doesn't mean that complex operations are implemented as compositions of simpler ones: for example, computing the variance does not create a MeanBackward node, even though the mean value is computed along the way.
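A rough illustration with kittygrad itself (assuming scalar tensors and broadcasting behave as in PyTorch; the exact set and names of the generated backward nodes are not guaranteed here):

```python
import kittygrad as kitty

a = kitty.tensor([1., 2., 3.], requires_grad=True)  # shape (3,)
b = kitty.tensor(2., requires_grad=True)            # scalar tensor (assumed API)

# A single frontend call, but broadcasting b up to shape (3,) is itself
# a differentiable step, so it may contribute its own backward node(s)
# in addition to the multiplication's node.
c = a * b

kitty.dot(c, c).backward()
print(a.grad)  # 2 * c * b = [8., 16., 24.]
print(b.grad)  # sum(2 * c * a) = 56.
```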
The computation of gradients begins with calling the backward method on the final tensor. Instead of using topological sorting and BFS, kittygrad traverses the backward graph with DFS and the concept of locks. The backward node of each tensor computed in the forward pass has a _lock attribute indicating the number of operations in which this tensor participated. When a gradient update arrives at such a node, it is accumulated and the lock is decremented; only when the lock drops to zero is the gradient propagated further. This simulates topological sorting without an additional graph traversal and, most importantly, makes it possible to use lock values to verify the correctness of the constructed computational graph and of the backward call itself. For example, kittygrad will promptly inform you if you initiate the process from the middle of the computational graph or if there are multiple final tensors awaiting the backward method call.
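A toy sketch of the lock idea (the class, method, and attribute names below are illustrative, not kittygrad's own):

```python
class BackwardNode:
    """Toy node: accumulates gradients from its consumers before propagating."""

    def __init__(self, next_nodes=()):
        self.next_nodes = next_nodes  # nodes that produced this node's inputs
        self._lock = 0                # number of operations consuming this tensor
        self._grad = None             # gradient accumulated so far

    def register_consumer(self):
        self._lock += 1               # called once per operation using the tensor

    def accumulate(self, grad):
        self._grad = grad if self._grad is None else self._grad + grad
        self._lock -= 1
        if self._lock < 0:
            raise RuntimeError("inconsistent graph: too many gradient updates")
        if self._lock == 0:
            # every consumer has reported: continue the DFS downwards
            for node, grad_part in zip(self.next_nodes, self.propagate(self._grad)):
                node.accumulate(grad_part)

    def propagate(self, grad):
        raise NotImplementedError     # each concrete operation defines its own rule
```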
Backpropagation ends in the AccumulateGrad nodes, which transfer the accumulated gradients to the grad attribute of the tensor itself. If a tensor was not initialized by the user but was created during the forward pass, it won't receive its gradient. This can be bypassed by calling the retain_grad method on it, which forces its backward node to also perform the AccumulateGrad function.
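For example, in the sketch below (assuming PyTorch-like behavior) the intermediate tensor b gets no gradient because it was created during the forward pass and retain_grad was not called on it:

```python
import kittygrad as kitty

a = kitty.tensor([1., 2., 3.], requires_grad=True)
b = a * 2                # intermediate tensor created during the forward pass
c = kitty.dot(b, b)

c.backward()
print(a.grad)            # filled in by an AccumulateGrad node
print(b.grad)            # stays empty: retain_grad() was not called on b
```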
Another important aspect worth mentioning is the version control of tensors. So that in-place operations on tensors work correctly and without creating unnecessary copies, kittygrad assigns version zero to each tensor and increments it whenever the tensor is modified in-place (see the inplace decorator). If such a tensor is retained in the context of a backward node and is needed for gradient computation, its version is compared to the one it had at the time the backward node was created, and an exception is raised on a mismatch.
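A sketch of how this might surface in user code (the exact exception type and message are an assumption on my part):

```python
import kittygrad as kitty

a = kitty.tensor([1., 2., 3.], requires_grad=True)
b = a * 2
c = kitty.dot(b, b)  # the backward node keeps b at version 0
b += 1               # in-place update bumps b's version (inplace decorator)

c.backward()         # expected to raise: a saved tensor was modified in-place
```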
Some operations do not create full-fledged tensors in computer memory but rather use another view on an existing data storage. In such cases, the resulting tensor shares its version with its predecessor. This is achieved by using the share decorator.
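A toy illustration of the shared-version idea (again, the classes below are illustrative, not kittygrad's own):

```python
import numpy as np


class Version:
    """Mutable counter that a tensor and its views can share."""

    def __init__(self):
        self.value = 0


class Tensor:
    def __init__(self, data, version=None):
        self.data = np.asarray(data, dtype=np.float32)
        self._version = version if version is not None else Version()

    def view(self):
        # no copy of the data and no copy of the version object:
        # the result is just another window onto the same storage
        return Tensor(self.data, version=self._version)

    def add_(self, value):
        self.data += value        # in-place modification of the shared storage...
        self._version.value += 1  # ...bumps the version seen by all views
        return self


base = Tensor([1., 2., 3.])
v = base.view()
v.add_(1)
assert base._version.value == v._version.value == 1
```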
Above, I described the most crucial points, which may not capture the entire picture. For those interested, I provide the following links:
- How it works in PyTorch
- micrograd and its awesome spelled-out explanation
- Automatic differentiation in machine learning: a survey
MIT