

Merge pull request #2 from open-neuromorphic/main
update
neural-loop authored Dec 4, 2023
2 parents d37e8cc + 36467c5 commit 7b1878c
Showing 5 changed files with 40 additions and 19 deletions.
Binary file added content/english/blog/northpole/cover.png
36 changes: 18 additions & 18 deletions content/english/blog/northpole/index.md
@@ -1,9 +1,9 @@
---
title: "Neural inference at the frontier of energy, space and time - NorthPole, IBM"
description: "Translating the new paper from IBM to human language."
image: brain-to-chip.png
image: cover.png
draft: true
date: 2023-11-25
date: 2023-11-28
showTableOfContents: true
author:
- Fabrizio Ottati
@@ -16,15 +16,15 @@ biology. We will use them as guidelines to analyze the paper.

The outline of this blog post is the same as the original article.

# Axiomatic design
## Axiomatic design

> NorthPole, an architecture and a programming model for neural inference,
reimagines (Fig. 1) the interaction between compute and memory by embodying 10
interrelated, synergistic axioms that build on brain-inspired computing.

Fancy terminology :)

## Axiom 1 - A dedicated DNN inference engine
### Axiom 1 - A dedicated DNN inference engine

> Turning to architecture, NorthPole is specialized for neural inference. For
example, it has no data-dependent conditional branching, and it does not support
@@ -97,7 +97,7 @@ processing units (GPUs) support 2:4 sparsity, which means that out of every 4 elements
in a matrix, 2 are zeros (more or less, I am not being extremely precise on
this).
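To make the 2:4 pattern concrete, here is a toy sketch of structured pruning (an illustration only, not NVIDIA's actual pruning algorithm; `prune_2_4` is a name I made up):

```python
def prune_2_4(row):
    """Toy 2:4 structured pruning: in every group of 4 consecutive
    weights, keep the 2 with the largest magnitude and zero the rest."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(len(group)), key=lambda j: -abs(group[j]))[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

pruned = prune_2_4([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.1, 0.6])
# each group of 4 now contains exactly 2 zeros
```

The hardware can then store only the nonzero values plus a small per-group index, which is what makes the scheme cheap to exploit.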

## Axiom 2 - Getting inspired by biological neurons
### Axiom 2 - Getting inspired by biological neurons

> Inspired by biological precision, NorthPole is optimized for 8, 4, and 2-bit
low-precision. This is sufficient to achieve state-of-the-art inference accuracy
@@ -153,7 +153,7 @@ Moreover, FP16 precision is starting to be enough for training. State-of-the-art
GPUs also support FP8 and _integer_ precision [[NVIDIA H100 Tensor Core
GPU Architecture](https://resources.nvidia.com/en-us-tensor-core)].

## Axiom 3 - Massive computational parallelism
### Axiom 3 - Massive computational parallelism

> NorthPole has a distributed, modular core array (16-by-16), with each core
capable of massive parallelism (8192 2-bit operations per cycle) (Fig. 2F).
@@ -264,7 +264,7 @@ needed to execute a MAC! Instead, if the MAC unit accesses the data in the PE
itself (the PE register file bar) or from another PE (the NoC bar), the energy
drawback is bearable.

## Axiom 4 - Efficiency in distribution
### Axiom 4 - Efficiency in distribution

> NorthPole distributes memory among cores (Figs. 1B and 2F) and, within a core,
not only places memories near compute (2) but also intertwines critical compute
@@ -295,7 +295,7 @@ logic or the special purpose macros available on the silicon. I do not know if
it is brain-inspired but it makes sense from a silicon perspective if you want
to maximize efficiency.

## Axiom 5 - A neural Network-on-Chip
### Axiom 5 - A neural Network-on-Chip

> NorthPole uses two dense networks on-chip (NoCs) (20) to interconnect the
cores, unifying and integrating the distributed computation and memory (Fig. 2,
@@ -311,7 +311,7 @@ network-on-chip (NoC). There are two NoCs in NorthPole: one to exchange the
intermediate results among PEs (the _gray_ matter NoC) and one for the inputs of
the neural network (the _white_ matter NoC).

## Axiom 6 - Beyond data: efficient code distribution
### Axiom 6 - Beyond data: efficient code distribution

> Another two NoCs enable reconfiguring synaptic weights and programs on each
core for high-speed operation of compute units (Fig. 2, C and D). The brain’s
@@ -335,7 +335,7 @@ performed (_i.e._, the sequence of operations to be carried out). The comparison
with TrueNorth is not really fair: completely different designs, completely
different goals.

## Axiom 7 - No branches, lots of party
### Axiom 7 - No branches, lots of party

> NorthPole exploits data-independent branching to support a fully pipelined,
stall-free, deterministic control operation for high temporal utilization
@@ -352,7 +352,7 @@ data movement is fully deterministic (_e.g._, first I process the channel
dimension, then the width, then the height, etc.), I would be _very_ worried if I
had stalls or cache misses :)
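A minimal sketch of what such a data-independent schedule looks like (shapes invented for illustration): the traversal order is fixed before execution and never branches on the data, so there is nothing to mispredict and nothing to stall on.

```python
# A fully deterministic loop nest over a feature map: the order
# (channel, then width, then height) is decided at compile time and
# is identical on every run, regardless of the input values.
C, W, H = 2, 3, 3   # made-up shapes
order = []
for c in range(C):          # channel dimension first
    for w in range(W):      # then width
        for h in range(H):  # then height
            order.append((c, w, h))
```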

## Axiom 8 - Low precision, same performance with backprop
### Axiom 8 - Low precision, same performance with backprop

> Turning to algorithms and software, co-optimized training algorithms (fig. S3)
enable state-of-the-art inference accuracy to be achieved by incorporating
@@ -368,7 +368,7 @@ of the network: to recover this, the DNN is trained for a few more epochs, using
backprop to tune the network while taking into account the approximations
introduced by the quantization process.
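A minimal sketch of the uniform quantization step being described (a generic post-training scheme with helper names I invented, not IBM's co-optimized algorithm):

```python
def quantize(w, bits=8):
    """Uniform symmetric quantization of a weight list to signed
    integers of the given bit width (toy version for illustration)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for INT8
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) for v in w], scale

def dequantize(q, scale):
    """Map the integers back to floats; the difference from the
    original weights is the error that fine-tuning compensates for."""
    return [v * scale for v in q]

q, s = quantize([0.5, -0.3, 0.1], bits=8)
```

The rounding error introduced here is exactly what the extra training epochs let the network absorb.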

## Axiom 9 - Start optimizing inference from the code
### Axiom 9 - Start optimizing inference from the code

> Codesigned software (fig. S3) automatically determines an explicit
orchestration schedule for computation, memory, and communication to achieve
@@ -389,7 +389,7 @@ Eyeriss [[Chen et
al.](https://dspace.mit.edu/bitstream/handle/1721.1/101151/eyeriss_isscc_2016.pdf)]
strikes again.

## Axiom 10 - What happens in NorthPole, stays in NorthPole
### Axiom 10 - What happens in NorthPole, stays in NorthPole

> NorthPole employs a usage model that consists of writing an input frame and
reading an output frame (Figs. 1D and 3), which enables it to operate
@@ -415,7 +415,7 @@ Uhm, real-time embedded system. So it must be super efficient to be run on such
a limited system, right? However, in Table 1 of the paper, the power consumption
required to run an INT8 version of ResNet50 is 74 W. Ouch :)

# Silicon implementation
## Silicon implementation

> NorthPole has been fabricated in a 12-nm process and has 22 billion
transistors in an 800-mm2 area, 256 cores, 2048 (4096 and 8192) operations per
@@ -432,7 +432,7 @@ weights only) is stored as INT8, occupying 1 B in memory. This means that a
network with 768 k parameters can be hosted on a single core (forgive me, this is
not fully precise as I am considering only the weights).
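A back-of-the-envelope version of that claim (my assumption: "768 k" means 768 × 1024 one-byte INT8 weights fitting in per-core memory, ignoring activations and code):

```python
bytes_per_weight = 1                     # one INT8 weight -> 1 B
weights_per_core = 768 * 1024 // bytes_per_weight
num_cores = 256                          # the 16-by-16 core array
weights_on_chip = weights_per_core * num_cores  # upper bound, weights only
```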

# Energy, space and time
## Energy, space and time

> For methodological rigor that ensures a fair and level comparison of various
implementations, it is critical that all evaluation metrics be independent of
@@ -523,7 +523,7 @@ Another reason Keller et al. is much more efficient than NorthPole is
that it supports _sparsity-aware processing_, _i.e._, it skips zero computations
without reading the zero values (I am simplifying).
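A toy model of what zero-skipping buys (hypothetical helper names; Keller et al. do this in hardware, with compressed weight storage, so the real savings include the skipped memory reads):

```python
def dense_macs(weights, activations):
    """Dense dot product: every weight/activation pair costs a MAC."""
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        acc += w * a
        macs += 1
    return acc, macs

def zero_skipping_macs(weights, activations):
    """Sparsity-aware version: zero weights cost neither a MAC nor
    (in hardware) a read of the corresponding activation."""
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0:
            acc += w * a
            macs += 1
    return acc, macs

w = [0.5, 0.0, 0.0, -1.0]
a = [2.0, 3.0, 4.0, 1.0]
# same result, half the MACs on this 50%-sparse example
```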

# (My) conclusions
## (My) conclusions

In conclusion, the following statement

@@ -557,7 +557,7 @@ there were more technical details in the paper, since it is written for a broad audience. I
am fairly sure that we would have got a different paper had they
chosen an IEEE journal instead of Science, where hardware is not really common.

# Acknowledgements
## Acknowledgements

I would like to thank [Jascha Achterberg](https://www.jachterberg.com) for
reviewing this blog post and the super-useful discussion about the
@@ -566,7 +566,7 @@ authors claim biological inspiration actually proves useful (_e.g._, distributed
memory hierarchy), unlike other approaches that severely compromise
performance (_e.g._, accuracy) with negligible efficiency improvements.

# Bibliography
## Bibliography

* [_Neural inference at the frontier of energy, space, and time_](https://www.science.org/doi/10.1126/science.adh1174), Dharmendra S. Modha et al., Science, 2023.
* [_HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity_](https://arxiv.org/abs/2305.12718), Yannan Nellie Wu et al., IEEE Micro, 2023.
Binary file not shown.
21 changes: 21 additions & 0 deletions content/english/workshops/northpole/index.md
@@ -0,0 +1,21 @@
---
title: "IBM NorthPole - Neural inference at the frontier of energy, space, and time"
author:
- "Carlos Ortega-Otero"
date: "2024-01-25"
start_time: 18:00
end_time: 19:30
time_zone: CET
description: ""
upcoming: true
image:
speaker_photo: "carlos.webp"
speaker_bio: "Dr. Carlos Ortega-Otero is a Senior Research Staff Member at IBM, driven by a passion for Circuit Design, Neuromorphic Chip Architectures, Low-Power Circuits, and Physical Design optimizations. He earned his Ph.D. from Cornell University under the guidance of Prof. Rajit Manohar. Throughout his career, he has worked on groundbreaking projects, including Ultra-Low-Power Asynchronous Sensor Network nodes, Medical Implantable Wireless Sensors, the TrueNorth Brain-Inspired Chip, and the NorthPole Project. At IBM, Carlos works under the leadership of Dr. Dharmendra Modha in the Brain-Inspired Computing Group, where he plays key roles in the Architecture, Specification, Digital Implementation, Physical Design, Timing Signoff, and Manufacturing teams of the NorthPole Project. Carlos is proud to be part of a group that continues to shape the future of Integrated Circuits and AI."
---

Computing, since its inception, has been processor-centric, with memory separated from compute. Inspired by the organic brain and optimized for inorganic silicon, NorthPole is a neural inference architecture that blurs this boundary by eliminating off-chip memory, intertwining compute with memory on-chip, and appearing externally as an active memory chip. NorthPole is a low-precision, massively parallel, densely interconnected, energy-efficient, and spatial computing architecture with a co-optimized, high-utilization programming model.

On the ResNet50 benchmark image classification network, relative to a graphics processing unit (GPU) that uses a comparable 12-nanometer technology process, NorthPole achieves a 25 times higher energy metric of frames per second (FPS) per watt, a 5 times higher space metric of FPS per transistor, and a 22 times lower time metric of latency. Similar results are reported for the Yolo-v4 detection network.

NorthPole outperforms all prevalent architectures, even those that use more-advanced technology processes.

@@ -13,7 +13,7 @@ upcoming: true
description: "Explore the power of Spyx in a hands-on hackathon session and dive into the world of neuromorphic frameworks with Kade Heckel."
---

Join us on December 14th for an exciting Spyx hackathon and ONM talk! Learn how to use and contribute to [Spyx](https://github.com/kmheckel/spyx), a high-performance spiking neural network library, and gain insights into the latest developments in neuromorphic frameworks. The session will cover Spyx's utilization of memory and GPU to maximize training throughput, along with discussions on the evolving landscape of neuromorphic computing.
Join us on December 13th for an exciting Spyx hackathon and ONM talk! Learn how to use and contribute to [Spyx](https://github.com/kmheckel/spyx), a high-performance spiking neural network library, and gain insights into the latest developments in neuromorphic frameworks. The session will cover Spyx's utilization of memory and GPU to maximize training throughput, along with discussions on the evolving landscape of neuromorphic computing.

Don't miss this opportunity to engage with experts, collaborate on cutting-edge projects, and explore the potential of Spyx in shaping the future of neuromorphic computing. Whether you're a seasoned developer or just curious about the field, this event promises valuable insights and hands-on experience.

