
aliemo/transfomers-silicon-research


Transformer Models Silicon Research

Research and Materials on Hardware implementation of Transformer Models

How to Contribute

You can add new papers via pull requests. Please check data/papers.yaml; if your paper is not in the list, add an entry after the last item and create a pull request.
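Before opening a pull request, it can help to check whether a paper is already listed under slightly different capitalization or punctuation. The following is a minimal sketch of such a check. It works on plain title strings because the schema of data/papers.yaml is not shown here, and the function names `normalize_title` and `is_listed` are hypothetical helpers, not part of the repository.

```python
import re


def normalize_title(title):
    """Lowercase the title and collapse punctuation and whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_listed(new_title, existing_titles):
    """Return True if new_title equals any existing title after normalization."""
    target = normalize_title(new_title)
    return any(normalize_title(t) == target for t in existing_titles)


# Titles copied by hand from the list; loading them from the YAML file is left out
# because the file layout is an unknown here.
existing = [
    "Q8BERT: Quantized 8Bit BERT",
    "I-BERT: Integer-only BERT Quantization",
]

print(is_listed("q8bert:  quantized 8bit bert", existing))
print(is_listed("X-Former: In-Memory Acceleration of Transformers", existing))
```

Exact matching after normalization only catches differences in case, spacing, and punctuation; entries that differ by a word still need a manual look.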

Transformer and BERT Model

  • BERT is a method of pre-training language representations, meaning that we train a general-purpose language understanding model on a large text corpus (like Wikipedia) and then use that model for downstream NLP tasks.

  • BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks.

  • BERT is a Transformer-based model.
    • The architecture of BERT is the encoder stack of the original Transformer model. Rather than combining separate left-to-right and right-to-left models, BERT uses a single bidirectional encoder in which every layer attends to both left and right context.
    • The output of the model is the hidden state produced by the final Transformer layer. The model is pre-trained on a large corpus of unlabeled text with two objectives: masked language modeling and next sentence prediction.
    • The pre-trained BERT model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
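The masked language modeling objective mentioned above can be illustrated with a small sketch. It assumes the corruption scheme described in the BERT paper: about 15% of tokens are selected, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. Selecting each token independently with probability `mask_prob` is a simplification, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels).

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not asked to predict it
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "hat"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

During pre-training the model sees `corrupted` as input and is trained to predict the non-None entries of `labels`.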

Reference Papers

1. Attention Is All You Need

DOI-Link PDF-Download

Code-Link Code-Link

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
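The attention mechanism this paper builds on is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. As a rough illustration of the computation that many of the accelerators listed below target, here is a minimal pure-Python sketch for a single head without masking or batching. The function names are illustrative, and real implementations use matrix libraries or dedicated hardware rather than nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q has shape (n_q, d_k), K has shape (n_k, d_k), V has shape (n_k, d_v),
    each given as a list of rows. Returns a list of n_q rows of length d_v.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```

The softmax and the two matrix products are the parts that the hardware papers in this list most often approximate, quantize, or sparsify.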

2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

DOI-Link PDF-Download Code-Link Code-Link

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Hardware Research

2018

Algorithm-Hardware Co-Design of Single Shot Detector for Fast Object Detection on FPGAs

DOI-Link

SparseNN: An energy-efficient neural network accelerator exploiting input and output sparsity

DOI-Link PDF-Link


2019

A Power Efficient Neural Network Implementation on Heterogeneous FPGA and GPU Devices

DOI-Link

A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning

DOI-Link

An Evaluation of Transfer Learning for Classifying Sales Engagement Emails at Large Scale

DOI-Link

MAGNet: A Modular Accelerator Generator for Neural Networks

DOI-Link PDF-Link

mRNA: Enabling Efficient Mapping Space Exploration for a Reconfiguration Neural Accelerator

DOI-Link PDF-Link

Pre-trained bert-gru model for relation extraction

DOI-Link

Q8BERT: Quantized 8Bit BERT

DOI-Link PDF-Link

Structured pruning of a BERT-based question answering model

DOI-Link PDF-Link

Structured pruning of large language models

DOI-Link PDF-Link

Tinybert: Distilling bert for natural language understanding

DOI-Link PDF-Link


2020

A Low-Cost Reconfigurable Nonlinear Core for Embedded DNN Applications

DOI-Link

A Multi-Neural Network Acceleration Architecture

DOI-Link

A Primer in BERTology: What We Know About How BERT Works

DOI-Link

A Reconfigurable DNN Training Accelerator on FPGA

DOI-Link

A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation

DOI-Link

Emerging Neural Workloads and Their Impact on Hardware

DOI-Link

Accelerating event detection with DGCNN and FPGAs

DOI-Link

An Empirical Analysis of BERT Embedding for Automated Essay Scoring

DOI-Link

An investigation on different underlying quantization schemes for pre-trained language models

DOI-Link

ATT: A Fault-Tolerant ReRAM Accelerator for Attention-based Neural Networks

DOI-Link

Binarybert: Pushing the limit of bert quantization

DOI-Link

Capuchin: Tensor-based GPU Memory Management for Deep Learning

DOI-Link

CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails

DOI-Link

CatBERT: Context-Aware Tiny BERT for Detecting Targeted Social Engineering Emails

DOI-Link

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

DOI-Link

Combining Feature Selection Methods with BERT: An In-depth Experimental Study of Long Text Classification

DOI-Link

Comparison of Deep Learning Models and Various Text Pre-Processing Techniques for the Toxic Comments Classification

DOI-Link

Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning

DOI-Link

Compression of deep learning models for NLP

DOI-Link

Deep Learning Acceleration with Neuron-to-Memory Transformation

DOI-Link

Earlybert: Efficient bert training via early-bird lottery tickets

DOI-Link

Efficient algorithms and hardware for natural language processing

DOI-Link

Efficient transformer-based large scale language representations using hardware-friendly block structured pruning

DOI-Link

FARM: A flexible accelerator for recurrent and memory augmented neural networks

DOI-Link

Fastformers: Highly efficient transformer models for natural language understanding

DOI-Link

FTRANS: energy-efficient acceleration of transformers using FPGA

DOI-Link

Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer

DOI-Link

Improving Accuracy and Speeding Up Document Image Classification Through Parallel Systems

DOI-Link PDF-Link

Improving post training neural quantization: Layer-wise calibration and integer programming

DOI-Link PDF-Link

Integer quantization for deep learning inference: Principles and empirical evaluation

DOI-Link PDF-Link

Ladabert: Lightweight adaptation of bert through hybrid model compression

DOI-Link PDF-Link

Load What You Need: Smaller Versions of Multilingual BERT

DOI-Link PDF-Link

Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration

DOI-Link PDF-Link

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

DOI-Link PDF-Link

Movement Pruning: Adaptive Sparsity by Fine-Tuning

DOI-Link PDF-Link

MSP: an FPGA-specific mixed-scheme, multi-precision deep neural network quantization framework

DOI-Link PDF-Link

Poor Man's BERT: Smaller and Faster Transformer Models

DOI-Link PDF-Link

PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

DOI-Link PDF-Link

Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

DOI-Link PDF-Link

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

DOI-Link PDF-Link

ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration

DOI-Link PDF-Link

SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

DOI-Link PDF-Link

TernaryBERT: Distillation-aware Ultra-low Bit BERT

DOI-Link PDF-Link

Training Large Neural Networks with Constant Memory using a New Execution Algorithm

DOI-Link PDF-Link

Ultron-AutoML: An open-source, distributed, scalable framework for efficient hyper-parameter optimization

DOI-Link PDF-Link

Towards Fully 8-bit Integer Inference for the Transformer Model

DOI-Link PDF-Link

TopicBERT for energy efficient document classification

DOI-Link PDF-Link


2021

A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators

DOI-Link

A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

DOI-Link

A Microcontroller is All You Need: Enabling Transformer Execution on Low-Power IoT Endnodes

DOI-Link

A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

DOI-Link

A Study on Token Pruning for ColBERT

DOI-Link

A White Paper on Neural Network Quantization

DOI-Link

Accelerated Device Placement Optimization with Contrastive Learning

DOI-Link

Accelerating bandwidth-bound deep learning inference with main-memory accelerators

DOI-Link

Accelerating Emerging Neural Workloads

DOI-Link

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

DOI-Link

Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning

DOI-Link PDF-Link

Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization

DOI-Link

Adapting by pruning: A case study on BERT

DOI-Link PDF-Link

Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions

DOI-Link

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA

DOI-Link

Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices

DOI-Link

Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond

DOI-Link

AUBER: Automated BERT regularization

DOI-Link

Automatic Mixed-Precision Quantization Search of BERT

DOI-Link

BERMo: What can BERT learn from ELMo?

DOI-Link

BERT Model for Classification of Fake News using the Cloud Processing Capacity

DOI-Link

Bertinho: Galician BERT representations

DOI-Link

BERxiT: Early exiting for BERT with better fine-tuning and extension to regression

DOI-Link

Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression

DOI-Link

Binary Complex Neural Network Acceleration on FPGA: (Invited Paper)

DOI-Link

Biomedical Named Entity Recognition at Scale

DOI-Link

Block pruning for faster transformers

DOI-Link

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

DOI-Link

DAP-BERT: Differentiable Architecture Pruning of BERT

DOI-Link

Demystifying BERT: Implications for Accelerator Design

DOI-Link

Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length

DOI-Link

EAGLE: Expedited Device Placement with Automatic Grouping for Large Models

DOI-Link

EBERT: Efficient BERT Inference with Dynamic Structured Pruning

DOI-Link

EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

DOI-Link

ELSA: Hardware-Software co-design for efficient, lightweight self-attention mechanism in neural networks

DOI-Link

Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators

DOI-Link

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

DOI-Link

Energy efficiency boost in the AI-infused POWER10 processor

DOI-Link

Fixed-point Quantization for Vision Transformer

DOI-Link

FlexACC: A Programmable Accelerator with Application-Specific ISA for Flexible Deep Neural Network Inference

DOI-Link

Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration

DOI-Link

Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference

DOI-Link

Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing

DOI-Link

Hardware acceleration of sparse and irregular tensor computations of ml models: A survey and insights

DOI-Link

HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU

DOI-Link

HoloFormer: Deep Compression of Pre-Trained Transforms via Unified Optimization of N:M Sparsity and Integer Quantization

DOI-Link PDF-Link

How Deep Learning Model Architecture and Software Stack Impacts Training Performance in the Cloud

DOI-Link

How to Train BERT with an Academic Budget

DOI-Link PDF-Link

I-BERT: Integer-only BERT Quantization

DOI-Link PDF-Link

Improving the efficiency of transformers for resource-constrained devices

DOI-Link PDF-Link

KAISA: An adaptive second-order optimizer framework for deep neural networks

DOI-Link PDF-Link

KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization

DOI-Link PDF-Link

Kunlun: A 14nm High-Performance AI Processor for Diversified Workloads

DOI-Link

Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling

DOI-Link PDF-Link

Learning Light-Weight Translation Models from Deep Transformer

DOI-Link PDF-Link

M2M: Learning to Enhance Low-Light Image from Model to Mobile FPGA

DOI-Link

NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search

DOI-Link PDF-Link

NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds

DOI-Link PDF-Link

NLP-Fast: A Fast, Scalable, and Flexible System to Accelerate Large-Scale Heterogeneous NLP Models

DOI-Link

NPE: An FPGA-based Overlay Processor for Natural Language Processing

DOI-Link PDF-Link

Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection

DOI-Link

PTQ4ViT: Post-Training Quantization Framework for Vision Transformers with Twin Uniform Quantization

DOI-Link PDF-Link

Randomly Wired Network Based on RoBERTa and Dialog History Attention for Response Selection

DOI-Link

Re2PIM: A Reconfigurable ReRAM-Based PIM Design for Variable-Sized Vector-Matrix Multiplication

DOI-Link

RISC-VTF: RISC-V Based Extended Instruction Set for Transformer

DOI-Link

RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions

DOI-Link PDF-Link

Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture

DOI-Link PDF-Link

Simplified TinyBERT: Knowledge Distillation for Document Retrieval

DOI-Link PDF-Link

SmaQ: Smart Quantization for DNN Training by Exploiting Value Clustering

DOI-Link PDF-Link

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

DOI-Link PDF-Link

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

DOI-Link PDF-Link

SQuAT: Sharpness- and Quantization-Aware Training for BERT

DOI-Link PDF-Link

Stochastic precision ensemble: self-knowledge distillation for quantized deep neural networks

DOI-Link PDF-Link

Talos: A Weighted Speedup-Aware Device Placement of Deep Learning Models

DOI-Link

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

DOI-Link PDF-Link

Training with Quantization Noise for Extreme Model Compression

DOI-Link PDF-Link

Transformer Acceleration with Dynamic Sparse Attention

DOI-Link PDF-Link

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

DOI-Link PDF-Link

Vis-TOP: Visual Transformer Overlay Processor

DOI-Link PDF-Link

Elbert: Fast albert with confidence-window based early exit

DOI-Link PDF-Link

Ghostbert: Generate more features with cheap operations for BERT

DOI-Link PDF-Link

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

DOI-Link PDF-Link

Prune once for all: Sparse pre-trained language models

DOI-Link PDF-Link

ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques

DOI-Link PDF-Link

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

DOI-Link PDF-Link


2022

A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing

DOI-Link

A 40nm 5.6TOPS/W 239GOPS/mm2 Self-Attention Processor with Sign Random Projection-based Approximation

DOI-Link

A Dual-Mode Similarity Search Accelerator based on Embedding Compression for Online Cross-Modal Image-Text Retrieval

DOI-Link

A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks

DOI-Link

A Fast Post-Training Pruning Framework for Transformers

DOI-Link

A Framework for Accelerating Transformer-Based Language Model on ReRAM-Based Architecture

DOI-Link

A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining

DOI-Link

A Lite Romanian BERT: ALR-BERT

DOI-Link

A Resource-Saving Energy-Efficient Reconfigurable Hardware Accelerator for BERT-based Deep Neural Network Language Models using FFT Multiplication

DOI-Link

A Self-Attention Network for Deep JSCCM: The Design and FPGA Implementation

DOI-Link

Accelerating attention mechanism on fpgas based on efficient reconfigurable systolic array

DOI-Link

Accelerating attention through gradient-based learned runtime pruning

DOI-Link

Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method

DOI-Link

Accelerating Transformer Networks through Recomposing Softmax Layers

DOI-Link PDF-Link

Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling

DOI-Link PDF-Link

Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design

DOI-Link PDF-Link

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

DOI-Link

Alternative non-BERT model choices for the textual classification in low-resource languages and environments

DOI-Link

An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers

DOI-Link

An Automatic and Efficient BERT Pruning for Edge AI Systems

DOI-Link

An Efficient Hardware Accelerator for Sparse Transformer Neural Networks

DOI-Link

An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention

DOI-Link

An FPGA-Based Transformer Accelerator Using Output Block Stationary Dataflow for Object Recognition Applications

DOI-Link

Analog-memory-based 14nm Hardware Accelerator for Dense Deep Neural Networks including Transformers

DOI-Link

Answer Fast: Accelerating BERT on the Tensor Streaming Processor

DOI-Link

ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization

DOI-Link

APT: The master-copy-free training method for quantised neural network on edge devices

DOI-Link

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

DOI-Link

Balance Multi-Head Attention based on Software and Hardware Co-design

DOI-Link

BEBERT: Efficient and robust binary ensemble BERT

DOI-Link

BERT model optimization methods for inference: a comparative study of five alternative BERT-model implementations

DOI-Link

BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning

DOI-Link

BERTPerf: Inference Latency Predictor for BERT on ARM big.LITTLE Multi-Core Processors

DOI-Link

BiBERT: Accurate Fully Binarized BERT

DOI-Link

Bigger&Faster: Two-stage Neural Architecture Search for Quantized Transformer Models

DOI-Link

BiT: Robustly Binarized Multi-distilled Transformer

DOI-Link

Boosting Distributed Training Performance of the Unpadded BERT Model

DOI-Link

Compact Token Representations with Contextual Quantization for Efficient Document Re-ranking

DOI-Link

Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding

DOI-Link

Compression of Generative Pre-trained Language Models via Quantization

DOI-Link

CONNA: Configurable Matrix Multiplication Engine for Neural Network Acceleration

DOI-Link

CPSAA: Accelerating Sparse Attention using Crossbar-based Processing-In-Memory Architecture

DOI-Link

Demystifying BERT: System Design Implications

DOI-Link

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

DOI-Link

DiVIT: Algorithm and architecture co-design of differential attention in vision transformer

DOI-Link

DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration

DOI-Link

DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization

DOI-Link

DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture

DOI-Link

Dynamic Precision Analog Computing for Neural Networks

DOI-Link

EFA-Trans: An Efficient and Flexible Acceleration Architecture for Transformers

DOI-Link

Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

DOI-Link

Elastic Processing and Hardware Architectures for Machine Learning

DOI-Link

Empirical Evaluation of Post-Training Quantization Methods for Language Tasks

DOI-Link

Enabling and Accelerating Dynamic Vision Transformer Inference for Real-Time Applications

DOI-Link

Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems

DOI-Link

Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks

DOI-Link

Enabling fast uncertainty estimation: accelerating bayesian transformers via algorithmic and hardware optimizations

DOI-Link

Enabling Fast Uncertainty Estimation: Exploiting Structured Sparsity in Bayesian Transformers

DOI-Link

Ensemble Model Compression for Fast and Energy-Efficient Ranking on FPGAs

DOI-Link

Extending the ONNX Runtime Framework for the Processing-in-Memory Execution

DOI-Link

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

DOI-Link

Fast Heterogeneous Task Mapping for Reducing Edge DNN Latency

DOI-Link

FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization

DOI-Link

Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT

DOI-Link

FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization: late breaking results

DOI-Link

FPGA-based design and implementation of the location attention mechanism in neural networks

DOI-Link

From dense to sparse: Contrastive pruning for better pre-trained language model compression

DOI-Link

Future Scaling of Memory Hierarchy for Tensor Cores and Eliminating Redundant Shared Memory Traffic Using Inter-Warp Multicasting

DOI-Link

Greedy-layer pruning: Speeding up transformer models for natural language processing

DOI-Link

GuardNN: secure accelerator architecture for privacy-preserving deep learning

DOI-Link

Handling heavy-tailed input of transformer inference on GPUs

DOI-Link

Hardware Acceleration of Transformer Networks using FPGAs

DOI-Link

Hardware and Software Co-design for Soft Switch in ViT Variants Processing Unit

DOI-Link

Hardware and Software Co-optimization for Windows Attention

DOI-Link

Improving Oversubscribed GPU Memory Performance in the PyTorch Framework

DOI-Link

Integer Fine-tuning of Transformer-based Models

DOI-Link PDF-Link

Learned Token Pruning in Contextualized Late Interaction over BERT (ColBERT)

DOI-Link PDF-Link

Lightweight Composite Re-Ranking for Efficient Keyword Search with BERT

DOI-Link PDF-Link

Lightweight Transformers for Conversational AI

DOI-Link PDF-Link

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

DOI-Link PDF-Link

Low-Bit Quantization of Transformer for Audio Speech Recognition

DOI-Link

Low-Precision Quantization Techniques for Hardware-Implementation-Friendly BERT Models

DOI-Link

MKQ-BERT: Quantized BERT with 4-bits Weights and Activations

DOI-Link PDF-Link

Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models

DOI-Link PDF-Link

Mr. BiQ: Post-Training Non-Uniform Quantization Based on Minimizing the Reconstruction Error

DOI-Link PDF-Link

Near-Optimal Sparse Allreduce for Distributed Deep Learning

DOI-Link PDF-Link

Nebula: A Scalable and Flexible Accelerator for DNN Multi-Branch Blocks on Embedded Systems

DOI-Link PDF-Link

NEEBS: Nonexpert large-scale environment building system for deep neural network

DOI-Link

Optimal Brain Compression: A framework for accurate post-training quantization and pruning

DOI-Link PDF-Link

PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors

DOI-Link

Post-Training Quantization for Longformer with Chunkwise Quantization Granularity and Optimized Percentile

DOI-Link

Pre-trained Language Model with Feature Reduction and No Fine-Tuning

DOI-Link

Privacy-Preserving Text Classification on BERT Embeddings with Homomorphic Encryption

DOI-Link PDF-Link

ProSE: the architecture and design of a protein discovery engine

DOI-Link PDF-Link

QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization

DOI-Link PDF-Link

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

DOI-Link PDF-Link

RCT: Resource Constrained Training for Edge AI

DOI-Link PDF-Link

ReAAP: A Reconfigurable and Algorithm-Oriented Array Processor With Compiler-Architecture Co-Design

DOI-Link PDF-Link

Row-wise Accelerator for Vision Transformer

DOI-Link PDF-Link

S4: a High-sparsity, High-performance AI Accelerator

DOI-Link PDF-Link

SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences

DOI-Link PDF-Link

Searching for memory-lighter architectures for OCR-augmented image captioning

DOI-Link

SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression

DOI-Link PDF-Link

Sentiment Analysis Using Pre-Trained Language Model With No Fine-Tuning and Less Resource

DOI-Link PDF-Link

Software and Hardware Fusion Multi-Head Attention

DOI-Link

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

DOI-Link PDF-Link

SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

DOI-Link PDF-Link

T-OPU: An FPGA-based Overlay Processor for Natural Language Processing

DOI-Link PDF-Link

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

DOI-Link PDF-Link

Towards efficient post-training quantization of pre-trained language models

DOI-Link PDF-Link

Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models

DOI-Link PDF-Link

TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes

DOI-Link

TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer

DOI-Link PDF-Link

VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer

DOI-Link PDF-Link

Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

DOI-Link

ViA: A Novel Vision-Transformer Accelerator Based on FPGA

DOI-Link PDF-Link

Work-in-Progress: Utilizing latency and accuracy predictors for efficient hardware-aware NAS

DOI-Link

XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient

DOI-Link PDF-Link

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

DOI-Link

Fully Unsupervised Machine Translation Using Context-Aware Word Translation and Denoising Autoencoder

DOI-Link PDF-Link

DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT

DOI-Link PDF-Link

Data Movement Reduction for DNN Accelerators: Enabling Dynamic Quantization Through an eFPGA

DOI-Link PDF-Link

Hardware-friendly compression and hardware acceleration for transformer: A survey

DOI-Link PDF-Link

Hardware/Software Co-Design of Edge DNN Accelerators with TFLite

DOI-Link

Workload-Balanced Graph Attention Network Accelerator with Top-K Aggregation Candidates

DOI-Link


2023

An Efficient Transformer Inference Engine on DSP

DOI-Link

CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

DOI-Link

DTATrans: Leveraging Dynamic Token-Based Quantization With Accuracy Compensation Mechanism for Efficient Transformer Architecture

DOI-Link

ENEX-FP: A BERT-Based Address Recognition Model

DOI-Link

HAMMER: Hardware-friendly Approximate Computing for Self-attention with Mean-redistribution and Linearization

DOI-Link

SECDA-TFLite: A toolkit for efficient development of FPGA-based DNN accelerators for edge inference

DOI-Link PDF-Link

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

DOI-Link PDF-Link

Sparse*BERT: Sparse Models Generalize To New tasks and Domains

DOI-Link PDF-Link

Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers

DOI-Link PDF-Link

TiC-SAT: Tightly-Coupled Systolic Accelerator for Transformers

DOI-Link PDF-Link

ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention

DOI-Link PDF-Link

ViTA: A Vision Transformer Inference Accelerator for Edge Applications

DOI-Link PDF-Link

Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning

DOI-Link PDF-Link

TRON: Transformer Neural Network Acceleration with Non-Coherent Silicon Photonics

DOI-Link PDF-Link

TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference

DOI-Link PDF-Link

TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge

DOI-Link PDF-Link

Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning

DOI-Link PDF-Link

Hardware-efficient Softmax Approximation for Self-Attention Networks

DOI-Link

Fast Prototyping Next-Generation Accelerators for New ML Models using MASE: ML Accelerator System Exploration

DOI-Link PDF-Link

Advances in Electromagnetics Empowered by Artificial Intelligence and Deep Learning

DOI-Link

A Scalable GPT-2 Inference Hardware Architecture on FPGA

DOI-Link

BL-PIM: Varying the Burst Length to Realize the All-Bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM

DOI-Link PDF-Link

Integrated Transformers Inference Framework for Multiple Tenants on GPU

DOI-Link PDF-Link

Embedded Deep Learning Accelerators: A Survey on Recent Advances

DOI-Link

Collective Communication Enabled Transformer Acceleration on Heterogeneous Clusters

DOI-Link PDF-Link

FET-OPU: A Flexible and Efficient FPGA-Based Overlay Processor for Transformer Networks

DOI-Link

Racism and Hate Speech Detection on Twitter: A QAHA-Based Hybrid Deep Learning Approach Using LSTM-CNN

DOI-Link PDF-Link

22.9 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management

DOI-Link

P^3 ViT: A CIM-Based High-Utilization Architecture With Dynamic Pruning and Two-Way Ping-Pong Macro for Vision Transformer

DOI-Link

I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference

DOI-Link PDF-Link

Enabling efficient edge intelligence: a hardware-software codesign approach

DOI-Link PDF-Link

Automatic Kernel Generation for Large Language Models on Deep Learning Accelerators

DOI-Link

A Low-Latency and Lightweight FPGA-Based Engine for Softmax and Layer Normalization Acceleration

DOI-Link

PP-Transformer: Enable Efficient Deployment of Transformers Through Pattern Pruning

DOI-Link

DEAP: Design Space Exploration for DNN Accelerator Parallelism

DOI-Link PDF-Link

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

DOI-Link PDF-Link

An RRAM-Based Computing-in-Memory Architecture and Its Application in Accelerating Transformer Inference

DOI-Link

Mobile Transformer Accelerator Exploiting Various Line Sparsity and Tile-Based Dynamic Quantization

DOI-Link

A Lightweight Transformer Model using Neural ODE for FPGAs

DOI-Link

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

DOI-Link PDF-Link

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

DOI-Link PDF-Link

ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

DOI-Link PDF-Link

X-Former: In-Memory Acceleration of Transformers

DOI-Link PDF-Link

GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models

DOI-Link PDF-Link

HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

DOI-Link PDF-Link

ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design

DOI-Link PDF-Link

AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers

DOI-Link PDF-Link

Streaming Tensor Programs: A Programming Abstraction for Streaming Dataflow Accelerators

DOI-Link PDF-Link


2024

A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE

DOI-Link PDF-Link

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

DOI-Link PDF-Link

Accelerating Neural Networks for Large Language Models and Graph Processing with Silicon Photonics

DOI-Link PDF-Link

Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models

DOI-Link

RDCIM: RISC-V Supported Full-Digital Computing-in-Memory Processor With High Energy Efficiency and Low Area Overhead

DOI-Link

A Survey on Hardware Accelerators for Large Language Models

DOI-Link PDF-Link

BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge

DOI-Link PDF-Link

AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

DOI-Link PDF-Link

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

DOI-Link PDF-Link

CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators

DOI-Link PDF-Link

The Era of Generative Artificial Intelligence: In-Memory Computing Perspective

DOI-Link

Hydragen: High-Throughput LLM Inference with Shared Prefixes

DOI-Link PDF-Link

A Survey on Transformer Compression

DOI-Link PDF-Link

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

DOI-Link PDF-Link

Stochastic Spiking Attention: Accelerating Attention with Stochastic Computing in Spiking Networks

DOI-Link PDF-Link

Reusing Softmax Hardware Unit for GELU Computation in Transformers

DOI-Link PDF-Link

ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters

DOI-Link PDF-Link

Speculative Streaming: Fast LLM Inference without Auxiliary Models

DOI-Link PDF-Link

H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices

DOI-Link PDF-Link

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

DOI-Link PDF-Link

Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication

DOI-Link PDF-Link

DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing

DOI-Link PDF-Link

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

DOI-Link PDF-Link

Accelerating ViT Inference on FPGA through Static and Dynamic Pruning

DOI-Link PDF-Link

Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems

DOI-Link PDF-Link

Impact of High-Level-Synthesis on Reliability of Artificial Neural Network Hardware Accelerators

DOI-Link PDF-Link

An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT

DOI-Link PDF-Link

TransFRU: Efficient Deployment of Transformers on FPGA with Full Resource Utilization

DOI-Link

PRIMATE: Processing in Memory Acceleration for Dynamic Token-pruning Transformers

DOI-Link

SWAT: An Efficient Swin Transformer Accelerator Based on FPGA

DOI-Link

VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA

DOI-Link PDF-Link

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

DOI-Link PDF-Link

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

DOI-Link PDF-Link

VITA: ViT Acceleration for Efficient 3D Human Mesh Recovery via Hardware-Algorithm Co-Design

DOI-Link PDF-Link

HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis

DOI-Link PDF-Link

SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators

DOI-Link PDF-Link

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

DOI-Link PDF-Link

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

DOI-Link PDF-Link

TensorMap: A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators

DOI-Link

JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

DOI-Link PDF-Link

DCT-ViT: High-Frequency Pruned Vision Transformer with Discrete Cosine Transform

DOI-Link PDF-Link

TransAxx: Efficient Transformers with Approximate Computing

DOI-Link PDF-Link

CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing

DOI-Link PDF-Link

ProTEA: Programmable Transformer Encoder Acceleration on FPGA

DOI-Link PDF-Link

CAT: Customized Transformer Accelerator Framework on Versal ACAP

DOI-Link PDF-Link

Co-design of a TinyLLM using Programmable Logic and Software on an FPGA

DOI-Link

BitShare: An Efficient Precision-Scalable Accelerator with Combining-Like-Terms GEMM

DOI-Link PDF-Link

SDA: Low-Bit Stable Diffusion Acceleration on Edge FPGA

DOI-Link PDF-Link

Hardware Accelerator for MobileViT Vision Transformer with Reconfigurable Computation

DOI-Link

In-Memory Transformer Self-Attention Mechanism Using Passive Memristor Crossbar

DOI-Link

A 3.55 mJ/frame Energy-efficient Mixed-Transformer based Semantic Segmentation Accelerator for Mobile Devices

DOI-Link

FLAG: Formula-LLM-Based Auto-Generator for Baseband Hardware

DOI-Link

CV-CIM: A Hybrid Domain Xor-Derived Similarity-Aware Computation-in-Memory Supporting Cost–Volume Construction

DOI-Link

LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

DOI-Link PDF-Link

ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks

DOI-Link PDF-Link

CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference

DOI-Link PDF-Link

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment

DOI-Link PDF-Link

SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From Scratch

DOI-Link

MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition

DOI-Link

TCP: A Tensor Contraction Processor for AI Workloads Industrial Product

DOI-Link

A 109-GOPs/W FPGA-based Vision Transformer Accelerator with Weight-Loop Dataflow Featuring Data Reusing and Resource Saving

DOI-Link

Klotski v2: Improved DNN Model Orchestration Framework for Dataflow Architecture Accelerators

DOI-Link

Quartet: A Holistic Hybrid Parallel Framework for Training Large Language Models

DOI-Link

Inference with Transformer Encoders on ARM and RISC-V Multicore Processors

DOI-Link

Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product

DOI-Link

Cost-Effective LLM Accelerator Using Processing in Memory Technology

DOI-Link

A 28nm 4.35TOPS/mm2 Transformer Accelerator with Basis-vector Based Ultra Storage Compression, Decomposed Computation and Unified LUT-Assisted Cores

DOI-Link

FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-art to Future Opportunities

DOI-Link PDF-Link

Efficient Transformer Acceleration via Reconfiguration for Encoder and Decoder Models and Sparsity-Aware Algorithm Mapping

DOI-Link PDF-Link

VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks

DOI-Link

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

DOI-Link PDF-Link

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

DOI-Link PDF-Link

Hardware-oriented algorithms for softmax and layer normalization of large language models

DOI-Link PDF-Link

Optimizing DNN Inference on Multi-Accelerator SoCs at Training-time

DOI-Link PDF-Link

DSTC: Dual-Side Sparse Tensor Core for DNNs Acceleration on Modern GPU Architectures

DOI-Link

Power Efficient ASIC Design for Vision Transformer using Systolic Array Matrix Multiplier

DOI-Link

M^2-ViT: Accelerating Hybrid Vision Transformers with Two-Level Mixed Quantization

DOI-Link PDF-Link

A Cascaded ReRAM-based Crossbar Architecture for Transformer Neural Network Acceleration

DOI-Link PDF-Link

OPASCA: Outer Product Based Accelerator With Unified Architecture for Sparse Convolution and Attention

DOI-Link

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

DOI-Link

Analysis Towards Deployment and Acceleration for ViT on a Lightweight RISC-V Processor

DOI-Link

Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy

DOI-Link


Analysis