Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models [MQ]
Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos
23 March 2022 · arXiv:2203.12758
Papers citing "Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models" (39 papers shown)
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
Minsu Kim, Seongmin Hong, RyeoWook Ko, S. Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park
24 Mar 2025

Ditto: Accelerating Diffusion Model via Temporal Value Similarity
Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro
20 Jan 2025

Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures
Archit Parnami, Rahul Singh, Tarun Joshi
28 Oct 2021

Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling
Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi
07 Oct 2021

Compute and Energy Consumption Trends in Deep Learning Inference
Radosvet Desislavov, Fernando Martínez-Plumed, José Hernández-Orallo
12 Sep 2021

Block Pruning For Faster Transformers [VLM]
François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush
10 Sep 2021

Differentiable Subset Pruning of Transformer Heads
Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan
10 Aug 2021

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification
Yiding Jiang, Bidisha Sharma, Maulik C. Madhavi, Haizhou Li
05 Aug 2021

Learned Token Pruning for Transformers
Sehoon Kim, Sheng Shen, D. Thorsley, A. Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer
02 Jul 2021

Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation
Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
16 Feb 2021

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [MoE]
W. Fedus, Barret Zoph, Noam M. Shazeer
11 Jan 2021

I-BERT: Integer-only BERT Quantization [MQ]
Sehoon Kim, A. Gholami, Z. Yao, Michael W. Mahoney, Kurt Keutzer
05 Jan 2021

Training data-efficient image transformers & distillation through attention [ViT]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
23 Dec 2020

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Hanrui Wang, Zhekai Zhang, Song Han
17 Dec 2020

TernaryBERT: Distillation-aware Ultra-low Bit BERT [MQ]
Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu
27 Sep 2020

Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation [MQ]
Insoo Chung, Byeongwook Kim, Yoonjung Choi, S. Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
16 Sep 2020

Big Bird: Transformers for Longer Sequences [VLM]
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
28 Jul 2020

Data Movement Is All You Need: A Case Study on Optimizing Transformers
A. Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
30 Jun 2020

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
29 Jun 2020

Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
08 Jun 2020

DeBERTa: Decoding-enhanced BERT with Disentangled Attention [AAML]
Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
05 Jun 2020

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han
28 May 2020

Language Models are Few-Shot Learners [BDL]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
28 May 2020

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference [MQ]
Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, Andreas Moshovos
08 May 2020

The Cost of Training NLP Models: A Concise Overview
Or Sharir, Barak Peleg, Y. Shoham
19 Apr 2020

Longformer: The Long-Document Transformer [RALM, VLM]
Iz Beltagy, Matthew E. Peters, Arman Cohan
10 Apr 2020

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT [AI4CE]
Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yifan Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett
27 Feb 2020

Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning [VLM]
Mitchell A. Gordon, Kevin Duh, Nicholas Andrews
19 Feb 2020

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models
Linqing Liu, Haiquan Wang, Jimmy J. Lin, R. Socher, Caiming Xiong
09 Nov 2019

Q8BERT: Quantized 8Bit BERT [MQ]
Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat
14 Oct 2019

Knowledge Distillation from Internal Representations
Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Edward Guo
08 Oct 2019

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
02 Oct 2019

RoBERTa: A Robustly Optimized BERT Pretraining Approach [AIMat]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov
26 Jul 2019

Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going
Erwei Wang, James J. Davis, Ruizhe Zhao, Ho-Cheung Ng, Xinyu Niu, Wayne Luk, P. Cheung, George A. Constantinides
21 Jan 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [VLM, SSL, SSeg]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
11 Oct 2018

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding [ELM]
Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
20 Apr 2018

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
Zhe Jia, Marco Maggioni, Benjamin Staiger, D. Scarpazza
18 Apr 2018

SQuAD: 100,000+ Questions for Machine Comprehension of Text [RALM]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang
16 Jun 2016

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding [3DGS]
Song Han, Huizi Mao, W. Dally
01 Oct 2015