arXiv:1910.06188
Q8BERT: Quantized 8Bit BERT
14 October 2019
Ofir Zafrir
Guy Boudoukh
Peter Izsak
Moshe Wasserblat
MQ
Papers citing "Q8BERT: Quantized 8Bit BERT" (50 of 304 papers shown)
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Xiaoxia Wu
Haojun Xia
Stephen Youn
Zhen Zheng
Shiyang Chen
...
Reza Yazdani Aminabadi
Yuxiong He
Olatunji Ruwase
Leon Song
Zhewei Yao
14 Dec 2023
FP8-BERT: Post-Training Quantization for Transformer
Jianwei Li
Tianchi Zhang
Ian En-Hsu Yen
Dongkuan Xu
MQ
10 Dec 2023
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
Srinath Namburi
Makesh Narsimhan Sreedhar
Srinath Srinivasan
Frederic Sala
MQ
01 Dec 2023
Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices
Gokul Srinivasagan
Michael Deisher
Munir Georges
VLM
30 Nov 2023
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
Yunpeng Huang
Jingwei Xu
Junyu Lai
Zixu Jiang
Taolue Chen
...
Xiaoxing Ma
Lijuan Yang
Zhou Xin
Shupeng Li
Penghao Zhao
LLMAG
KELM
21 Nov 2023
EELBERT: Tiny Models through Dynamic Embeddings
Gabrielle Cohn
Rishika Agarwal
Deepanshu Gupta
Siddharth Patwardhan
31 Oct 2023
ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers
Zhewei Yao
Reza Yazdani Aminabadi
Stephen Youn
Xiaoxia Wu
Elton Zheng
Yuxiong He
MQ
26 Oct 2023
Watermarking LLMs with Weight Quantization
Linyang Li
Botian Jiang
Pengyu Wang
Ke Ren
Hang Yan
Xipeng Qiu
MQ
WaLM
17 Oct 2023
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
MoE
16 Oct 2023
NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models
Jongwoo Ko
Seungjoon Park
Yujin Kim
Sumyeong Ahn
Du-Seong Chang
Euijai Ahn
SeYoung Yun
16 Oct 2023
A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models
Takuma Udagawa
Aashka Trivedi
Michele Merler
Bishwaranjan Bhattacharjee
13 Oct 2023
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Yixiao Li
Yifan Yu
Chen Liang
Pengcheng He
Nikos Karampatziakis
Weizhu Chen
Tuo Zhao
MQ
12 Oct 2023
Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention
Huiyin Xue
Nikolaos Aletras
11 Oct 2023
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu
Naihao Deng
Sahand Sabour
Yilin Jia
Minlie Huang
Rada Mihalcea
09 Oct 2023
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
Cheng Zhang
Jianyi Cheng
Ilia Shumailov
George A. Constantinides
Yiren Zhao
MQ
08 Oct 2023
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Josh Alman
Zhao Song
06 Oct 2023
A Study of Quantisation-aware Training on Time Series Transformer Models for Resource-constrained FPGAs
Tianheng Ling
Chao Qian
Lukas Einhaus
Gregor Schiele
04 Oct 2023
QuATON: Quantization Aware Training of Optical Neurons
Hasindu Kariyawasam
Ramith Hettiarachchi
Quansan Yang
Alex Matlock
Takahiro Nambara
Hiroyuki Kusaka
Yuichiro Kunai
Peter T C So
Edward S Boyden
D. Wadduwage
MQ
04 Oct 2023
AxOMaP: Designing FPGA-based Approximate Arithmetic Operators using Mathematical Programming
Siva Satyendra Sahoo
Salim Ullah
Akash Kumar
23 Sep 2023
Softmax Bias Correction for Quantized Generative Models
N. Pandey
Marios Fournarakis
Chirag I. Patel
Markus Nagel
DiffM
04 Sep 2023
Sparse Binary Transformers for Multivariate Time Series Modeling
Matt Gorbett
Hossein Shirazi
I. Ray
AI4TS
09 Aug 2023
RecycleGPT: An Autoregressive Language Model with Recyclable Module
Yu Jiang
Qiaozhi He
Xiaomin Zhuang
Zhihua Wu
Kunpeng Wang
Wenlai Zhao
Guangwen Yang
KELM
07 Aug 2023
Tango: rethinking quantization for graph neural network training on GPUs
Shiyang Chen
Da Zheng
Caiwen Ding
Chengying Huan
Yuede Ji
Hang Liu
GNN
MQ
02 Aug 2023
A Survey of Techniques for Optimizing Transformer Inference
Krishna Teja Chitty-Venkata
Sparsh Mittal
M. Emani
V. Vishwanath
Arun Somani
16 Jul 2023
Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT
Souvik Kundu
S. Nittur
Maciej Szankin
Sairam Sundaresan
MQ
14 Jul 2023
Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models
James O'Neill
Sourav Dutta
VLM
MQ
12 Jul 2023
Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
Seongjun Yang
Gibbeum Lee
Jaewoong Cho
Dimitris Papailiopoulos
Kangwook Lee
12 Jul 2023
Large Language Models as General Pattern Machines
Suvir Mirchandani
F. Xia
Peter R. Florence
Brian Ichter
Danny Driess
Montse Gonzalez Arenas
Kanishka Rao
Dorsa Sadigh
Andy Zeng
LLMAG
10 Jul 2023
ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers
Gamze Islamoglu
Moritz Scherer
G. Paulin
Tim Fischer
Victor J. B. Jung
Angelo Garofalo
Luca Benini
MQ
07 Jul 2023
BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models
Phuoc-Hoan Charles Le
Xinlin Li
ViT
MQ
29 Jun 2023
An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
Haihao Shen
Hengyu Meng
Bo Dong
Zhe Wang
Ofir Zafrir
...
Hanwen Chang
Qun Gao
Zi. Wang
Guy Boudoukh
Moshe Wasserblat
MoE
28 Jun 2023
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Yelysei Bondarenko
Markus Nagel
Tijmen Blankevoort
MQ
22 Jun 2023
Training Transformers with 4-bit Integers
Haocheng Xi
Changhao Li
Jianfei Chen
Jun Zhu
MQ
21 Jun 2023
SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim
Coleman Hooper
A. Gholami
Zhen Dong
Xiuyu Li
Sheng Shen
Michael W. Mahoney
Kurt Keutzer
MQ
13 Jun 2023
Revisiting Token Pruning for Object Detection and Instance Segmentation
Yifei Liu
Mathias Gehrig
Nico Messikommer
Marco Cannici
Davide Scaramuzza
ViT
VLM
12 Jun 2023
How Can Recommender Systems Benefit from Large Language Models: A Survey
Jianghao Lin
Xinyi Dai
Yunjia Xi
Weiwen Liu
Bo Chen
...
Chenxu Zhu
Huifeng Guo
Yong Yu
Ruiming Tang
Weinan Zhang
LRM
09 Jun 2023
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Clemens J. S. Schaefer
Navid Lambert-Shirzad
Xiaofan Zhang
Chia-Wei Chou
T. Jablin
Jian Li
Elfie Guo
Caitlin Stanton
S. Joshi
Yu Emma Wang
MQ
08 Jun 2023
Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
Wangchunshu Zhou
Ronan Le Bras
Yejin Choi
04 Jun 2023
Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings
Daniel Rotem
Michael Hassid
Jonathan Mamou
Roy Schwartz
04 Jun 2023
Binary and Ternary Natural Language Generation
Zechun Liu
Barlas Oğuz
Aasish Pappu
Yangyang Shi
Raghuraman Krishnamoorthi
MQ
02 Jun 2023
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
J. H. Lee
Jeonghoon Kim
S. Kwon
Dongsoo Lee
MQ
01 Jun 2023
LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction
Jeremiah Milbauer
Annie Louis
Mohammad Javad Hosseini
Alex Fabrikant
Donald Metzler
Tal Schuster
31 May 2023
Intriguing Properties of Quantization at Scale
Arash Ahmadian
Saurabh Dash
Hongyu Chen
Bharat Venkitesh
Stephen Gou
Phil Blunsom
Ahmet Üstün
Sara Hooker
MQ
30 May 2023
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
Zhuocheng Gong
Jiahao Liu
Qifan Wang
Yang Yang
Jingang Wang
Wei Wu
Yunsen Xian
Dongyan Zhao
Rui Yan
MQ
30 May 2023
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
A. Ardakani
Altan Haan
Shangyin Tan
Doru-Thom Popovici
Alvin Cheung
Costin Iancu
Koushik Sen
29 May 2023
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu
Barlas Oğuz
Changsheng Zhao
Ernie Chang
Pierre Stock
Yashar Mehdad
Yangyang Shi
Raghuraman Krishnamoorthi
Vikas Chandra
MQ
29 May 2023
One-stop Training of Multiple Capacity Models
Lan Jiang
Haoyang Huang
Dongdong Zhang
R. Jiang
Furu Wei
23 May 2023
Dynamic Transformers Provide a False Sense of Efficiency
Yiming Chen
Simin Chen
Zexin Li
Wei Yang
Cong Liu
R. Tan
Haizhou Li
AAML
20 May 2023
Lifting the Curse of Capacity Gap in Distilling Language Models
Chen Zhang
Yang Yang
Jiahao Liu
Jingang Wang
Yunsen Xian
Benyou Wang
Dawei Song
MoE
20 May 2023
LLM-Pruner: On the Structural Pruning of Large Language Models
Xinyin Ma
Gongfan Fang
Xinchao Wang
19 May 2023