Q8BERT: Quantized 8Bit BERT (arXiv:1910.06188)
14 October 2019 (MQ)
Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Papers citing "Q8BERT: Quantized 8Bit BERT"
Showing 50 of 304 papers.
HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models
Junxuan Zhang, Rongxiang Weng, Haoyang Li, Lidan Shou, Ke Chen, Gang Chen, Qin Xie, Guiming Xie, Xuejian Gong
24 Apr 2025

COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference
Ye Qiao, Zhiheng Cheng, Yian Wang, Yifan Zhang, Yunzhe Deng, Sitao Huang
22 Apr 2025

Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Jiajun Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, D. Li, Edith C.-H. Ngai, E. Barsoum
13 Mar 2025

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang
03 Mar 2025

GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
18 Feb 2025 (ALM, MQ)

HadamRNN: Binary and Sparse Ternary Orthogonal RNNs
Armand Foucault, Franck Mamalet, François Malgouyres
28 Jan 2025 (MQ)

On the Compression of Language Models for Code: An Empirical Study on CodeBERT
Giordano d'Aloisio, Luca Traini, Federica Sarro, A. Marco
18 Dec 2024
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu
26 Nov 2024 (MQ)

SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism
Priyansh Bhatnagar, Linfeng Wen, Mingu Kang
15 Nov 2024

Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference
Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, ..., Krishnan Kailas, P. Lima, Neeraja J. Yadwadkar, F. M. G. França, L. John
04 Nov 2024

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
17 Oct 2024

EXAQ: Exponent Aware Quantization For LLMs Acceleration
Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Y. Levy
04 Oct 2024 (MQ)

Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things
Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi
26 Aug 2024 (MQ)

Shifted Window Fourier Transform And Retention For Image Captioning
J. Hu, Roberto Cavicchioli, Alessandro Capotondi
25 Aug 2024 (VLM)

Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness
Xiaojing Fan, Chunliang Tao
08 Aug 2024 (AAML)

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations
Leo Donisch, Sigurd Schacht, Carsten Lanquillon
06 Aug 2024
Designing Efficient LLM Accelerators for Edge Devices
Jude Haris, Rappy Saha, Wenhao Hu, José Cano
01 Aug 2024

Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers
Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, ..., Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang
25 Jul 2024

Inverted Activations
Georgii Sergeevich Novikov, Ivan Oseledets
22 Jul 2024

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang
16 Jul 2024 (MQ)

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang
24 Jun 2024

AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers
Emil Biju, Anirudh Sriram, Mert Pilanci
13 Jun 2024

MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations
Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu
11 Jun 2024 (AAML)

Survey for Landing Generative AI in Social and E-commerce Recsys -- the Industry Perspectives
Da Xu, Danqing Zhang, Guangyu Yang, Bo Yang, Shuyuan Xu, Lingling Zheng, Cindy Liang
10 Jun 2024

VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning
Oshin Dutta, Ritvik Gupta, Sumeet Agarwal
07 Jun 2024
Effective Interplay between Sparsity and Quantization: From Theory to Practice
Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, ..., Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh
31 May 2024 (MQ)

STAT: Shrinking Transformers After Training
Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle
29 May 2024

FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models
Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi
28 May 2024

LoQT: Low Rank Adapters for Quantized Training
Sebastian Loeschcke, M. Toftrup, M. Kastoryano, Serge Belongie, Vésteinn Snæbjarnarson
26 May 2024 (MQ)

Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information
Yanshu Wang, Wenyang He, Tong Yang
24 May 2024 (MQ)

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization
Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao
30 Apr 2024

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy
Yijin Liu, Fandong Meng, Jie Zhou
10 Apr 2024 (AI4CE)

Outlier-Efficient Hopfield Layers for Large Transformer-Based Models
Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Haozheng Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu
04 Apr 2024

Efficiently Distilling LLMs for Edge Applications
Achintya Kundu, Fabian Lim, Aaron Chew, L. Wynter, Penny Chong, Rhui Dih Lee
01 Apr 2024

Accurate Block Quantization in LLMs with Outliers
Nikita Trukhanov, I. Soloveychik
29 Mar 2024 (MQ)
ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-Seong Chang, Jiwon Seo
15 Mar 2024

LookupFFN: Making Transformers Compute-lite for CPU inference
Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh
12 Mar 2024

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
08 Mar 2024 (MQ)

The Impact of Quantization on the Robustness of Transformer-based Text Classifiers
Seyed Parsa Neshaei, Yasaman Boreshban, Gholamreza Ghassem-Sani, Seyed Abolghasem Mirroshandel
08 Mar 2024 (MQ)

EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang
05 Mar 2024 (MQ)

C^3: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding
Taixi Lu, Haoyu Wang, Huajie Shao, Jing Gao, Huaxiu Yao
25 Feb 2024

Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Xinrong Zhang, Yewei Fang, Kaihuo Zhang, Zhiyuan Liu, Maosong Sun
21 Feb 2024

Quantized Embedding Vectors for Controllable Diffusion Language Models
Cheng Kang, Xinye Chen, Yong Hu, Daniel Novak
15 Feb 2024
Model Compression and Efficient Inference for Large Language Models: A Survey
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He
15 Feb 2024 (MQ)

Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
Abhimanyu Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
07 Feb 2024

A Survey on Transformer Compression
Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao
05 Feb 2024

A Comprehensive Survey of Compression Algorithms for Language Models
Seungcheol Park, Jaehyeon Choi, Sojin Lee, U. Kang
27 Jan 2024 (MQ)

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang
26 Jan 2024

BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge
Yuhao Ji, Chao Fang, Zhongfeng Wang
22 Jan 2024

DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization
Rahul Chand, Yashoteja Prabhu, Pratyush Kumar
20 Dec 2023