ResearchTrend.AI
Q8BERT: Quantized 8Bit BERT
Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat
arXiv:1910.06188, 14 October 2019

Papers citing "Q8BERT: Quantized 8Bit BERT"

50 of 304 papers shown
HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models
  Junxuan Zhang, Rongxiang Weng, Haoyang Li, Lidan Shou, Ke Chen, Gang Chen, Qin Xie, Guiming Xie, Xuejian Gong (24 Apr 2025)
COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference
  Ye Qiao, Zhiheng Cheng, Yian Wang, Yifan Zhang, Yunzhe Deng, Sitao Huang (22 Apr 2025)
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
  Jiajun Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, D. Li, Edith C. -H. Ngai, E. Barsoum (13 Mar 2025)
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
  Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang (03 Mar 2025)
GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
  Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang (18 Feb 2025) [ALM, MQ]
HadamRNN: Binary and Sparse Ternary Orthogonal RNNs
  Armand Foucault, Franck Mamalet, François Malgouyres (28 Jan 2025) [MQ]
On the Compression of Language Models for Code: An Empirical Study on CodeBERT
  Giordano d'Aloisio, Luca Traini, Federica Sarro, A. Marco (18 Dec 2024)
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
  Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu (26 Nov 2024) [MQ]
SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism
  Priyansh Bhatnagar, Linfeng Wen, Mingu Kang (15 Nov 2024)
Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference
  Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, ..., Krishnan Kailas, P. Lima, Neeraja J. Yadwadkar, F. M. G. França, L. John (04 Nov 2024)
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
  Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei (17 Oct 2024)
EXAQ: Exponent Aware Quantization For LLMs Acceleration
  Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Y. Levy (04 Oct 2024) [MQ]
Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things
  Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi (26 Aug 2024) [MQ]
Shifted Window Fourier Transform And Retention For Image Captioning
  J. Hu, Roberto Cavicchioli, Alessandro Capotondi (25 Aug 2024) [VLM]
Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness
  Xiaojing Fan, Chunliang Tao (08 Aug 2024) [AAML]
Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations
  Leo Donisch, Sigurd Schacht, Carsten Lanquillon (06 Aug 2024)
Designing Efficient LLM Accelerators for Edge Devices
  Jude Haris, Rappy Saha, Wenhao Hu, José Cano (01 Aug 2024)
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers
  Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, ..., Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang (25 Jul 2024)
Inverted Activations
  Georgii Sergeevich Novikov, Ivan Oseledets (22 Jul 2024)
Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
  Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang (16 Jul 2024) [MQ]
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
  Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang (24 Jun 2024)
AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers
  Emil Biju, Anirudh Sriram, Mert Pilanci (13 Jun 2024)
MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations
  Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu (11 Jun 2024) [AAML]
Survey for Landing Generative AI in Social and E-commerce Recsys -- the Industry Perspectives
  Da Xu, Danqing Zhang, Guangyu Yang, Bo Yang, Shuyuan Xu, Lingling Zheng, Cindy Liang (10 Jun 2024)
VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning
  Oshin Dutta, Ritvik Gupta, Sumeet Agarwal (07 Jun 2024)
Effective Interplay between Sparsity and Quantization: From Theory to Practice
  Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, ..., Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh (31 May 2024) [MQ]
STAT: Shrinking Transformers After Training
  Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle (29 May 2024)
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models
  Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi (28 May 2024)
LoQT: Low Rank Adapters for Quantized Training
  Sebastian Loeschcke, M. Toftrup, M. Kastoryano, Serge Belongie, Vésteinn Snæbjarnarson (26 May 2024) [MQ]
Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information
  Yanshu Wang, Wenyang He, Tong Yang (24 May 2024) [MQ]
EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization
  Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao (30 Apr 2024)
Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy
  Yijin Liu, Fandong Meng, Jie Zhou (10 Apr 2024) [AI4CE]
Outlier-Efficient Hopfield Layers for Large Transformer-Based Models
  Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Haozheng Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu (04 Apr 2024)
Efficiently Distilling LLMs for Edge Applications
  Achintya Kundu, Fabian Lim, Aaron Chew, L. Wynter, Penny Chong, Rhui Dih Lee (01 Apr 2024)
Accurate Block Quantization in LLMs with Outliers
  Nikita Trukhanov, I. Soloveychik (29 Mar 2024) [MQ]
ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
  Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-Seong Chang, Jiwon Seo (15 Mar 2024)
LookupFFN: Making Transformers Compute-lite for CPU inference
  Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh (12 Mar 2024)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
  Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao (08 Mar 2024) [MQ]
The Impact of Quantization on the Robustness of Transformer-based Text Classifiers
  Seyed Parsa Neshaei, Yasaman Boreshban, Gholamreza Ghassem-Sani, Seyed Abolghasem Mirroshandel (08 Mar 2024) [MQ]
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
  Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang (05 Mar 2024) [MQ]
$C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding
  Taixi Lu, Haoyu Wang, Huajie Shao, Jing Gao, Huaxiu Yao (25 Feb 2024)
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
  Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Xinrong Zhang, Yewei Fang, Kaihuo Zhang, Zhiyuan Liu, Maosong Sun (21 Feb 2024)
Quantized Embedding Vectors for Controllable Diffusion Language Models
  Cheng Kang, Xinye Chen, Yong Hu, Daniel Novak (15 Feb 2024)
Model Compression and Efficient Inference for Large Language Models: A Survey
  Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He (15 Feb 2024) [MQ]
Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
  Abhimanyu Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna (07 Feb 2024)
A Survey on Transformer Compression
  Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao (05 Feb 2024)
A Comprehensive Survey of Compression Algorithms for Language Models
  Seungcheol Park, Jaehyeon Choi, Sojin Lee, U. Kang (27 Jan 2024) [MQ]
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
  Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang (26 Jan 2024)
BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge
  Yuhao Ji, Chao Fang, Zhongfeng Wang (22 Jan 2024)
DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization
  Rahul Chand, Yashoteja Prabhu, Pratyush Kumar (20 Dec 2023)