ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2211.09761
  4. Cited By
Efficient Transformers with Dynamic Token Pooling

Efficient Transformers with Dynamic Token Pooling

17 November 2022
Piotr Nawrot
J. Chorowski
Adrian Lañcucki
Edoardo Ponti
ArXivPDFHTML

Papers citing "Efficient Transformers with Dynamic Token Pooling"

48 / 48 papers shown
Title
SuperBPE: Space Travel for Language Models
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
102
7
0
17 Mar 2025
Neural Attention Search
Neural Attention Search
Difan Deng
Marius Lindauer
126
0
0
21 Feb 2025
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
75
4
0
28 Oct 2024
Emergent Abilities of Large Language Models
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
261
2,462
0
15 Jun 2022
Variable-rate hierarchical CPC leads to acoustic unit discovery in
  speech
Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
Santiago Cuervo
Adrian Lañcucki
R. Marxer
Paweł Rychlikowski
J. Chorowski
SSL
42
13
0
05 Jun 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya A. Ramesh
Prafulla Dhariwal
Alex Nichol
Casey Chu
Mark Chen
VLM
DiffM
339
6,830
0
13 Apr 2022
Memorizing Transformers
Memorizing Transformers
Yuhuai Wu
M. Rabe
DeLesley S. Hutchins
Christian Szegedy
RALM
68
177
0
16 Mar 2022
SCROLLS: Standardized CompaRison Over Long Language Sequences
SCROLLS: Standardized CompaRison Over Long Language Sequences
Uri Shaham
Elad Segal
Maor Ivgi
Avia Efrat
Ori Yoran
...
Ankit Gupta
Wenhan Xiong
Mor Geva
Jonathan Berant
Omer Levy
RALM
79
137
0
10 Jan 2022
Hierarchical Transformers Are More Efficient Language Models
Hierarchical Transformers Are More Efficient Language Models
Piotr Nawrot
Szymon Tworkowski
Michał Tyrolski
Lukasz Kaiser
Yuhuai Wu
Christian Szegedy
Henryk Michalewski
59
64
0
26 Oct 2021
Revisiting the Uniform Information Density Hypothesis
Revisiting the Uniform Information Density Hypothesis
Clara Meister
Tiago Pimentel
Patrick Haller
Lena Jäger
Ryan Cotterell
R. Levy
76
75
0
23 Sep 2021
Combiner: Full Attention Transformer with Sparse Computation Cost
Combiner: Full Attention Transformer with Sparse Computation Cost
Hongyu Ren
H. Dai
Zihang Dai
Mengjiao Yang
J. Leskovec
Dale Schuurmans
Bo Dai
105
79
0
12 Jul 2021
Charformer: Fast Character Transformers via Gradient-based Subword
  Tokenization
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Yi Tay
Vinh Q. Tran
Sebastian Ruder
Jai Gupta
Hyung Won Chung
Dara Bahri
Zhen Qin
Simon Baumgartner
Cong Yu
Donald Metzler
86
158
0
23 Jun 2021
Segmental Contrastive Predictive Coding for Unsupervised Word
  Segmentation
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
Saurabhchand Bhati
Jesús Villalba
Piotr Żelasko
Laureano Moro-Velazquez
Najim Dehak
SSL
55
37
0
03 Jun 2021
ByT5: Towards a token-free future with pre-trained byte-to-byte models
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue
Aditya Barua
Noah Constant
Rami Al-Rfou
Sharan Narang
Mihir Kale
Adam Roberts
Colin Raffel
83
502
0
28 May 2021
FNet: Mixing Tokens with Fourier Transforms
FNet: Mixing Tokens with Fourier Transforms
James Lee-Thorp
Joshua Ainslie
Ilya Eckstein
Santiago Ontanon
90
526
0
09 May 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
49
277
0
22 Mar 2021
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
  Representation
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
J. Clark
Dan Garrette
Iulia Turc
John Wieting
85
218
0
11 Mar 2021
Rethinking Attention with Performers
Rethinking Attention with Performers
K. Choromanski
Valerii Likhosherstov
David Dohan
Xingyou Song
Andreea Gane
...
Afroz Mohiuddin
Lukasz Kaiser
David Belanger
Lucy J. Colwell
Adrian Weller
167
1,570
0
30 Sep 2020
Self-Supervised Contrastive Learning for Unsupervised Phoneme
  Segmentation
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
Felix Kreuk
Joseph Keshet
Yossi Adi
SSL
50
79
0
27 Jul 2020
Linformer: Self-Attention with Linear Complexity
Linformer: Self-Attention with Linear Complexity
Sinong Wang
Belinda Z. Li
Madian Khabsa
Han Fang
Hao Ma
185
1,694
0
08 Jun 2020
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
  Language Processing
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Zihang Dai
Guokun Lai
Yiming Yang
Quoc V. Le
76
233
0
05 Jun 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
609
41,736
0
28 May 2020
Longformer: The Long-Document Transformer
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALM
VLM
128
4,048
0
10 Apr 2020
Byte Pair Encoding is Suboptimal for Language Model Pretraining
Byte Pair Encoding is Suboptimal for Language Model Pretraining
Kaj Bostrom
Greg Durrett
61
209
0
07 Apr 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy
M. Saffar
Ashish Vaswani
David Grangier
MoE
294
596
0
12 Mar 2020
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
526
4,773
0
23 Jan 2020
Unsupervised Cross-lingual Representation Learning at Scale
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau
Kartikay Khandelwal
Naman Goyal
Vishrav Chaudhary
Guillaume Wenzek
Francisco Guzmán
Edouard Grave
Myle Ott
Luke Zettlemoyer
Veselin Stoyanov
193
6,538
0
05 Nov 2019
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek
Marie-Anne Lachaux
Alexis Conneau
Vishrav Chaudhary
Francisco Guzmán
Armand Joulin
Edouard Grave
81
654
0
01 Nov 2019
Generating Long Sequences with Sparse Transformers
Generating Long Sequences with Sparse Transformers
R. Child
Scott Gray
Alec Radford
Ilya Sutskever
93
1,894
0
23 Apr 2019
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai
Zhilin Yang
Yiming Yang
J. Carbonell
Quoc V. Le
Ruslan Salakhutdinov
VLM
192
3,724
0
09 Jan 2019
SentencePiece: A simple and language independent subword tokenizer and
  detokenizer for Neural Text Processing
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo
John Richardson
175
3,514
0
19 Aug 2018
Representation Learning with Contrastive Predictive Coding
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord
Yazhe Li
Oriol Vinyals
DRL
SSL
284
10,253
0
10 Jul 2018
Subword Regularization: Improving Neural Network Translation Models with
  Multiple Subword Candidates
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo
195
1,165
0
29 Apr 2018
Neural Speed Reading via Skim-RNN
Neural Speed Reading via Skim-RNN
Minjoon Seo
Sewon Min
Ali Farhadi
Hannaneh Hajishirzi
65
79
0
06 Nov 2017
Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks
Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks
Victor Campos
Brendan Jou
Xavier Giró-i-Nieto
Jordi Torres
Shih-Fu Chang
47
218
0
22 Aug 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
628
130,942
0
12 Jun 2017
Fast-Slow Recurrent Neural Networks
Fast-Slow Recurrent Neural Networks
Asier Mujika
Florian Meier
Angelika Steger
73
77
0
24 May 2017
Categorical Reparameterization with Gumbel-Softmax
Categorical Reparameterization with Gumbel-Softmax
Eric Jang
S. Gu
Ben Poole
BDL
281
5,360
0
03 Nov 2016
The Concrete Distribution: A Continuous Relaxation of Discrete Random
  Variables
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
Chris J. Maddison
A. Mnih
Yee Whye Teh
BDL
157
2,529
0
02 Nov 2016
Surprisal-Driven Zoneout
Surprisal-Driven Zoneout
K. Rocki
Tomasz Kornuta
Tegan Maharaj
46
8
0
24 Oct 2016
Hierarchical Multiscale Recurrent Neural Networks
Hierarchical Multiscale Recurrent Neural Networks
Junyoung Chung
Sungjin Ahn
Yoshua Bengio
BDL
84
536
0
06 Sep 2016
Gaussian Error Linear Units (GELUs)
Gaussian Error Linear Units (GELUs)
Dan Hendrycks
Kevin Gimpel
165
4,994
0
27 Jun 2016
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
David M. Krueger
Tegan Maharaj
János Kramár
Mohammad Pezeshki
Nicolas Ballas
Nan Rosemary Ke
Anirudh Goyal
Yoshua Bengio
Aaron Courville
C. Pal
67
317
0
03 Jun 2016
Adaptive Computation Time for Recurrent Neural Networks
Adaptive Computation Time for Recurrent Neural Networks
Alex Graves
84
6
0
29 Mar 2016
Neural Machine Translation of Rare Words with Subword Units
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich
Barry Haddow
Alexandra Birch
195
7,729
0
31 Aug 2015
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
John Schulman
N. Heess
T. Weber
Pieter Abbeel
OffRL
133
392
0
17 Jun 2015
U-Net: Convolutional Networks for Biomedical Image Segmentation
U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger
Philipp Fischer
Thomas Brox
SSeg
3DV
1.6K
76,917
0
18 May 2015
A Clockwork RNN
A Clockwork RNN
Jan Koutník
Klaus Greff
Faustino J. Gomez
Jürgen Schmidhuber
80
500
0
14 Feb 2014
1