ResearchTrend.AI
Sparse Universal Transformer
arXiv: 2310.07096 · 11 October 2023
Shawn Tan, Songlin Yang, Zhenfang Chen, Aaron Courville, Chuang Gan
Tags: MoE

Papers citing "Sparse Universal Transformer"

6 / 6 papers shown
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
Shawn Tan, Songlin Yang, Aaron Courville, Rameswar Panda, Yikang Shen
23 Oct 2024

Investigating Recurrent Transformers with Dynamic Halt
Jishnu Ray Chowdhury, Cornelia Caragea
01 Feb 2024

Mixture of Attention Heads: Selecting Attention Heads Per Token
Xiaofeng Zhang, Songlin Yang, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong
Tags: MoE
11 Oct 2022

Compositional Semantic Parsing with Large Language Models
Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, Denny Zhou
Tags: ReLM, LRM
29 Sep 2022

Neural Networks and the Chomsky Hierarchy
Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, L. Wenliang, ..., Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, Pedro A. Ortega
Tags: UQCV
05 Jul 2022

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020