Memory Augmented Language Models through Mixture of Word Experts

Papers citing "Memory Augmented Language Models through Mixture of Word Experts"

MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
05 Oct 2021
