Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.13685
Cited By
MoM: Linear Sequence Modeling with Mixture-of-Memories
19 February 2025
Jusen Du
Weigao Sun
Disen Lan
Jiaxi Hu
Yu Cheng
KELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MoM: Linear Sequence Modeling with Mixture-of-Memories"
44 / 44 papers shown
Title
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
Xiaoye Qu
Yafu Li
Zhaochen Su
Weigao Sun
Jianhao Yan
...
Chaochao Lu
Yue Zhang
Xian-Sheng Hua
Bowen Zhou
Yu Cheng
ReLM
OffRL
LRM
124
35
0
27 Mar 2025
Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Weigao Sun
Disen Lan
Tong Zhu
Xiaoye Qu
Yu Cheng
MoE
169
2
0
07 Mar 2025
Liger: Linearizing Large Language Models to Gated Recurrent Structures
Disen Lan
Weigao Sun
Jiaxi Hu
Jusen Du
Yu Cheng
92
0
0
03 Mar 2025
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
Weigao Sun
Disen Lan
Yiran Zhong
Xiaoye Qu
Yu Cheng
48
4
0
11 Feb 2025
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax
Aonian Li
Bangwei Gong
Bo Yang
Bo Shen
...
Zhan Qin
Zhenhua Fan
Zhihang Yu
Z. L. Jiang
Zijia Wu
MoE
89
37
0
14 Jan 2025
LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training
Xiaoye Qu
Daize Dong
Xuyang Hu
Tong Zhu
Weigao Sun
Yu Cheng
MoE
106
12
0
24 Nov 2024
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Yu Zhang
Aaron Courville
Ruijie Zhu
Yue Zhang
Leyang Cui
...
Freda Shi
Bailin Wang
Wei Bi
P. Zhou
Guohong Fu
90
19
0
11 Sep 2024
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Yu Sun
Xinhao Li
Karan Dalal
Jiarui Xu
Arjun Vikram
...
Xinlei Chen
Xiaolong Wang
Sanmi Koyejo
Tatsunori Hashimoto
Carlos Guestrin
106
99
0
05 Jul 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlícek
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
87
223
0
25 Jun 2024
Scaling Laws for Linear Complexity Language Models
Xuyang Shen
Dong Li
Ruitao Leng
Zhen Qin
Weigao Sun
Yiran Zhong
LRM
41
6
0
24 Jun 2024
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
Tong Zhu
Xiaoye Qu
Daize Dong
Jiacheng Ruan
Jingqi Tong
Conghui He
Yu Cheng
MoE
ALM
61
75
0
24 Jun 2024
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao
Albert Gu
Mamba
49
480
0
31 May 2024
Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective
Zhen Qin
Xuyang Shen
Weigao Sun
Dong Li
Stanley T. Birchfield
Leonid Sigal
Yiran Zhong
63
6
0
27 May 2024
Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
Zhen Qin
Weigao Sun
Dong Li
Xuyang Shen
Weixuan Sun
Yiran Zhong
51
9
0
27 May 2024
HGRN2: Gated Linear RNNs with State Expansion
Zhen Qin
Aaron Courville
Weixuan Sun
Xuyang Shen
Dong Li
Weigao Sun
Yiran Zhong
LRM
63
50
0
11 Apr 2024
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Bo Peng
Daniel Goldstein
Quentin G. Anthony
Alon Albalak
Eric Alcaide
...
Bingchen Zhao
Qihang Zhao
Peng Zhou
Jian Zhu
Ruijie Zhu
63
77
0
08 Apr 2024
MS-Net: A Multi-Path Sparse Model for Motion Prediction in Multi-Scenes
Xiaqiang Tang
Weigao Sun
Siyuan Hu
Yiyang Sun
Yafeng Guo
67
5
0
01 Mar 2024
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De
Samuel L. Smith
Anushan Fernando
Aleksandar Botev
George-Christian Muraru
...
David Budden
Yee Whye Teh
Razvan Pascanu
Nando de Freitas
Çağlar Gülçehre
Mamba
84
127
0
29 Feb 2024
CO2: Efficient Distributed Training with Full Communication-Computation Overlap
Weigao Sun
Zhen Qin
Weixuan Sun
Shidi Li
Dong Li
Xuyang Shen
Yu Qiao
Yiran Zhong
OffRL
80
10
0
29 Jan 2024
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Damai Dai
Chengqi Deng
Chenggang Zhao
R. X. Xu
Huazuo Gao
...
Panpan Huang
Fuli Luo
Chong Ruan
Zhifang Sui
W. Liang
MoE
56
271
0
11 Jan 2024
Gated Linear Attention Transformers with Hardware-Efficient Training
Aaron Courville
Bailin Wang
Songlin Yang
Yikang Shen
Yoon Kim
64
161
0
11 Dec 2023
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu
Tri Dao
Mamba
68
2,552
0
01 Dec 2023
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai
Xin Lv
Jiajie Zhang
Hong Lyu
Jiankai Tang
...
Aohan Zeng
Lei Hou
Yuxiao Dong
Jie Tang
Juanzi Li
LLMAG
RALM
56
548
0
28 Aug 2023
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
531
13,788
0
15 Mar 2023
Resurrecting Recurrent Neural Networks for Long Sequences
Antonio Orvieto
Samuel L. Smith
Albert Gu
Anushan Fernando
Çağlar Gülçehre
Razvan Pascanu
Soham De
202
282
0
11 Mar 2023
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM
PILM
709
12,840
0
27 Feb 2023
Diagonal State Spaces are as Effective as Structured State Spaces
Ankit Gupta
Albert Gu
Jonathan Berant
88
300
0
27 Mar 2022
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su
Yu Lu
Shengfeng Pan
Ahmed Murtadha
Bo Wen
Yunfeng Liu
137
2,307
0
20 Apr 2021
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus
Barret Zoph
Noam M. Shazeer
MoE
57
2,136
0
11 Jan 2021
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin
HyoukJoong Lee
Yuanzhong Xu
Dehao Chen
Orhan Firat
Yanping Huang
M. Krikun
Noam M. Shazeer
Zhiwen Chen
MoE
76
1,142
0
30 Jun 2020
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos
Apoorv Vyas
Nikolaos Pappas
Franccois Fleuret
107
1,734
0
29 Jun 2020
GLU Variants Improve Transformer
Noam M. Shazeer
107
968
0
12 Feb 2020
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk
Rowan Zellers
Ronan Le Bras
Jianfeng Gao
Yejin Choi
OOD
LRM
84
1,724
0
26 Nov 2019
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Samyam Rajbhandari
Jeff Rasley
Olatunji Ruwase
Yuxiong He
ALM
AI4CE
60
852
0
04 Oct 2019
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers
Ari Holtzman
Yonatan Bisk
Ali Farhadi
Yejin Choi
72
2,373
0
19 May 2019
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
Dheeru Dua
Yizhong Wang
Pradeep Dasigi
Gabriel Stanovsky
Sameer Singh
Matt Gardner
AIMat
68
933
0
01 Mar 2019
Know What You Don't Know: Unanswerable Questions for SQuAD
Pranav Rajpurkar
Robin Jia
Percy Liang
RALM
ELM
187
2,830
0
11 Jun 2018
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM
RALM
LRM
69
2,474
0
14 Mar 2018
Decoupled Weight Decay Regularization
I. Loshchilov
Frank Hutter
OffRL
101
2,118
0
14 Nov 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
443
129,831
0
12 Jun 2017
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi
Eunsol Choi
Daniel S. Weld
Luke Zettlemoyer
RALM
173
2,610
0
09 May 2017
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam M. Shazeer
Azalia Mirhoseini
Krzysztof Maziarz
Andy Davis
Quoc V. Le
Geoffrey E. Hinton
J. Dean
MoE
158
2,614
0
23 Jan 2017
Pointer Sentinel Mixture Models
Stephen Merity
Caiming Xiong
James Bradbury
R. Socher
RALM
166
2,814
0
26 Sep 2016
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno
Germán Kruszewski
Angeliki Lazaridou
Q. N. Pham
Raffaella Bernardi
Sandro Pezzelle
Marco Baroni
Gemma Boleda
Raquel Fernández
76
698
0
20 Jun 2016
1