2503.04725
L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
6 March 2025
Zhuo Chen
Oriol Mayné i Comas
Zhuotao Jin
Di Luo
Marin Soljacic
Papers citing
"L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling"
24 / 24 papers shown
Explaining Context Length Scaling and Bounds for Language Models
Jingzhe Shi
Qinwei Ma
Hongyi Liu
Hang Zhao
Jenq-Neng Hwang
Lei Li
LRM
227
3
0
03 Feb 2025
FlexAttention for Efficient High-Resolution Vision-Language Models
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
105
5
0
29 Jul 2024
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Yu Sun
Xinhao Li
Karan Dalal
Jiarui Xu
Arjun Vikram
...
Xinlei Chen
Xiaolong Wang
Sanmi Koyejo
Tatsunori Hashimoto
Carlos Guestrin
126
111
0
05 Jul 2024
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao
Albert Gu
Mamba
116
535
0
31 May 2024
A Dynamical Model of Neural Scaling Laws
Blake Bordelon
Alexander B. Atanasov
Cengiz Pehlevan
101
44
0
02 Feb 2024
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng
Eric Alcaide
Quentin G. Anthony
Alon Albalak
Samuel Arcadinho
...
Qihang Zhao
P. Zhou
Qinghua Zhou
Jian Zhu
Rui-Jie Zhu
235
609
0
22 May 2023
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
850
9,683
0
28 Jan 2022
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Anders Andreassen
Guy Gur-Ari
Henryk Michalewski
Jacob Austin
...
Aitor Lewkowycz
Maarten Bosma
D. Luan
Charles Sutton
Augustus Odena
ReLM
LRM
183
756
0
30 Nov 2021
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu
Karan Goel
Christopher Ré
217
1,829
0
31 Oct 2021
Explaining Neural Scaling Laws
Yasaman Bahri
Ethan Dyer
Jared Kaplan
Jaehoon Lee
Utkarsh Sharma
78
269
0
12 Feb 2021
Big Bird: Transformers for Longer Sequences
Manzil Zaheer
Guru Guruganesh
Kumar Avinava Dubey
Joshua Ainslie
Chris Alberti
...
Philip Pham
Anirudh Ravula
Qifan Wang
Li Yang
Amr Ahmed
VLM
563
2,103
0
28 Jul 2020
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos
Apoorv Vyas
Nikolaos Pappas
François Fleuret
203
1,793
0
29 Jun 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
889
42,463
0
28 May 2020
The Information Bottleneck Problem and Its Applications in Machine Learning
Ziv Goldfeld
Yury Polyanskiy
66
137
0
30 Apr 2020
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALM
VLM
185
4,100
0
10 Apr 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
638
4,921
0
23 Jan 2020
Quantum Natural Gradient
J. Stokes
J. Izaac
N. Killoran
Giuseppe Carleo
54
410
0
04 Sep 2019
On Mutual Information Maximization for Representation Learning
Michael Tschannen
Josip Djolonga
Paul Kishan Rubenstein
Sylvain Gelly
Mario Lucic
SSL
184
502
0
31 Jul 2019
Adaptive Attention Span in Transformers
Sainbayar Sukhbaatar
Edouard Grave
Piotr Bojanowski
Armand Joulin
79
286
0
19 May 2019
Mutual Information Scaling and Expressive Power of Sequence Models
Huitao Shen
71
18
0
10 May 2019
Generating Long Sequences with Sparse Transformers
R. Child
Scott Gray
Alec Radford
Ilya Sutskever
129
1,916
0
23 Apr 2019
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai
Zhilin Yang
Yiming Yang
J. Carbonell
Quoc V. Le
Ruslan Salakhutdinov
VLM
260
3,747
0
09 Jan 2019
Learning deep representations by mutual information estimation and maximization
R. Devon Hjelm
A. Fedorov
Samuel Lavoie-Marchildon
Karan Grewal
Phil Bachman
Adam Trischler
Yoshua Bengio
SSL
DRL
352
2,672
0
20 Aug 2018
Deep Learning and the Information Bottleneck Principle
Naftali Tishby
Noga Zaslavsky
DRL
217
1,593
0
09 Mar 2015