DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

25 September 2023
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, L. Song, Samyam Rajbhandari, Yuxiong He
arXiv:2309.14509

Papers citing "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models"

23 papers shown

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, V. Korthikanti, ..., Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, J. Yang
21 Apr 2025 · MoE

OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, ..., Size Zheng, Yanghua Peng, H. Lin, Xin Liu, Chuan Wu
14 Apr 2025 · AI4CE

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
17 Feb 2025

Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
10 Feb 2025 · VGen

Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
04 Feb 2025

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques
Nathaniel Tomczak, Sanmukh Kuppannagari
31 Jan 2025

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., Kai Zhang, Chong Chen, Fan Yang, Yue Yang, Lili Qiu
03 Jan 2025

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
20 Nov 2024

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
Wei Wu, Zhuoshi Pan, Chao Wang, L. Chen, Y. Bai, Kun Fu, Zehua Wang, Hui Xiong
05 Nov 2024 · LLMAG

Context Parallelism for Scalable Million-Token Inference
Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang
04 Nov 2024 · MoE, LRM

How to Train Long-Context Language Models (Effectively)
Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen
03 Oct 2024 · RALM

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, D. Panda
30 Aug 2024

Real-Time Video Generation with Pyramid Attention Broadcast
Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You
22 Aug 2024 · VGen, DiffM

YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole
31 Aug 2023 · OSLM

Sequence Parallelism: Long Sequence Training from System Perspective
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You
26 May 2021

BookSum: A Collection of Datasets for Long-form Narrative Summarization
Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir R. Radev
18 May 2021 · RALM

Rethinking Attention with Performers
K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, ..., Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller
30 Sep 2020

Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
28 Jul 2020 · VLM

Memory-Efficient Pipeline-Parallel DNN Training
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei A. Zaharia
16 Jun 2020 · MoE

Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, Arman Cohan
10 Apr 2020 · RALM, VLM

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019 · MoE

Generating Long Sequences with Sparse Transformers
R. Child, Scott Gray, Alec Radford, Ilya Sutskever
23 Apr 2019

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
15 Sep 2016 · ODL