Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

14 July 2021 · arXiv:2107.06925
Shigang Li, Torsten Hoefler
GNN, AI4CE, LRM

Papers citing "Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines"

38 / 38 papers shown

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
431 · 3 · 0 · 20 Nov 2024

Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Liwen Wang, Sheng Chen, Linnan Jiang, Shu Pan, Runze Cai, Sen Yang, Fei Yang
114 · 5 · 0 · 24 Oct 2024

Optimizing Large Model Training through Overlapped Activation Recomputation
Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, ..., Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen
76 · 6 · 0 · 13 Jun 2024

PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction
Lei Guan, Dongsheng Li, Jiye Liang, Wenjian Wang, Xicheng Lu
77 · 1 · 0 · 01 Dec 2023

Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning
Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, Quanquan Gu
DiffM
109 · 17 · 0 · 23 Aug 2023

UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
Hao Lin, Ke Wu, Jie Li, Jun Yu Li, Wu-Jun Li
58 · 2 · 0 · 31 Jul 2023

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste
MQ
293 · 712 · 0 · 31 Jan 2021

ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyang Yang, Minjia Zhang, Dong Li, Yuxiong He
MoE
244 · 427 · 0 · 18 Jan 2021

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus, Barret Zoph, Noam M. Shazeer
MoE
83 · 2,168 · 0 · 11 Jan 2021

Pre-Trained Image Processing Transformer
Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, Wen Gao
VLM, ViT
131 · 1,671 · 0 · 01 Dec 2020

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby
ViT
538 · 40,739 · 0 · 22 Oct 2020

An In-Depth Analysis of the Slingshot Interconnect
Daniele De Sensi, Salvatore Di Girolamo, K. McMahon, Duncan Roweth, Torsten Hoefler
34 · 97 · 0 · 20 Aug 2020

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, ..., Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin
82 · 235 · 0 · 02 Jul 2020

Memory-Efficient Pipeline-Parallel DNN Training
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei A. Zaharia
MoE
71 · 214 · 0 · 16 Jun 2020

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL
680 · 41,736 · 0 · 28 May 2020

End-to-End Object Detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
ViT, 3DV, PINN
357 · 13,002 · 0 · 26 May 2020

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler
48 · 15 · 0 · 30 Apr 2020

PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
ODL
384 · 42,299 · 0 · 03 Dec 2019

PipeMare: Asynchronous Pipeline Parallel DNN Training
Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R. Aberger, Christopher De Sa
52 · 111 · 0 · 09 Oct 2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE
310 · 1,892 · 0 · 17 Sep 2019

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
Grzegorz Kwaśniewski, Marko Kabić, Maciej Besta, J. VandeVondele, R. Solcà, Torsten Hoefler
LRM
40 · 92 · 0 · 26 Aug 2019

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
99 · 58 · 0 · 12 Aug 2019

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
ODL
216 · 993 · 0 · 01 Apr 2019

Large-Batch Training for LSTM and Beyond
Yang You, Jonathan Hseu, Chris Ying, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
52 · 89 · 0 · 24 Jan 2019

Stochastic Gradient Push for Distributed Deep Learning
Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, Michael G. Rabbat
73 · 344 · 0 · 27 Nov 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
VLM, SSL, SSeg
1.6K · 94,511 · 0 · 11 Oct 2018

Beyond Data and Model Parallelism for Deep Neural Networks
Zhihao Jia, Matei A. Zaharia, A. Aiken
GNN, AI4CE
56 · 502 · 0 · 14 Jul 2018

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Tal Ben-Nun, Torsten Hoefler
GNN
55 · 707 · 0 · 26 Feb 2018

SparCML: High-Performance Sparse Communication for Machine Learning
Cédric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler
58 · 126 · 0 · 22 Feb 2018

Horovod: fast and easy distributed deep learning in TensorFlow
Alexander Sergeev, Mike Del Balso
97 · 1,221 · 0 · 15 Feb 2018

Regularized Evolution for Image Classifier Architecture Search
Esteban Real, A. Aggarwal, Yanping Huang, Quoc V. Le
150 · 3,025 · 0 · 05 Feb 2018

Asynchronous Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu
ODL
42 · 501 · 0 · 18 Oct 2017

Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
3DV
642 · 130,942 · 0 · 12 Jun 2017

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross B. Girshick, P. Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
3DH
120 · 3,675 · 0 · 08 Jun 2017

AMPNet: Asynchronous Model-Parallel Training for Dynamic Neural Networks
Alexander L. Gaunt, Matthew W. Johnson, M. Riechert, Daniel Tarlow, Ryota Tomioka, Dimitrios Vytiniotis, Sam Webster
BDL
49 · 31 · 0 · 27 May 2017

Optimization Methods for Large-Scale Machine Learning
Léon Bottou, Frank E. Curtis, J. Nocedal
223 · 3,205 · 0 · 15 Jun 2016

Training Deep Nets with Sublinear Memory Cost
Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin
98 · 1,165 · 0 · 21 Apr 2016

One weird trick for parallelizing convolutional neural networks
A. Krizhevsky
GNN
88 · 1,298 · 0 · 23 Apr 2014