Characterization of Large Language Model Development in the Datacenter

12 March 2024
Qi Hu
Zhisheng Ye
Zerui Wang
Guoteng Wang
Mengdie Zhang
Qiaoling Chen
Peng Sun
Dahua Lin
Xiaolin Wang
Yingwei Luo
Yonggang Wen
Tianwei Zhang

Papers citing "Characterization of Large Language Model Development in the Datacenter"

14 / 14 papers shown
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Yongqian Li
Haojie Wang
Biao Hou
Jidong Zhai
40
0
0
12 May 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Yuhang Zhang
Yixuan Wang
Shouda Liu
MLLM
MoE
67
1
0
31 Mar 2025
Green Prompting
Marta Adamska
Daria Smirnova
Hamid Nasiri
Zhengxin Yu
Peter Garraghan
160
0
0
09 Mar 2025
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng
Xiang Shi
Zhuo Jiang
X. Zhang
Lei Zhang
...
Fuliang Li
Shuguang Wang
H. Lin
Jianxi Ye
Minlan Yu
LRM
142
2
0
04 Nov 2024
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
Shri Kiran Srinivasan
Kalyan Saladi
Carole-Jean Wu
149
7
0
29 Oct 2024
A Survey on Failure Analysis and Fault Injection in AI Systems
Guangba Yu
Gou Tan
Haojia Huang
Zhenyu Zhang
Pengfei Chen
Roberto Natella
Zibin Zheng
36
3
0
28 Jun 2024
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Yifan Xiong
Yuting Jiang
Ziyue Yang
L. Qu
Guoshuai Zhao
...
Luke Melton
Joe Chau
Peng Cheng
Yongqiang Xiong
Lidong Zhou
52
6
0
09 Feb 2024
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
314
3,248
0
21 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
313
11,953
0
04 Mar 2022
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
Sanjith Athlur
Nitika Saran
Muthian Sivathanu
Ramachandran Ramjee
Nipun Kwatra
GNN
31
80
0
07 Nov 2021
Carbon Emissions and Large Neural Network Training
David A. Patterson
Joseph E. Gonzalez
Quoc V. Le
Chen Liang
Lluís-Miquel Munguía
D. Rothchild
David R. So
Maud Texier
J. Dean
AI4CE
244
644
0
21 Apr 2021
tf.data: A Machine Learning Data Processing Framework
D. Murray
Jiří Šimša
Ana Klimovic
Ihor Indyk
PINN
AI4CE
LMTD
39
87
0
28 Jan 2021
ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren
Samyam Rajbhandari
Reza Yazdani Aminabadi
Olatunji Ruwase
Shuangyang Yang
Minjia Zhang
Dong Li
Yuxiong He
MoE
177
414
0
18 Jan 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
245
1,821
0
17 Sep 2019