ResearchTrend.AI

DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

9 March 2024
Yoochan Kim
Kihyun Kim
Yonghyeon Cho
Jinwoo Kim
Awais Khan
Ki-Dong Kang
B. An
Myung-Hoon Cha
H. Kim
Youngjae Kim

Papers citing "DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud"

9 / 9 papers shown
Cloud Cost Optimization: A Comprehensive Review of Strategies and Case Studies
Saurabh Deochake
29
11
0
24 Jul 2023
SpotLake: Diverse Spot Instance Dataset Archive Service
Sungjae Lee
Jaeil Hwang
Kyungyong Lee
60
13
0
07 Feb 2022
Boost Neural Networks by Checkpoints
Feng Wang
Gu-Yeon Wei
Qiao Liu
Jinxiang Ou
Xian Wei
Hairong Lv
FedML, UQCV
52
10
0
03 Oct 2021
A Study of Checkpointing in Large Scale Training of Deep Neural Networks
Elvis Rojas
A. Kahira
Esteban Meneses
L. Bautista-Gomez
Rosa M. Badia
53
25
0
01 Dec 2020
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Shen Li
Yanli Zhao
R. Varma
Omkar Salpekar
P. Noordhuis
...
Adam Paszke
Jeff Smith
Brian Vaughan
Pritam Damania
Soumith Chintala
OOD, MoE
66
188
0
28 Jun 2020
Growing Together: Modeling Human Language Learning With n-Best Multi-Checkpoint Machine Translation
El Moatez Billah Nagoudi
Muhammad Abdul-Mageed
H. Cavusoglu
28
2
0
07 Jun 2020
torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models
Chiheon Kim
Heungsub Lee
Myungryong Jeong
Woonhyuk Baek
Boogeon Yoon
Ildoo Kim
Sungbin Lim
Sungwoong Kim
MoE, AI4CE
46
54
0
21 Apr 2020
Horovod: fast and easy distributed deep learning in TensorFlow
Alexander Sergeev
Mike Del Balso
102
1,222
0
15 Feb 2018
Checkpoint Ensembles: Ensemble Methods from a Single Training Process
Hugh Chen
Scott M. Lundberg
Su-In Lee
UQCV
55
63
0
09 Oct 2017