ITRT(IT Research Trends)

Congestion control in machine learning clusters

Research area: Artificial Intelligence

Paper keywords: #algorithms #unfairness #competing #unfair #decades

Venue: HotNets '22: Proceedings of the 21st ACM Workshop on Hot Topics in Networks

Abstract

This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, introducing unfairness improves the training time for all competing jobs. We call this specific combination of jobs compatible and define the compatibility criterion using a novel geometric abstraction. Our abstraction rolls time around a circle and rotates the communication phases of jobs to identify fully compatible jobs. Using this abstraction, we demonstrate up to 1.3× improvement in the average training iteration time of popular ML models. We advocate that resource management algorithms should take job compatibility on network links into account. We then propose three directions to ameliorate the impact of network congestion in ML training clusters: (i) an adaptively unfair congestion control scheme, (ii) priority queues on switches, and (iii) precise flow scheduling.
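The geometric idea in the abstract (rolling time around a circle and rotating each job's communication phase) can be illustrated with a small sketch. This is not the paper's algorithm: it assumes all jobs share one iteration period, that time is divided into integer slots, and that each job has a single contiguous communication phase per iteration. The names `comm_slots` and `find_compatible_rotations` are hypothetical, and the brute-force search is only meant for a handful of jobs.

```python
from itertools import product


def comm_slots(start, length, period):
    """Time slots (taken modulo the period) in which a job communicates."""
    return {(start + k) % period for k in range(length)}


def find_compatible_rotations(jobs, period):
    """Search for rotations that make all communication arcs disjoint.

    jobs: list of (comm_start, comm_length) pairs in integer time slots.
    Assumption of this sketch: every job has the same iteration period.
    Returns one rotation per job (job 0 stays fixed), or None if the
    arcs cannot be placed on the circle without overlapping.
    """
    if not jobs:
        return ()
    if sum(length for _, length in jobs) > period:
        return None  # total communication time exceeds the circle
    # Brute force: try every rotation of every job except the first.
    for shifts in product(range(period), repeat=len(jobs) - 1):
        rotations = (0,) + shifts
        occupied = set()
        overlap = False
        for (start, length), rot in zip(jobs, rotations):
            slots = comm_slots(start + rot, length, period)
            if occupied & slots:
                overlap = True
                break
            occupied |= slots
        if not overlap:
            return rotations
    return None


if __name__ == "__main__":
    # Two jobs that both communicate in slots 0-3 of a 10-slot iteration.
    print(find_compatible_rotations([(0, 4), (0, 4)], period=10))
    # Two jobs whose communication phases cannot both fit on the circle.
    print(find_compatible_rotations([(0, 6), (0, 6)], period=10))
```

In this toy model, a set of jobs is "fully compatible" when some rotation exists in which no two communication phases share a slot, so the jobs never compete for the link. Jobs with different iteration times or several communication phases per iteration would need a richer model than this sketch provides.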

📄 Paper information

Publication year	2022
Citations	21
Country of publication
Site	ACM
Likes	0

Sudarsanan Rajasekaran

Manya Ghobadi

Gautam Kumar

Aditya Akella

Related papers (196)
