Congestion control in machine learning clusters


Research field: Artificial Intelligence



Venue: HotNets '22: Proceedings of the 21st ACM Workshop on Hot Topics in Networks


Abstract

This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, introducing unfairness improves the training time for all competing jobs. We call this specific combination of jobs compatible and define the compatibility criterion using a novel geometric abstraction. Our abstraction rolls time around a circle and rotates the communication phases of jobs to identify fully compatible jobs. Using this abstraction, we demonstrate up to 1.3× improvement in the average training iteration time of popular ML models. We advocate that resource management algorithms should take job compatibility on network links into account. We then propose three directions to ameliorate the impact of network congestion in ML training clusters: (i) an adaptively unfair congestion control scheme, (ii) priority queues on switches, and (iii) precise flow scheduling.
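The geometric abstraction in the abstract can be illustrated with a small sketch. The abstract does not spell out the compatibility criterion, so the version below rests on assumptions: time is measured in integer units, each job has a fixed iteration period with one communication phase per iteration, the circle's perimeter is the least common multiple of the periods, and a set of jobs counts as compatible when some rotation leaves no two communication phases overlapping. The names `comm_slots` and `find_compatible_rotation` are hypothetical and are not taken from the paper.

```python
from itertools import product
from math import lcm  # accepts multiple arguments on Python 3.9+


def comm_slots(period, comm_len, shift, perimeter):
    """Unit time slots on the circle during which one job communicates.

    Assumption: the job communicates for `comm_len` units once per `period`,
    and `shift` rotates that phase around the circle.
    """
    slots = set()
    for start in range(0, perimeter, period):
        for t in range(comm_len):
            slots.add((start + shift + t) % perimeter)
    return slots


def find_compatible_rotation(jobs):
    """Brute-force search for rotations with no overlapping communication.

    jobs: list of (period, comm_len) tuples in integer time units.
    Returns one shift per job, or None if no rotation avoids overlap.
    """
    perimeter = lcm(*(period for period, _ in jobs))
    # Pin the first job at shift 0; rotate every other job within its period.
    ranges = [range(1)] + [range(period) for period, _ in jobs[1:]]
    for shifts in product(*ranges):
        used = set()
        overlap = False
        for (period, comm_len), shift in zip(jobs, shifts):
            slots = comm_slots(period, comm_len, shift, perimeter)
            if used & slots:
                overlap = True
                break
            used |= slots
        if not overlap:
            return list(shifts)
    return None


if __name__ == "__main__":
    # Two jobs with a 10-unit iteration, communicating for 4 and 5 units.
    print(find_compatible_rotation([(10, 4), (10, 5)]))
    # Two jobs that each communicate for 6 of 10 units cannot be interleaved.
    print(find_compatible_rotation([(10, 6), (10, 6)]))
```

Under these assumptions, the first pair is compatible (the second job's communication can be rotated into the first job's compute phase) and the second pair is not, since their combined communication time exceeds the perimeter. The exhaustive search is only meant to make the idea concrete; it is exponential in the number of jobs.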


Author Profile
Sudarsanan Rajasekaran

Massachusetts Institute of Technology

Manya Ghobadi

Massachusetts Institute of Technology

Gautam Kumar

Google


Paper information

Publication year: 2022
Citations: 21
Site: ACM
