Scalable Distributed Training of Recommendation Models: An ASTRA-SIM + NS3 case-study with TCP/IP transport


Research field: Networking



Conference: 2020 IEEE Symposium on High-Performance Interconnects (HOTI)


Abstract

Recommendation model DNNs have gained significant attention due to their vital role in recommending the best content to the user. However, to further increase accuracy, these DNNs are becoming more complex and are trained on more data, making training on a single node infeasible. Distributed training tackles this problem by employing multiple nodes. The importance of recommendation models necessitates designing customized HW/SW platforms for training them that minimize the communication overheads among nodes. However, exploring this design space is difficult because of the many HW/SW parameters involved and the limited ability to change HW parameters in real systems. In this paper, we port the previously proposed ASTRA-SIM simulation platform on top of the versatile NS3 network simulator by introducing a portable network interface for ASTRA-SIM. Using NS3 enables modeling a wide variety of networks with much better accuracy. Furthermore, we enhance NS3 with detailed modeling of TCP/IP. Finally, we study various HW/SW platforms for the DLRM recommendation model with TCP/IP as the network protocol and analyze the communication overheads in the presence of various interconnect configurations.


Authors

Saeed Rashidi, Georgia Institute of Technology, Atlanta, USA
Pallavi Shurpali, Facebook, Menlo Park, USA
Srinivas Sridharan, Facebook, Menlo Park, USA

Paper information

Publication year: 2020
Citations: 7
Country of publication: Georgia, United States
Site: IEEE
