An optimized RDMA QP communication mechanism for hyperscale AI infrastructure


Research field: Infrastructure



Venue: Cluster Computing


Abstract

Current artificial intelligence (AI) infrastructure widely uses the remote direct memory access (RDMA) protocol for high-performance network communication, relying on Reliable Connection (RC)-based Queue Pairs (QPs) to ensure correct, ordered end-to-end data transmission. However, as AI infrastructure continues to scale, this RC-based QP communication mechanism scales poorly and is prone to congestion, which degrades network transfer performance. In this paper, we propose an optimized RDMA QP communication mechanism to address scalability and congestion issues in hyperscale AI infrastructure networks. First, we replace RC-based QPs with Reliable Datagram (RD)-based QPs and propose a new reliability mechanism to address the scalability problem, so that AI processes no longer need to establish QPs repeatedly when communicating externally. Second, to mitigate the congestion caused by single-path transmission, we implement multipath data transmission by introducing a new unordered reception method in the network software stack. Experiments and simulation tests show that the optimized RDMA QP communication exhibits exceptional scalability in large-scale AI infrastructure and significantly reduces the occurrence of congestion, improving overall network performance by more than 15%.
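
The abstract does not spell out how unordered reception works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each packet carries a sequence number, so the receiver can place the payload at its final offset regardless of arrival order and track completion with a set of received sequence numbers. The names `UnorderedReceiver`, `split_message`, and `mtu` are hypothetical.

```python
class UnorderedReceiver:
    """Reassembles one message whose packets may arrive in any order.

    Assumption (not from the paper): every packet carries a sequence
    number, and all packets except the last hold exactly `mtu` bytes.
    """

    def __init__(self, message_len: int, mtu: int) -> None:
        if message_len <= 0 or mtu <= 0:
            raise ValueError("message_len and mtu must be positive")
        self.message_len = message_len
        self.mtu = mtu
        self.buffer = bytearray(message_len)
        self.total_packets = -(-message_len // mtu)  # ceiling division
        self.received = set()

    def on_packet(self, seq: int, payload: bytes) -> bool:
        """Place a packet by sequence number; return True once complete."""
        if not 0 <= seq < self.total_packets:
            raise ValueError(f"sequence number {seq} out of range")
        offset = seq * self.mtu
        expected_len = min(self.mtu, self.message_len - offset)
        if len(payload) != expected_len:
            raise ValueError(f"packet {seq} has unexpected length")
        if seq not in self.received:  # ignore duplicates such as retransmissions
            self.buffer[offset:offset + expected_len] = payload
            self.received.add(seq)
        return len(self.received) == self.total_packets


def split_message(message: bytes, mtu: int) -> list:
    """Split a message into (sequence number, payload) pairs."""
    return [(i // mtu, message[i:i + mtu]) for i in range(0, len(message), mtu)]


# Simulate packets taking different paths and arriving out of order.
message = bytes(range(10))
packets = split_message(message, mtu=4)
arrival_order = [packets[2], packets[0], packets[1]]

receiver = UnorderedReceiver(message_len=len(message), mtu=4)
completion_flags = [receiver.on_packet(seq, data) for seq, data in arrival_order]
reassembled = bytes(receiver.buffer)
```

Because each payload is written directly to its final position, the receiver never has to hold packets back waiting for an earlier one, which is what allows a sender to spread one message across several paths.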


Author Profile
Junliang Wang

China Telecom Research Institute, China Telecom Co. Ltd., Guangzhou 510660, China

China
Author Profile
Baohong Lin

China Telecom Research Institute, China Telecom Co. Ltd., Guangzhou 510660, China

China
Author Profile
Jiao Zhang

State Key Laboratory of Networking and Switching Technology, BUPT, Beijing 100876, China

Andorra

📄 Paper information

Publication year: 2024
Citations: 0
Publication country: Andorra, China
Site: Springer
