Research field: Networking
Venue: CCF Transactions on High Performance Computing
As supercomputing and intelligent computing converge, the Supercomputer Internet has been proposed as a platform for building, deploying, and running convergence applications with cloud-native technologies. Message Passing Interface (MPI) programs are a representative class of supercomputing applications in parallel computing environments. Live migration transfers a running application to a different physical location with minimal downtime, which enables useful application management capabilities such as load balancing, resource consolidation, and fault tolerance. Although several works have studied live migration for MPI workloads, most require modifying the operating system kernel, which hinders their broader adoption in data centers. This paper uses container technology and the CRIU tool to checkpoint and restart a single container in a containerized MPI environment while ensuring that the MPI program continues to execute. It validates the feasibility of live migration for MPI workloads by testing with the NAS Parallel Benchmarks (NPB), LAMMPS, and GROMACS. It also discusses the impact of migration on MPI timing functions and proposes solutions. The paper observes a slight improvement in MPI computational performance due to migration, along with an increase in communication latency during the iterative process.
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | China |
| Site | Springer |