Research field: Databases
Conference: International Conference on Database and Expert Systems Applications
Supercomputers come in a variety of sizes and architectures with thousands of interconnected nodes. Most organizations are required to produce metrics for their funding sources to prove that these machines are being utilized and are meeting their availability requirements. While tracking the state of an individual server is trivial, measuring the uptime of a supercomputer with several thousand nodes, spanning tens to hundreds of cabinets and rows and with one or more mounted file systems, is a complex task. Additionally, supercomputers have diverse architectures and System Logic (the unique characteristics of the machine itself, such as networking topology, size, partitions, hardware layout, physical configuration, and component hierarchy). These constraints complicate the computation of standardized metrics such as Mean Time To Interrupt (MTTI), Mean Time To Failure (MTTF), availability, and utilization. At the Argonne Leadership Computing Facility (ALCF), we developed a tool that standardizes the analysis of these machines so that these metrics can be computed accurately and efficiently. We call this tool the Operational Data Processing System (ODPS) and use it to process the data generated by Theta, a 4,392-node Cray XC40. In addition to the XC40, the tool also works with Mira, a 49,152-node IBM BG/Q system that ALCF houses. This paper explores how ODPS processes the data from Theta and Mira, including the storage design decisions and the architecture-independent approach to metric calculations. We quantitatively evaluate our approach, comparing it to alternative methods for storing and processing supercomputer machine state in the database.
| Publication year | 2023 |
|---|---|
| Citations | 0 |
| Country of publication | Israel |
| Site | Springer |