An Approach for Efficient Processing of Machine Operational Data


Research field: Databases



Conference: International Conference on Database and Expert Systems Applications


Abstract

Supercomputers come in a variety of sizes and architectures, often with thousands of interconnected nodes. Most organizations must report metrics to their funding sources to show that these machines are being utilized and are meeting their availability requirements. While tracking the state of an individual server is trivial, measuring the uptime of a supercomputer with several thousand nodes, spanning tens to hundreds of cabinets and rows and mounting one or more file systems, is a complex task. Additionally, supercomputers have diverse architectures and System Logic (the unique characteristics of the machine itself, such as networking topology, size, partitions, hardware layout, physical configuration, and component hierarchy). These constraints complicate the computation of standardized metrics such as Mean Time To Interrupt (MTTI), Mean Time To Failure (MTTF), availability, and utilization. At the Argonne Leadership Computing Facility (ALCF), we developed a tool that standardizes the analysis of these machines so that these metrics can be computed accurately and efficiently. We call this tool the Operational Data Processing System (ODPS) and use it to process the data generated by Theta, a 4,392-node Cray XC40. The tool also works with Mira, a 49,152-node IBM BG/Q system housed at ALCF. This paper explores how ODPS processes the data from Theta and Mira, including the storage design decisions and the architecture-independent approach to metric calculations. We quantitatively evaluate our approach, comparing it to alternative methods for storing and processing supercomputer machine state in the database.
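To make the metrics named in the abstract concrete, the following is a minimal Python sketch of how node-hour-weighted availability and MTTI might be computed from a list of outage records. It is not the ODPS implementation: the `Outage` record, the `availability` and `mtti` functions, the sample outages, and the 720-hour reporting period are all hypothetical, and the formulas are common simplified definitions that assume outages do not overlap. Only the Theta node count (4,392) comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """Hypothetical record: `nodes` nodes were down from `start` to `end` (hours)."""
    start: float
    end: float
    nodes: int
    system_wide: bool = False  # True when the whole machine was interrupted


def availability(outages, total_nodes, period_hours):
    """Node-hour-weighted availability; assumes outages do not overlap."""
    scheduled = total_nodes * period_hours
    lost = sum((o.end - o.start) * o.nodes for o in outages)
    return (scheduled - lost) / scheduled


def mtti(outages, period_hours):
    """Mean time to interrupt: system uptime divided by the number of
    system-wide interrupts (a simplified, assumed definition)."""
    interrupts = [o for o in outages if o.system_wide]
    if not interrupts:
        return float("inf")
    downtime = sum(o.end - o.start for o in interrupts)
    return (period_hours - downtime) / len(interrupts)


# Illustrative data only: one full-system interrupt and one partial outage
# on a machine the size of Theta over a hypothetical 720-hour period.
THETA_NODES = 4392
PERIOD_HOURS = 720.0
sample_outages = [
    Outage(start=100.0, end=112.0, nodes=THETA_NODES, system_wide=True),
    Outage(start=300.0, end=304.0, nodes=128),
]

theta_availability = availability(sample_outages, THETA_NODES, PERIOD_HOURS)
theta_mtti = mtti(sample_outages, PERIOD_HOURS)
```

Weighting by node-hours lets a partial outage count in proportion to the hardware it affects, which is one reason a per-server uptime check does not scale directly to a machine with thousands of nodes.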


Authors
Ben Lenard

Argonne National Laboratory Lemont IL USA

Eric Pershey

DePaul University Chicago IL USA

Zachary Nault

Argonne National Laboratory Lemont IL USA


Paper information

Publication year: 2023
Citations: 0
Publication country: Israel
Site: Springer
