Research field: Analysis
Conference: SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Identifying system hardware failures and anomalies is a unique challenge in heterogeneous testbed clusters because the system log reports errors and warnings in widely varying ways. We present a novel approach for the real-time classification of syslog messages generated by a heterogeneous testbed cluster to proactively identify potential hardware issues and security events. By integrating machine learning models with high-performance computing systems, our system facilitates continuous system health monitoring. The paper introduces a taxonomy for classifying system issues into actionable problem categories, while filtering out groups of messages that system administrators would consider unimportant "noise". Finally, we experiment with using large language models as a message classifier and report our results and experience. The results show promising performance and are more explainable than those of currently available techniques, but the computational cost may offset these benefits.
| Publication year | 2023 |
|---|---|
| Citations | 1 |
| Country of publication | United States |
| Site | ACM |