Failure Detection and Diagnosis in
Architecture-based Autonomic Systems
Paulo Casanova.
PhD thesis, Nr. (CMU-S3D-12-100), Software and Societal Systems Department, School of Computer Science, Carnegie Mellon University, April 2023.
Abstract
As the size and complexity of modern IT systems increase, there is greater
need for automatic recovery from failures. Recently, self-adaptive control loops have
started to replace human oversight as a means to ensure high availability of software
systems. Two critical pieces of the self-adaptive loop for high availability are failure
identification and fault localization. Failure identification – figuring out that
something is not working – is a challenging activity as (1) the monitoring is not done at
the abstraction level at which failures manifest themselves, and (2) because
systems perform several activities concurrently, incorrect behavior will appear mixed
with correct behavior. Fault localization, pinpointing the source of the failure, is also
challenging as (1) there may be multiple explanations for a fault and (2) diagnosis
must be performed in a useful time frame. In this thesis, we propose to improve
self-diagnosis through a framework that allows a system to identify failures and pinpoint
the corresponding faulty parts in a running system. This framework is based on two
key principles: reasoning about the system’s behavior at the software architecture
level and providing a declarative approach to describe system behavior. The use of
architectural models allows the diagnostic infrastructure to scale gracefully, supports
efficient run-time execution of common fault localization algorithms, and supports
failure diagnosis of system-level properties such as end-to-end performance. The
use of a declarative approach to behavior allows one to systematically specify rules
for bridging the gap between low-level monitoring and higher-level problem detection.
It also supports reuse across systems that share a common architectural style or
implementation infrastructure.
Keywords: Diagnosis, Self-adaptation.