CENG 525

Fault-Tolerant Computing

Fault modeling, testing and redundancy techniques for achieving fault tolerance in computer systems; error detection; failure recovery; error coverage; current research in the field.

Course Objectives

To introduce fault modeling, testing and redundancy techniques to achieve fault tolerance in computer systems

Recommended or Required Reading

Israel Koren, C. Mani Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
B.S. Dhillon, Computer System Reliability, CRC Press, 2013.

Learning Outcomes

1. To be able to understand faults and testing
2. To be able to design and evaluate hardware and software fault tolerance techniques
3. To demonstrate the ability to apply reliability techniques to safety-critical systems

Week Topics
1 Introduction to fault tolerance
2 Digital circuits and fault modeling
3 Testing for combinational and sequential circuits
4 Testing of microprocessor-based systems
5 Error detection, self-checking modules
6 Malfunction diagnosis, redundancy
7 Midterm
8 Software reliability
9 Resilient algorithms
10 Error coverage
11 Vulnerability discovery
12 Failure recovery
13 Current research in the field
14 Current research in the field

Grading

Written Midterm Exam: 20%

Written Final Exam: 30%

Assignments: 10%

Term Project: 40%