When a catastrophic failure occurs in a complex system — an explosion at a chemical plant, a structural collapse on a construction site, a critical malfunction in medical equipment — investigators almost always find the same thing. The event wasn't caused by a single, isolated failure. It was the result of multiple contributing factors, often from entirely different parts of the system, converging in a way that nobody anticipated. Understanding how those factors combine, and identifying them before disaster strikes, is exactly what fault tree analysis is designed to do.
Fault tree analysis (FTA) is one of the most widely used and most technically rigorous methods in the safety engineering toolkit. It is used across industries from nuclear power and aerospace to oil and gas, automotive manufacturing, and healthcare: anywhere that the consequences of system failure are severe enough to demand systematic, structured analysis. This article explains what fault tree analysis is, how it works, how to read and construct a fault tree, and how to use it effectively as part of a broader safety management system.
Fault tree analysis is a top-down, deductive analytical technique used to identify the combinations of events and conditions that could lead to a specific undesired outcome — typically a catastrophic or safety-critical failure. The analysis begins with that undesired outcome, called the "top event," and works backward through the system to identify all the possible causal pathways that could produce it.
The result is a logical diagram — the fault tree itself — that maps the relationships between the top event and its contributing causes using standardized logic gates. This visual structure makes it possible to see, at a glance, how failures in different parts of a system relate to one another, which combinations of failures are most dangerous, and where protective measures would have the greatest effect.
FTA was developed in the early 1960s at Bell Telephone Laboratories, initially for use in the US Air Force's Minuteman intercontinental ballistic missile program. It was quickly adopted by the aerospace and nuclear industries and has since spread across virtually every sector where complex, high-consequence systems are designed and operated. Decades of application have refined both the methodology and the tools used to conduct it, but the core logic remains unchanged from its original formulation.
Fault tree analysis (FTA) is used for several distinct purposes within safety engineering and risk management:
Hazard identification and risk assessment — FTA provides a structured method for identifying all plausible failure pathways to a specific top event, ensuring that low-probability but high-consequence combinations are not overlooked.
Design evaluation — During the design phase of a project, FTA can be used to evaluate competing design options, identify single points of failure, and verify that safety requirements are met before construction or manufacture begins.
Quantitative risk assessment — When failure probability data is available, FTA can be used to calculate the probability of the top event occurring, enabling quantitative comparison of risk levels against defined acceptance criteria.
Safety integrity verification — FTA is used to verify that safety instrumented systems and other protective layers achieve the required Safety Integrity Level (SIL) under the IEC 61511 and IEC 61508 functional safety standards.
Incident investigation — FTA can be applied retrospectively to map the causal factors of an incident that has already occurred, providing a structured framework for root cause analysis.
Regulatory compliance — In heavily regulated industries, FTA is often required by regulators as part of the safety case or safety report that operators must submit to demonstrate adequate risk management.
Understanding how fault tree analysis works requires familiarity with its structure, its notation, and the analytical process used to build and evaluate a fault tree.
Every fault tree analysis begins with a clearly defined top event — the specific undesired outcome being analyzed. The top event must be precisely defined: vague top events like "something goes wrong" or "equipment fails" produce vague, unfocused fault trees. A well-defined top event is specific, observable, and unambiguous — for example, "rupture of high-pressure reactor vessel during normal operations" or "uncontrolled release of chlorine gas from storage tank."
Choosing the right top event is critical to the value of the analysis. In most applications, the top event represents a major accident scenario — one that could result in fatalities, serious injuries, significant environmental damage, or major asset loss. The selection is usually informed by prior hazard identification studies, incident history, regulatory requirements, and engineering judgment.
The fault tree is constructed using a standardized set of symbols that represent events and the logical relationships between them. Understanding these symbols is essential to reading and constructing fault trees correctly.
The OR Gate is the most common logic gate in fault tree analysis. An OR gate indicates that the output event occurs if any one of the input events occurs. An OR gate represents a situation where multiple independent failure paths each lead to the same outcome — the failure of any single path is sufficient to produce the output.
The AND Gate indicates that the output event occurs only if all of the input events occur simultaneously. An AND gate represents a situation where multiple failures must coincide — for example, both a primary safety system and its backup must fail at the same time for the top event to occur. AND gates are where redundancy and defense-in-depth show up in a fault tree, and they are critical to understanding the protective value of multiple independent layers of protection.
Basic Events are represented by circles and indicate primary fault events — failures at the lowest level of resolution that the analysis is intended to examine. These are the leaves of the fault tree.
Intermediate Events are represented by rectangles and indicate fault events that result from one or more lower-level events combined through a gate. These sit in the middle of the tree structure.
Undeveloped Events are represented by diamonds and indicate events that are not further developed — either because further analysis is not warranted given their low probability, or because insufficient information is available.
Transfer Symbols are used to connect sections of a large fault tree across multiple pages or diagrams, maintaining logical continuity without requiring the entire tree to be drawn on one sheet.
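The gate logic described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard FTA data model: the class names, the `occurs` function, and the small example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class BasicEvent:
    """Leaf of the tree (drawn as a circle)."""
    name: str


@dataclass
class Gate:
    """Intermediate or top event (drawn as a rectangle) fed by an OR or AND gate."""
    name: str
    kind: str  # "OR" or "AND"
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def occurs(node: Union[Gate, BasicEvent], state: Dict[str, bool]) -> bool:
    """Return True if the event at `node` occurs for the given basic-event states."""
    if isinstance(node, BasicEvent):
        return state[node.name]
    results = [occurs(child, state) for child in node.inputs]
    if node.kind == "OR":
        return any(results)   # any one input is sufficient
    if node.kind == "AND":
        return all(results)   # every input must occur together
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical two-level tree: TOP = A AND (B OR C)
tree = Gate("TOP", "AND", [
    BasicEvent("A"),
    Gate("G1", "OR", [BasicEvent("B"), BasicEvent("C")]),
])

print(occurs(tree, {"A": True, "B": False, "C": True}))   # True
print(occurs(tree, {"A": False, "B": True, "C": True}))   # False
```

Undeveloped events and transfer symbols are left out of the sketch; for evaluation purposes an undeveloped event behaves like a basic event.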
The construction of a fault tree follows a systematic top-down process. Starting from the top event, the analyst asks: "What are the immediate, necessary, and sufficient causes of this event?" The answer defines the first level of the tree, connected to the top event by the appropriate gate — OR if any one cause is sufficient, AND if all causes must be present simultaneously.
This questioning process is then repeated for each intermediate event, level by level, until all branches of the tree have been developed to the level of basic events. The analyst must maintain a consistent level of resolution throughout — mixing very detailed analysis in some branches with high-level analysis in others creates an unbalanced tree that can produce misleading results.
The development of the fault tree is typically a team activity, drawing on the knowledge of engineers, operators, maintenance personnel, and safety professionals. No single individual has complete knowledge of all the ways a complex system can fail, and diverse perspectives are essential to producing a comprehensive tree.
Once the fault tree is constructed, qualitative evaluation identifies the minimum cut sets of the tree (usually called minimal cut sets in the reliability literature). A cut set is a combination of basic events whose simultaneous occurrence causes the top event. A minimum cut set is a cut set with nothing superfluous in it: if any one event is removed, the remaining events no longer cause the top event on their own.
Minimum cut sets are fundamental to understanding fault tree results. They reveal the most direct pathways to the top event and identify the most critical failure combinations. Single-event minimum cut sets — where one basic event alone is sufficient to cause the top event — represent single points of failure that deserve immediate attention. Two-event minimum cut sets represent combinations that require simultaneous failure of two independent elements, and are inherently more robust.
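One way to find minimum cut sets for a small tree is to expand the gates from the top down and then discard any cut set that contains another. The sketch below assumes a tree written as nested tuples, a representation chosen here purely for illustration. Commercial tools use more efficient algorithms for large trees.

```python
from itertools import product


def cut_sets(node):
    """Return all cut sets of `node` as a set of frozensets of basic-event names.

    A node is either a string (basic event) or a tuple ("OR" | "AND", [children]).
    """
    if isinstance(node, str):
        return {frozenset([node])}
    kind, children = node
    child_sets = [cut_sets(c) for c in children]
    if kind == "OR":
        # Any cut set of any input is a cut set of the output.
        return set().union(*child_sets)
    if kind == "AND":
        # Take one cut set from every input and merge them.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(kind)


def minimal(sets):
    """Drop every cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical tree: TOP = A OR (B AND C) OR (A AND B)
tree = ("OR", ["A", ("AND", ["B", "C"]), ("AND", ["A", "B"])])
mcs = minimal(cut_sets(tree))
for s in sorted(mcs, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
# ['A']
# ['B', 'C']
```

Here `A` is a single point of failure, and the cut set `{A, B}` is discarded because `A` alone already causes the top event.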
When failure rate data is available — from industry databases, equipment manufacturer specifications, historical maintenance records, or published reliability data — the fault tree can be used to calculate the probability of the top event occurring over a defined time period.
For OR gates, the probability of the output event is calculated from the probabilities of the input events using the inclusion-exclusion principle. For AND gates, the probability of the output is the product of the input probabilities (assuming independence). For large, complex trees, dedicated software tools are used to perform these calculations efficiently and accurately.
The result is a numerical probability estimate for the top event — for example, a probability of catastrophic failure of 1 × 10⁻⁵ per year. This figure can be compared against defined risk acceptance criteria to determine whether the system as designed is acceptably safe, or whether additional risk reduction measures are required.
Uncertainty in input data must be treated carefully. Failure rate data is rarely precise, and small variations in input values can produce significant variations in output probabilities. Sensitivity analysis — examining how the top event probability changes as input values are varied — is an important part of quantitative FTA.
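A simple form of sensitivity analysis varies one input at a time and records how far the top event probability moves. The sketch below uses a hypothetical three-event tree and made-up probabilities, so the numbers carry no engineering meaning.

```python
def top_probability(p):
    """Hypothetical tree: TOP = (A OR B) AND C, all basic events independent."""
    p_or = p["A"] + p["B"] - p["A"] * p["B"]
    return p_or * p["C"]


# Illustrative values only, not real failure data.
baseline = {"A": 1e-3, "B": 5e-4, "C": 1e-2}
base_top = top_probability(baseline)


def sensitivity(factor=10.0):
    """Scale each input up by `factor` in turn; return new/baseline top probability."""
    result = {}
    for name in baseline:
        varied = dict(baseline)
        varied[name] = min(1.0, baseline[name] * factor)
        result[name] = top_probability(varied) / base_top
    return result


ratios = sensitivity()
for name, ratio in ratios.items():
    print(f"{name}: top event probability x{ratio:.2f}")
```

In this example the top event scales one-for-one with `C`, because `C` sits directly under the AND gate, while `A` and `B` share their influence through the OR gate.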
An example of a fault tree analysis helps make the abstract logic concrete. Consider a simplified scenario involving a pump failure in a water treatment facility.
Top Event: Loss of treated water supply to distribution network.
The analyst identifies two immediate causes: failure of the primary pump, and failure of the backup pump to start or run. Because the backup exists to cover for the primary, supply is lost only if both occur, so the two causes are joined by an AND gate, not an OR gate.
Developing the primary pump failure branch further, the analyst identifies that the primary pump can fail due to mechanical failure of the pump itself, OR electrical supply failure to the pump motor, OR failure of the pump control system to maintain operation.
Developing the backup pump failure branch, the analyst identifies that the backup pump may fail to operate because it is out of service for maintenance when the primary pump fails, OR because the automatic switchover system that detects primary pump failure and starts the backup fails, OR because of mechanical failure of the backup pump itself. The maintenance outage matters only if it coincides with a primary pump failure, a condition the AND gate under the top event already captures.
This simplified example immediately reveals something important: the backup pump being out of service for maintenance, combined with any one cause of primary pump failure, forms a minimum cut set of only two events. In a real water treatment facility, this would drive a maintenance scheduling policy that prevents both pumps from being simultaneously unavailable. That's fault tree analysis doing exactly what it's designed to do: revealing dangerous combinations that might not be obvious from a less structured analysis.
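The cut sets of the pump example can be enumerated directly. The sketch below assumes the six causes named in the example are distinct, independent basic events. If the two pumps shared an electrical supply, for instance, that shared supply would appear as a single-event cut set and the picture would change.

```python
from itertools import product

# Basic events named after the causes listed in the example.
primary_fails = [
    "primary pump mechanical failure",
    "primary pump electrical supply failure",
    "primary pump control system failure",
]
backup_fails = [
    "backup pump out of service for maintenance",
    "automatic switchover failure",
    "backup pump mechanical failure",
]

# Top event = (any primary cause) AND (any backup cause), so every minimum
# cut set pairs one cause from each OR branch.
minimal_cut_sets = [frozenset(pair) for pair in product(primary_fails, backup_fails)]

print(len(minimal_cut_sets))  # 9
single_points = [s for s in minimal_cut_sets if len(s) == 1]
print(single_points)          # []
```

Under these assumptions every minimum cut set has two events, and three of the nine involve the backup pump being out for maintenance.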
A well-documented fault tree analysis consists of several components beyond the tree diagram itself.
The scope and objective statement defines the top event precisely, sets the boundaries of the analysis (what systems and failure modes are included), specifies the level of resolution to which the tree will be developed, and identifies the assumptions that underpin the analysis.
The fault tree diagram is the central deliverable — a clear, correctly notated logical diagram that maps the causal structure of the top event. For complex systems, this may run to multiple sheets connected by transfer symbols.
The minimum cut set list is a tabulated summary of all minimum cut sets identified in the qualitative evaluation, ranked by order (single-event cut sets first, then two-event, and so on) and by criticality.
The quantitative results (where applicable) present the calculated top event probability, the contribution of individual basic events and minimum cut sets to the overall probability, sensitivity analysis results, and comparison against risk acceptance criteria.
The recommendations section documents the risk reduction measures identified by the analysis, with priority based on the minimum cut set analysis and quantitative results.
When fault tree analysis is used as part of a safety case or regulatory submission, documentation standards are typically specified by the relevant regulatory framework and must be followed precisely.
Calculating probability in fault tree analysis requires failure rate data for the basic events in the tree, combined with the logical rules that govern how event probabilities combine through gates.
For a basic event, the probability of failure over a given time period T is typically estimated using the exponential distribution: P(failure) = 1 − e^(−λT), where λ is the failure rate (failures per unit time). For small values of λT, this approximates to P ≈ λT.
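The exact formula and its small-λT approximation can be compared numerically. The failure rate and exposure period below are illustrative assumptions, not values taken from any database.

```python
import math


def failure_probability(rate_per_hour: float, hours: float) -> float:
    """Probability of at least one failure in `hours` at a constant failure rate.

    Uses expm1 so that very small values of rate * time keep their precision.
    """
    return -math.expm1(-rate_per_hour * hours)


rate = 1e-6       # assumed failure rate, failures per hour (illustrative)
period = 8760.0   # one year of continuous operation, in hours

exact = failure_probability(rate, period)
approx = rate * period

print(f"exact  = {exact:.6f}")
print(f"approx = {approx:.6f}")
print(f"relative error of approximation = {(approx - exact) / exact:.2%}")
```

The approximation always errs on the high side, and the gap grows as λT increases, so it is worth checking the product before relying on P ≈ λT.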
For an OR gate with two independent input events A and B: P(output) = P(A) + P(B) − P(A) × P(B). For small probabilities, this approximates to P(A) + P(B).
For an AND gate with two independent input events A and B: P(output) = P(A) × P(B). This is why redundant systems are so effective — even if each individual component has a failure probability of 1 × 10⁻², two independent redundant components in an AND configuration produce a combined failure probability of 1 × 10⁻⁴ — a hundredfold reduction.
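The two gate rules generalize to any number of independent inputs. A minimal sketch, with hypothetical function names `or_gate` and `and_gate`:

```python
import math
from typing import Sequence


def or_gate(probs: Sequence[float]) -> float:
    """P(at least one input occurs), assuming independent inputs."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def and_gate(probs: Sequence[float]) -> float:
    """P(all inputs occur), assuming independent inputs."""
    return math.prod(probs)


single = 1e-2  # failure probability of one component, as in the text

print(and_gate([single, single]))  # about 1e-4, the hundredfold reduction
print(or_gate([single, single]))   # about 1.99e-2, close to the 2e-2 sum
```

For two inputs, `or_gate` reduces to P(A) + P(B) − P(A) × P(B), the formula given above.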
For systems where components are not fully independent — where a common cause could disable multiple elements simultaneously — common cause failure analysis must be incorporated into the quantitative evaluation. Ignoring common cause failures in AND gate configurations is one of the most significant sources of non-conservatism in quantitative fault tree analysis.
Software tools including ITEM Toolkit, ISOGRAPH FaultTree+, and Reliability Workbench automate the probability calculation process for large trees and provide features for sensitivity analysis, importance measures, and uncertainty quantification.
Fault tree analysis is one of several complementary methods used in safety engineering, and understanding how it relates to other techniques helps practitioners choose the right tool for the job.
FTA vs. FMEA (Failure Mode and Effects Analysis): FMEA is a bottom-up technique that starts from component failure modes and works upward to determine their effects on the system. FTA is top-down, starting from an undesired outcome and working downward to find its causes. The two methods are complementary — FMEA is effective at identifying all the effects of individual component failures, while FTA is more effective at identifying the specific combinations of failures that lead to catastrophic outcomes. They are frequently used together.
FTA vs. HAZOP: HAZOP is a structured team-based technique used primarily to identify hazards in chemical process designs. It is more comprehensive in its coverage of deviation scenarios but less rigorous in its logical modeling of failure combinations. FTA is typically used to analyze specific high-consequence scenarios in greater depth after HAZOP has identified them.
FTA vs. Event Tree Analysis (ETA): ETA is a forward-looking technique that models what happens after an initiating event, analyzing the success or failure of protective systems and emergency responses. FTA and ETA are frequently combined in what is called a Bowtie model — the fault tree sits on the left side of the bowtie, modeling the causes of the top event, while the event tree sits on the right side, modeling the consequences. Together they provide a complete picture of both prevention and mitigation.
Even experienced practitioners can make errors in FTA that undermine the quality of results.
Poorly defined top events are the most fundamental mistake. A top event that is too vague or too broad produces a sprawling, unmanageable tree that fails to provide actionable insights.
Inconsistent resolution — developing some branches to component level while leaving others at system level — produces a tree where quantitative results are unreliable and minimum cut sets are incomplete.
Ignoring common cause failures in quantitative analysis leads to optimistic probability estimates for redundant systems that share common elements — common power supplies, common maintenance crews, common environmental exposures.
Treating the fault tree as a one-time deliverable rather than a living document is a missed opportunity. As systems change, as incident history accumulates, and as new failure mode data becomes available, fault trees should be updated to reflect current reality.
Confusing OR and AND gates — perhaps the most technically consequential error — produces completely incorrect logical structure and leads to conclusions that are the opposite of what the actual system behavior warrants.
A fault tree analysis is a top-down, deductive analytical technique used to systematically identify all the combinations of events and conditions that could lead to a specific undesired outcome in a complex system. It represents those causal relationships in a logical diagram — the fault tree — using standardized gates and symbols that make the relationships between failure causes visually clear and mathematically tractable.
FTA should be used when the consequences of system failure are potentially severe and when the causal structure of failure is complex enough that it cannot be reliably understood through informal analysis alone. It is most valuable during the design and development phase of projects, where it can identify single points of failure and dangerous combinations before they are built into physical systems, but it remains valuable throughout the operational life of a facility as a tool for verifying the integrity of protective systems, evaluating proposed changes, and investigating incidents.
Specific triggers for conducting a fault tree analysis include the identification of a major accident hazard through a HAZOP or other preliminary hazard study, a regulatory requirement for quantitative risk assessment, a management decision to verify that a safety instrumented system meets its required Safety Integrity Level, and the occurrence of a significant incident or near miss where the causal structure is not immediately apparent. In each of these situations, FTA provides a structured, documented, and defensible basis for understanding risk and making decisions about risk reduction.
Fault tree analysis (FTA) is a specific risk assessment methodology that uses deductive logic (working backward from an undesired outcome to its causes) to model the failure behavior of complex systems. It is distinguished from other risk assessment techniques by its combination of top-down logical structure, visual representation, and the capacity for both qualitative and quantitative evaluation.
The most important distinction is between FTA and FMEA (Failure Mode and Effects Analysis). FMEA is inductive and bottom-up: it starts with individual component failure modes and traces their effects upward through the system. FTA is deductive and top-down: it starts with a defined undesired outcome and traces its causes downward. Neither technique is universally superior — they are complementary, and the most thorough risk assessments use both. FMEA excels at comprehensive coverage of component-level failures; FTA excels at modeling specific catastrophic scenarios and identifying dangerous failure combinations.
FTA also differs from HAZOP, which is primarily a qualitative hazard identification technique using structured guideword prompts to examine process deviations. HAZOP is broader in its coverage but shallower in its logical modeling. HAZOP is typically used earlier in the design process to identify hazard scenarios; FTA is used to model specific scenarios identified by HAZOP in greater analytical depth.
Finally, FTA differs from Event Tree Analysis (ETA), which models what happens after an initiating event — tracing the success or failure of protective responses. FTA addresses the causes of top events; ETA addresses their consequences. The Bowtie model, widely used in process safety, combines both: FTA on the left analyzing causes, ETA on the right analyzing consequences, with the top event as the bow knot connecting them.
An example of a fault tree analysis drawn from a common industrial setting illustrates both the method and its practical value.
Consider the analysis of an uncontrolled fire in a solvent storage room at a manufacturing facility. This is the top event. Working downward, the analyst identifies that an uncontrolled fire requires both an ignition source AND flammable vapor present at sufficient concentration to sustain combustion — an AND gate, since both conditions must be present simultaneously.
Developing the ignition source branch: an ignition source could be present due to a hot work activity being conducted in or near the room without proper permit controls, OR an electrical fault in non-explosion-rated equipment, OR static discharge during solvent transfer operations, OR an uncontrolled heat source such as a space heater. This branch uses an OR gate — any one source is sufficient to provide ignition.
Developing the flammable vapor branch: flammable vapor at sufficient concentration could be present due to a spill from a damaged container, OR a leak from a fitting or valve on the solvent distribution system, OR overfilling of a container during transfer operations, OR inadequate ventilation that allows vapor to accumulate to flammable concentration even from minor evaporative losses. Again, an OR gate.
The qualitative evaluation reveals several important minimum cut sets. The combination of inadequate ventilation AND a static discharge during transfer is a two-event cut set that deserves attention because both events are relatively plausible in isolation. Hot work without permit controls is potentially a single-event minimum cut set if the hot work itself both creates an ignition source and disturbs a container, releasing vapor, which is why it drives a strict hot work permitting requirement. These insights translate directly into specific, prioritized risk reduction recommendations.
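If rough probabilities are available, the two-event cut sets of the solvent room example can be ranked to help prioritize recommendations. Every number below is an invented placeholder for illustration. The sketch also treats each ignition source as independent of each vapor source, which the hot work case described above may violate.

```python
from itertools import product

# Illustrative annual probabilities only; real values would come from site data.
ignition = {
    "hot work without permit": 1e-2,
    "electrical fault": 5e-3,
    "static discharge": 2e-2,
    "uncontrolled heat source": 1e-3,
}
vapor = {
    "container spill": 1e-2,
    "fitting or valve leak": 5e-3,
    "overfill during transfer": 1e-2,
    "inadequate ventilation": 5e-2,
}

# AND of two OR branches: each cut set pairs one ignition source with one vapor source.
ranked = sorted(
    ((ignition[i] * vapor[v], i, v) for i, v in product(ignition, vapor)),
    reverse=True,
)

print(f"{len(ranked)} two-event cut sets")
for prob, i, v in ranked[:3]:
    print(f"{prob:.1e}  {i} AND {v}")
```

With these placeholder values, the pairing of static discharge with inadequate ventilation ranks first, matching the qualitative judgment that both events are plausible on their own.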
Calculating probability in fault tree analysis involves combining the failure probabilities of basic events through the logical rules of the gates that connect them. The calculation proceeds from the bottom of the tree upward, computing the probability of each intermediate event until the top event probability is reached.
The starting point is failure rate data for the basic events. This data may come from published reliability databases such as OREDA (Offshore and Onshore Reliability Data), IEEE 493, or MIL-HDBK-217, from equipment manufacturer specifications, from plant maintenance records, or from expert judgment. For each basic event, the probability of failure in the demand period of interest is estimated — typically using the formula P ≈ λT for low failure rates, where λ is the failure rate in failures per unit time and T is the exposure period.
For an OR gate with independent inputs, the output probability is P(A or B) = P(A) + P(B) − P(A) × P(B). For small probabilities, the last term is negligible and the output is approximately the sum of the input probabilities. For an AND gate, the output probability is P(A and B) = P(A) × P(B), again assuming the events are independent. This multiplicative relationship is what makes redundancy so effective: two independent systems each with a failure probability of 10⁻³ produce a combined AND-gate probability of 10⁻⁶.
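The bottom-up calculation can be written as a short recursive function. This sketch assumes every basic event is independent and appears only once in the tree. When a basic event is repeated in several branches, gate-by-gate multiplication gives the wrong answer and the calculation should work from the minimum cut sets instead. The tree and probabilities are hypothetical.

```python
import math


def probability(node, p):
    """Bottom-up probability of `node` for independent, non-repeated basic events.

    A node is a basic-event name (str) or a tuple ("OR" | "AND", [children]).
    """
    if isinstance(node, str):
        return p[node]
    kind, children = node
    child_probs = [probability(c, p) for c in children]
    if kind == "AND":
        return math.prod(child_probs)
    if kind == "OR":
        return 1.0 - math.prod(1.0 - q for q in child_probs)
    raise ValueError(kind)


# Hypothetical tree and values: TOP = (A OR B) AND (C OR D)
tree = ("AND", [("OR", ["A", "B"]), ("OR", ["C", "D"])])
p = {"A": 1e-3, "B": 2e-3, "C": 1e-3, "D": 1e-3}

print(f"top event probability = {probability(tree, p):.3e}")
```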
An important caveat is the independence assumption. AND gate calculations are only valid when input events are genuinely independent. When common cause failures exist — where a single cause could disable multiple supposedly independent elements simultaneously — the simple multiplicative calculation overestimates the protective value of redundancy. Beta factor and multiple Greek letter models are used to account for common cause failure in quantitative FTA.
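The effect of common cause failure on a redundant pair can be illustrated with the beta factor model in its commonly used simplified form. This form splits each channel's failure probability into an independent part and a shared part. The function name and the value β = 0.1 are assumptions for illustration only, and real β values have to be justified for the specific system.

```python
def redundant_pair_failure(p: float, beta: float) -> float:
    """Failure probability of a two-channel redundant pair, simplified beta factor model.

    p    : total failure probability of one channel
    beta : fraction of that probability attributed to a common cause
    """
    independent = ((1.0 - beta) * p) ** 2  # both channels fail independently
    common = beta * p                       # one shared cause fails both channels
    return independent + common


p = 1e-3
naive = redundant_pair_failure(p, beta=0.0)       # pure AND gate: p squared
with_ccf = redundant_pair_failure(p, beta=0.1)    # illustrative beta

print(f"independent only: {naive:.2e}")
print(f"with beta = 0.1:  {with_ccf:.2e}")
```

Even a modest common cause fraction dominates the result here, because the shared term scales with p while the independent term scales with p squared.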
Sensitivity analysis is always recommended alongside the central probability estimate. Given the uncertainty in failure rate data, it is important to understand how much the top event probability changes when input values are varied within plausible ranges. This provides decision-makers with a realistic picture of the precision of the risk estimate.
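One way to express that precision is to sample the inputs across their plausible ranges and report percentiles of the resulting top event probability. The sketch below assumes a two-event AND gate, a log-uniform spread of a factor of three around each point estimate, and a fixed random seed. All of these are illustrative choices, not recommendations.

```python
import random


def top_probability(p_a: float, p_b: float) -> float:
    """Hypothetical top event: A AND B, with independent basic events."""
    return p_a * p_b


def sample_log_uniform(rng: random.Random, centre: float, factor: float) -> float:
    """Draw a value between centre/factor and centre*factor, uniform in log space."""
    return centre * factor ** rng.uniform(-1.0, 1.0)


rng = random.Random(0)            # fixed seed so the run is repeatable
centre_a, centre_b = 1e-3, 1e-2   # illustrative point estimates
factor = 3.0                      # assumed plausible range: 3x either way

samples = sorted(
    top_probability(
        sample_log_uniform(rng, centre_a, factor),
        sample_log_uniform(rng, centre_b, factor),
    )
    for _ in range(10_000)
)
p05, p50, p95 = samples[500], samples[5000], samples[9500]

print(f"5th percentile:  {p05:.2e}")
print(f"median:          {p50:.2e}")
print(f"95th percentile: {p95:.2e}")
```

Reporting the spread alongside the central estimate makes it clear when a result that appears to meet an acceptance criterion does so only for favorable input values.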
Fault tree analysis is a powerful technique, but its value depends entirely on the quality of its execution. Several common mistakes undermine the reliability of FTA results and can lead to incorrect conclusions about system safety.
The most fundamental mistake is a poorly defined top event. When the top event is vague — "major accident" or "equipment failure" — the resulting tree is unfocused and unmanageable. A well-defined top event specifies exactly what has failed, under what conditions, and in what mode. Taking time to precisely define the top event before beginning tree construction pays dividends throughout the analysis.
Inconsistent level of resolution is another frequent error. Developing some branches of the tree to the level of individual component failures while leaving others at the level of subsystem failures produces a tree where minimum cut sets are incomplete and probability calculations are unreliable. The analyst should define the intended level of resolution at the outset and apply it consistently across all branches.
Incorrect gate selection — particularly confusing OR and AND gates — is perhaps the most consequential technical error. An AND gate tells you that all input events must occur simultaneously; an OR gate tells you that any one is sufficient. Swapping them produces a tree with the opposite logical structure to the actual system behavior. Each gate should be verified by asking explicitly: "Would this one event alone cause the output, or do multiple events need to occur together?"
Ignoring common cause failures in quantitative analysis is a systematic source of non-conservatism. Systems designed with apparent redundancy through AND gates may not be as robust as the basic probability calculation suggests if common infrastructure, maintenance practices, or environmental exposures could disable multiple elements simultaneously. Common cause failure analysis should be incorporated into every quantitative FTA involving redundant protective systems.
Finally, treating the fault tree as a static document rather than a living analytical tool is a missed opportunity. Systems change, operational experience accumulates, and failure rate data improves over time. Fault trees that are updated regularly to reflect current system configuration and operational history provide ongoing value; those that are filed away after initial preparation become unreliable and misleading.