Article 708, Critical Operations Power Systems – Some Existing Technologies to Assist in Complying

NEC-2008 has a new Article 708, Critical Operations Power Systems (COPS). COPS are defined as “power systems for facilities or parts of facilities that require continuous operation for the reasons of public safety, emergency management, national security, or business continuity.” The article also defines designated critical operations areas (DCOA) as “areas within a facility or site designated as requiring critical operations power.” According to the article, COPS are classified by municipal, state, federal, or other codes, by any governmental agency having jurisdiction, or by facility engineering documentation establishing the necessity for such a system.

The genesis of COPS was the need to harden emergency and standby power systems as part of homeland security. The recent terrorist attacks and Hurricane Katrina brought to light the need to reassess the protection and reliability of the electrical infrastructure. The NFPA Standards Council created Code-Making Panel 20 (CMP-20) in October 2005. The general direction was to look at the broad picture of issues that could impact the COPS, along the lines of NFPA 1600-2007, Standard on Disaster/Emergency Management and Business Continuity Programs, which provides additional guidance and is referenced in a number of places throughout the article.

Article 708

The original draft of what is now Article 708 was Article 585. The Technical Correlating Committee decided that it applied to a system (rather than to a special occupancy), so it is now in Chapter 7. Unlike Articles 700 and 701, which were developed primarily to allow for safe building egress and to aid firefighters, Article 708 is intended to maintain power continuity, keeping the COPS operational and reliable. This requires a significantly different approach from previous requirements for emergency systems, which were primarily intended to get people out of the building.

The original draft of Article 585 had a fine print note (FPN) that indicated the types of facilities CMP-20 thought could need COPS. Included on the list were air traffic control centers; fire and security system monitoring; hazardous material handling facilities; communication centers and telephone exchanges; emergency evacuation centers; financial, banking, and business data processing facilities; fuel supply pumping stations; hospitals; water and sewage treatment facilities; 911 centers; critical government facilities; police, fire and civil defense facilities, including power for radio repeater operations; radio and TV stations; and transportation infrastructure.

The Report on Proposals states that this FPN was deleted by CMP-20 “because of concern that it could be misapplied in making the determination on which facilities fall within the scope of this article.”

The obvious first challenge the authority having jurisdiction (AHJ) has in applying Article 708 is determining what facilities in their jurisdiction require COPS. Hopefully, the next revision of the NEC will provide better direction and guidance.

Meanwhile, this paper will focus on the second major challenge, which is “how to comply with Article 708” once a facility has been deemed to have the requirement for COPS.

Part I, General, of Article 708 has the following sections:

708.1 Scope
708.2 Definitions
708.3 Application of Other Articles
708.4 Risk Assessment
708.5 Physical Security
708.6 Testing and Maintenance
708.8 Commissioning

As can be seen above, Article 708 introduces new types of requirements not normally seen in other sections of the NEC. These include risk assessment, physical security, maintenance and commissioning. Though these may be new concepts to the NEC, they represent existing technologies and methodologies used frequently in other industries. Maintenance, though not normally addressed in the NEC, is covered in detail in NFPA 70B, Recommended Practice for Electrical Equipment Maintenance.

The other parts in Article 708 are similar to the rest of the NEC and include:

Circuit Wiring and Equipment
Power Sources and Connection
Overcurrent Protection
System Performance and Analysis

This paper will focus on risk assessment. It will show how reliability analysis has been and can be used to protect and increase the reliability of the electrical system infrastructure, not only of COPS but of any facility with mission-critical data centers. It will also direct the reader to references on this subject that provide additional information and guidance.

Section 708.4 Risk Assessment requires the following:

Risk Assessment for COPS shall be documented and shall be conducted in accordance with 708.4 (A) through (C).

(A) Conducting Risk Assessment – Identify hazards, the likelihood of their occurrence and the vulnerability of the electrical system to those hazards.

(B) Identification of Hazards – Minimum shall include, but shall not be limited to, the following:

(1) Naturally occurring hazards (geological, meteorological and biological)

(2) Human-caused events (accidental and intentional)

(C) Developing Mitigation Strategy – Based on the results of the risk assessment, a strategy shall be developed and implemented to mitigate hazards.

One of the significant requirements of NEC-2008 Article 708, COPS, is that of documenting and conducting a risk assessment.

In COPS the risk assessment is performed to identify hazards, the likelihood of their occurrence, and the vulnerability of the electrical system to those hazards.

This article will address how to measure (or perform) what one could call a vulnerability analysis for an electrical system.

Annex F of NEC-2008 provides some of the “how to” with availability and reliability information along with direction on implementation of Functional Performance Tests (FPT).

Availability and Reliability for COPS as defined in Annex F:

Availability – the percentage of time that a system is available to perform its function.

Reliability – the probability that an item can perform its intended function for a specified interval under stated conditions.

Maintainability – a measure of how quickly and economically failures can be prevented through preventive maintenance or system operation can be restored following failure.

Availability is calculated by:

Availability = MTBF / (MTBF + MTTR)

where:

Mean time between failures (MTBF) – the average time the equipment performed its intended function between failures.

Mean time to repair (MTTR) – the average time it takes to repair the failure and get the equipment back into service.

Availability does not specify how often an outage occurs. An availability of 0.99999 can mean 0.0876 hours (5.3 minutes) of downtime each year or a single outage of about 1.75 hours every 20 years. Availability is relatively time independent, since it depends upon two factors that are averages: MTBF and MTTR.
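The point that availability says nothing about outage frequency can be checked numerically. The following is a minimal sketch, assuming the usual steady-state relationship between availability, MTBF and MTTR. The two sets of MTBF and MTTR values are hypothetical, chosen only so that one system fails about once a year and the other about once every 20 years.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR (both in hours)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Average hours of downtime per year implied by an availability value."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical systems: one fails about once a year for a few minutes,
# the other about once every 20 years for well over an hour.
frequent_short = availability(mtbf_hours=8760.0, mttr_hours=0.0876)
rare_long = availability(mtbf_hours=20 * 8760.0, mttr_hours=20 * 0.0876)

print(f"frequent, short outages: {frequent_short:.7f}")
print(f"rare, long outages:      {rare_long:.7f}")
print(f"downtime at 0.99999:     {annual_downtime_hours(0.99999) * 60:.1f} minutes per year")
```

Both systems report essentially the same availability, even though an operator would experience them very differently.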

Reliability is the probability of successful operation and is a time dependent metric. For a system composed of components that have a relatively constant failure rate, such as most electrical distribution equipment, reliability is an exponentially decaying function of time. Therefore the longer the time interval, the lower the reliability regardless of what the system design is. The better the system design, the higher the probability of successful operation for a longer period of time. The reliability vs. time curve will be flatter for a well designed system than for a poorly designed one.
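For a constant failure rate, the exponential decay described above is commonly written as R(t) = exp(-λt), where λ is the failure rate. The sketch below assumes that model. The two system-level failure rates are hypothetical values used only to show how the curve flattens for a better design.

```python
import math


def reliability(failure_rate_per_year: float, years: float) -> float:
    """Probability of no failure over `years`, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_year * years)


# Hypothetical system-level failure rates in failures per year (illustrative only).
designs = {"well designed": 0.02, "poorly designed": 0.20}

for name, rate in designs.items():
    curve = {t: round(reliability(rate, t), 3) for t in (1, 5, 10)}
    print(f"{name:16s} {curve}")
```

In both cases reliability falls as the interval grows, but the lower failure rate gives a much flatter curve.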

Reliability and availability are both very important concepts. Reliability predicts how likely a system is to fail. Availability, on the other hand, is a measure (or a future prediction) of what percentage of the time the system will operate properly. The repair time is a significant factor in the determination of availability.

A necessary step in conducting a risk assessment is to determine or establish how much downtime of the electrical system is acceptable to the AHJ. The table in Annex F provides a correlation between the hours of downtime and the availability (in number of nines).
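The correlation between downtime and "nines" follows directly from the definition of availability. The sketch below generates a table of that kind. It is not a reproduction of the Annex F table, and the 8,760-hour year and the function names are assumptions made for illustration.

```python
HOURS_PER_YEAR = 8760  # assumed: 365-day year


def availability_from_nines(nines: int) -> float:
    """Availability expressed as a number of nines, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def required_availability(acceptable_downtime_hours_per_year: float) -> float:
    """Availability needed to stay within an acceptable annual downtime."""
    return 1.0 - acceptable_downtime_hours_per_year / HOURS_PER_YEAR


for nines in range(2, 7):
    avail = availability_from_nines(nines)
    downtime = (1.0 - avail) * HOURS_PER_YEAR
    print(f"{nines} nines  {avail:.6f}  {downtime:9.4f} hours of downtime per year")

# Example: an AHJ that accepts at most one hour of downtime per year.
print(f"required availability: {required_availability(1.0):.6f}")
```

Working backward from the acceptable downtime gives the availability target that the rest of the analysis has to meet.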

Annex F of NEC-2008 also provides direction on how to improve the availability of a COPS, both for existing facilities and new facilities. The methodology includes the reliability analysis of evaluating possible failures of the system by conducting failure modes and effects criticality analysis (FMECA) and/or a fault tree analysis (FTA).

The Army Corps of Engineers Power Reliability Enhancement Program (PREP) Technical Manual TM 5-698-4, Failure Modes and Effects Criticality Analysis (FMECA) for Command, Control, Communications, Computer, Intelligence, Surveillance, and Reconnaissance (C4ISR) Facilities, 29 September 2006, provides the necessary information on how to conduct a FMECA.

Figure 1. Fault Tree Symbols: OR, AND and Initiating Event

Another method that Annex F mentions to perform a risk assessment is called fault tree analysis. The event that is to be investigated is at the top of the “tree” and below it are listed items that can cause the top event to occur. The fault tree uses logic symbols, such as OR gates (in which either event can cause a failure), AND gates (which require both events for the failure) and Initiating Events (which are events like equipment failures, to be evaluated). [See figure 1]
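The gate logic can be expressed in a few lines. This is a minimal sketch, assuming the input events are statistically independent. The event names and probabilities are hypothetical and do not come from the figures.

```python
from math import prod


def or_gate(probabilities):
    """Output event occurs if any input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


# Hypothetical probabilities of three initiating events over the period studied.
event_a = 0.5
event_b = 0.02
event_c = 0.01

# The top event occurs if (A AND B) occurs, OR if C occurs.
intermediate = and_gate([event_a, event_b])
top_event = or_gate([intermediate, event_c])

print(f"A AND B:   {intermediate:.4f}")
print(f"top event: {top_event:.4f}")
```

If the events share a common cause, the independence assumption no longer holds and these simple products overstate or understate the result.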

Figure 2. Fault Tree for COPS

Figure 2 shows the first part of a fault tree for COPS that can be used to perform a risk assessment. The hazards listed in Section 708.4 have been diagrammed, with the addition of the “equipment failure” block. Equipment failure could also be included under “human-caused events.” However, it is an item that can be addressed with reliability analysis much more easily than other types of human-caused events, such as operator error.

Figure 3. Fault Tree showing the Initiating Events to be evaluated for Naturally Occurring Hazards

Below each of the major hazards, additional parts of the fault tree would be included. Figure 3 shows the expansion of the “naturally occurring hazards” block. This diagram shows all of the Initiating Events that are to be evaluated in the fault tree. The rate at which past failures have occurred (failure rate, FR) for each of the Initiating Events is listed in that figure. For example, assume an earthquake of a magnitude sufficient to put the COPS out of operation has occurred in the area once in the last 58 years. The failure rate for that occurrence would be 1/58, or 0.017 failures / year. Once all of the failure rates have been determined, the fault tree can be calculated and the reliability and availability of the COPS determined.
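The failure-rate arithmetic can be sketched as follows. Only the earthquake figure (once in 58 years) comes from the example above. The other two hazards and their counts are hypothetical. The sketch also assumes that the hazards are independent with constant failure rates, so that under an OR gate the rates simply add.

```python
import math


def failure_rate(occurrences: int, years_observed: float) -> float:
    """Historical failure rate in failures per year."""
    return occurrences / years_observed


rates = {
    "earthquake": failure_rate(1, 58),  # from the example in the text
    "flood": failure_rate(1, 100),      # hypothetical
    "ice storm": failure_rate(2, 50),   # hypothetical
}

# OR gate over independent, constant-rate events: the rates add.
combined_rate = sum(rates.values())
reliability_one_year = math.exp(-combined_rate * 1.0)

for hazard, rate in rates.items():
    print(f"{hazard:10s} {rate:.3f} failures / year")
print(f"combined   {combined_rate:.3f} failures / year")
print(f"probability of no hazard-induced outage in one year: {reliability_one_year:.3f}")
```

Determining availability as well would require a restoration time for each hazard, which this sketch does not attempt to estimate.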

Figure 4. Electrical One-line diagram

Obtaining a failure rate for some of the items would be relatively easy, since weather has been documented for over 100 years. However, for some of the items, such as an infectious agent, it would be much more difficult to determine a failure rate. In areas where failure data is not available, direct reliability and availability calculations cannot be performed. Therefore, mitigation strategies, such as preventing access to the COPS, will have to address these types of issues without the assistance of reliability analysis.

Reliability analysis can be done by several different methods. Fault tree analysis was introduced above as one such method. It is quite effective in analyzing a system in which a number of factors that are relatively independent of each other can cause a system failure.

The IEEE Gold Book (IEEE Standard 493-2007, Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems) also provides a great deal of information on performing risk assessments. A method used in the IEEE Gold Book to analyze the reliability and availability of critical electrical power systems is reliability block diagrams (RBD). RBD is an effective method to analyze systems with many items that are interrelated, such as an electrical distribution system.

Figure 5. RBD of One-line in figure 4

A simple electrical system that could be used for COPS would consist of utility power and a generator supplying an uninterruptible power supply (UPS) system. The one-line and the associated RBD are shown in figures 4 and 5.
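An RBD of this kind reduces to series and parallel combinations. The sketch below is a simplified illustration and is not the calculation behind the figures. It treats the utility and the generator as two independent parallel paths with perfect transfer between them, places the UPS in series, and uses hypothetical component availabilities.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All blocks must work: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one block must work (independent blocks)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical component availabilities (illustrative only).
utility = 0.999
generator = 0.99
ups = 0.9999

power_sources = parallel(utility, generator)
cops = series(power_sources, ups)

print(f"utility or generator: {power_sources:.6f}")
print(f"complete system:      {cops:.6f}")
```

A fuller model would add blocks for transfer switches, breakers and cabling, and would account for the chance that the generator fails to start. Note that the series UPS block caps the result: the system can never be more available than its least available series element.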

The reliability (probability of success) at the end of one year is much better than at the end of five years.

For the COPS, reliability and availability are very important. It must operate for long periods of time, providing power to systems that perform critical functions. However, it should be noted that even with the best engineering design and technology it may be economically impractical and technically impossible to design a COPS that will never fail over a long period of operation.

Forced outages (FA) may and do occur. When they do, restoring the COPS to operation as quickly and economically as possible is very important. Thus the maintainability characteristics of the COPS will predict how quickly and economically it will be restored to normal operation. That is why reliability, availability and maintainability (RAM) are considered complementary characteristics.

The risk assessment method chosen to analyze a COPS should be appropriate for the system in question and require only a reasonable level of investment given the value of the results. The failure of some components may have little impact on either system function or operating and repair costs. In such cases a relatively costly analysis may not be justified. When the consequences of failure are catastrophic, however, every possible effort should be made to make the COPS fail-safe.

Figure 6. Reliability and Availability calculations for figures 4 and 5

NFPA recognized both the need and the means to carry the best practices of the business continuity industry into public-sector emergency preparedness. Article 708, and its optional supporting material, looks a lot like performance-based design, something the building safety community still tends to keep at a distance. The science involved in developing a COPS is at least as sophisticated as the multi-disciplinary science advanced by the Society of Fire Protection Engineers and described in chapter 5 of the Life Safety Code (NFPA 101), the Uniform Fire Code (NFPA 1) or the Building Construction and Safety Code (NFPA 5000).

Looking forward to the next revision cycle of the NEC, a logical next step would be to define different levels of COPS to match the different levels of criticality of the various types of facilities. For example, a facility required to provide emergency communication across a large area, such as a 911 call center, would be more important to public safety than an individual police or fire station. Therefore the 911 call center should have more robust COPS than what would be necessary for the individual police or fire station.

Based on the discussion above concerning risk assessment, it would be a major step forward to specify the required reliability and availability for each of these levels as part of the definition. That would assist both the AHJ and the engineer of record for the COPS in fulfilling the intent of Article 708 and providing an appropriate system that is operational when needed.