Operating for safety and reliability: Lessons from military aviation

Operators are central to safe and reliable equipment. Few would dispute this statement. There is little substance behind this consensus, however, when it comes to specifics: what operators’ role in reliability actually is and how that role relates to safety. What is more, many in the process industries are concerned about the experience and capabilities of a young workforce as retirements steadily drain operations knowledge from their organizations. There is great need and great opportunity to define a new operate-for-reliability culture in most process plants. Military aviation provides a relevant case study for organizations that want to build this culture in a young, inexperienced workforce.

Before coming to asset performance management consulting, I was a U.S. Marine Corps KC-130 pilot, maintenance officer, operations officer and commander. The crew that operated our aircraft included two pilots and two to four enlisted aircrew for most missions. Most of these aircrew had no more than three years of experience operating their aircraft systems. The instructors who trained these young crews, both in dedicated schools and on the job, were more experienced, but on average were still no older than their mid-twenties. These young aircrews practiced a very effective model of operating for reliability and safety that can be replicated in the process industries. First, I will give some examples of what this operating model looked like, then I will describe the pillars of training, procedures and culture used to build and sustain it.

Preparing for flight

Before we stepped to the aircraft, each crewmember read through the Aircraft Discrepancy Book, which contained all of the recent work orders for the aircraft, both completed and still open. While the “book” still exists in paper form due to the documentation requirements for flight, many now read it electronically via the computerized maintenance management system (CMMS). Knowing the maintenance status of the aircraft gave us critical information about what to look for during our preflight inspections and what to expect during operations.

Next, we each conducted our own detailed preflight inspections of aircraft systems. We would take an especially close look at any equipment that had been recently repaired to ensure that the job had been completed properly, looking for leaks, unsecured panels and other indications of trouble. If we found any issues, we would request maintenance support to address them before start-up. Once all systems were confirmed to be in proper configuration, the crew chief would sign off on the preflight inspection, a maintenance official would sign to certify that the aircraft was safe for flight, and then the pilot in command would sign to certify that he or she had reviewed all paperwork, inspected the aircraft and accepted it as safe for flight. While this paperwork process would be overkill for many plant systems, some form of partnership and mutually involved handover of safe and reliable equipment from maintenance to operations is a practice that pays huge dividends in safety and reliability.

Start-up: (One of) the most violent things you can do to equipment

After the preflight inspection, we followed a series of checklists to confirm one last time that all systems were configured properly for start-up and the mission ahead. Only then was it time to start engines and associated systems. This is one of the most violent things that can happen to equipment. Starting or activating most systems involves rapid changes in temperatures, pressures, speeds and related component stresses. Because of these high and rapidly changing stresses, as well as the often-unique operating modes of systems during start-up, the potential for failure is often highest at this point. Aircraft start is not a simple push of a button. Aircrews study the start sequence extensively and follow instrumentation carefully to ensure that all systems, such as lubrication, hydraulics and temperature control, are operating properly. Vigilance, systems knowledge and a few seconds during this very short window can be the difference between pre-emptively stopping the start to conduct a very minor component reset on one hand, and continuing to the destruction of a very costly turbine engine on the other.

In industry, start-up of certain hazardous processes is very well controlled; however, not all equipment starts are carefully monitored. Surprisingly few organizations use simple, proven concepts such as checklists and a cultural commitment to start-up vigilance, to their great detriment. Anyone who has spent time in a plant knows of many start-up failure examples both small (a pump that cavitated because valves were not lined up properly) and large (a turnaround budget and schedule cratered by an avoidable mishap on restart). Starting equipment, especially after maintenance, is a situation ripe for avoidable error.

Likewise, I have heard far too many people say, “They’re professionals. They don’t need a checklist to tell them what to do.” I still remember much of what was on my checklists a decade after my last flight in a KC-130, but we nonetheless used them every time because professionals know that they can be distracted or otherwise commit errors. Checklists have been accepted by professionals from pilots to brain surgeons as a tool that makes them better. Plant operators would do well to follow suit more regularly.

Operating within limits

Once in the air, we operated according to well-defined limits that we had no excuse to ignore. We had to memorize the most critical limits and were tested on them regularly to maintain flight status. In addition, these limits were marked with placards and color indications on gauges, and they were often alarmed. These limits included things such as maximum continuous engine temperatures, operating weights and G-forces. We knew that these limits were meant to protect the reliability of aircraft systems and structures so that we could depend on their safe operation throughout their service life, not only on the flight today, but on flights in years to come.

In contrast, I have repeatedly seen equipment in process plants run too hot, too fast, under too much load and in seriously degraded condition, with little apparent awareness of the short- and long-term consequences. In too many plants, there is a culture that values today’s production to such an extent that the detrimental effects of operating beyond sustainable limits are not so much ignored as never considered at all. Plant operators who do not consider such limits often put the resulting failures down to poor maintenance. To make up for the lost production, they resolve to whip the ponies even harder once the equipment is repaired, perpetuating the cycle. To use another military example, even machine gunners (not known to be the model of restraint) have the maximum sustained rate of fire drilled into them. They know that production is not sustainable and cannot be counted upon beyond that limit. Operations departments that have adopted equipment operating envelopes are often quite amazed by the results in improved reliability and higher overall production.
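As an illustration only, an equipment operating envelope can be written down as data and checked against live readings. The sketch below is a minimal one in Python; the tag names, the two-tier limit structure and every number in it are hypothetical and are not taken from any real aircraft or plant. Actual limits come from the equipment manufacturer and the site's own engineering standards.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Upper operating limits for one measured parameter (hypothetical)."""
    name: str
    max_continuous: float  # sustainable indefinitely at or below this value
    max_absolute: float    # never to be exceeded


def classify(limit: Limit, reading: float) -> str:
    """Return 'normal', 'above_continuous' or 'exceeded' for one reading."""
    if reading > limit.max_absolute:
        return "exceeded"
    if reading > limit.max_continuous:
        return "above_continuous"
    return "normal"


def check_envelope(limits: dict[str, Limit],
                   readings: dict[str, float]) -> dict[str, str]:
    """Classify every reading that has a defined limit; ignore the rest."""
    return {
        tag: classify(limits[tag], value)
        for tag, value in readings.items()
        if tag in limits
    }


# Illustrative numbers only, chosen for the example.
LIMITS = {
    "bearing_temp_c": Limit("Pump bearing temperature", 80.0, 95.0),
    "motor_load_pct": Limit("Motor load", 100.0, 115.0),
}

if __name__ == "__main__":
    status = check_envelope(
        LIMITS, {"bearing_temp_c": 88.0, "motor_load_pct": 97.0}
    )
    print(status)
    # {'bearing_temp_c': 'above_continuous', 'motor_load_pct': 'normal'}
```

Separating a continuous limit from an absolute one mirrors the machine gunner's sustained rate of fire: a reading can sit above what is sustainable well before it reaches a hard stop, and that middle band is where reliability is quietly spent.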

Putting the equipment to bed

Once the flight is over, aircrew conduct a detailed post-flight inspection to catch any issues — leaks, damage, etc. — that occurred during the flight. All equipment discrepancies discovered during or after the flight are written up in work requests, known in the jargon as “gripes.” It is a point of pride for most aircrew, backed by cultural expectations, to have the systems knowledge and the troubleshooting ability to write these gripes up clearly for maintenance. In addition, there are a number of other actions taken to “put the aircraft to bed,” such as resetting system configurations, conducting general housekeeping, putting in plugs, caps and other protective equipment, and generally leaving the aircraft in the most reliable and protected state for the next crew.

The pillars of safe and reliable operations

In summary, aircrew have the systems knowledge, the procedures and the culture required to operate their equipment in the safest and most reliable manner. These elements are wholly repeatable in any process plant. As I stated earlier, military aircrew have on average only a few years of experience operating their equipment. While military aircrew training is extensive, it is important to note that this training assumes zero previous technical training or job experience. Process plants can expect at least some technical training and can often bring in operators with some previous work experience to build upon. Furthermore, systems and systems operation training is only a small portion of overall military aircrew training. Much more time is focused on aspects that are not required in the process industries, such as survival skills, weapons and tactics, and conduct of flight duties. The point is that improved systems training is an attainable and critical pillar of safe and reliable operations in the process industries.

For military aircrew, classroom instruction time is far outweighed by continuous, formalized on-the-job training (OJT). Aircrew are always learning and there are systems in place, such as OJT instructor designations and formal tracking of OJT experiences and progress, to get the most out of continued learning. Not only does a formalized OJT structure ensure that your operators will be getting ongoing training and mentorship from properly qualified personnel, but it also instils a culture and expectation that people need to stay curious and keep learning — and that more experienced operators are expected to pass their knowledge on.

Even with improved systems knowledge and ongoing training, procedures and checklists are a basic requirement for safe and reliable operations. Given the myriad distractions and complications in a plant, along with the fact that operators are responsible for a wide variety of equipment, better procedures and checklists should be far more widespread than they are in the industry. They don’t tell operators how to do their jobs. They help operators to remember key details, values and sequences, while also providing a place-keeping tool in the face of workplace distractions and pauses.
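The place-keeping role of a checklist can be made concrete with a small sketch. The Python below is an assumption-laden illustration, not a description of any real checklist system: the class, its methods and the pump start steps are all hypothetical. It shows only the two properties described above, namely that steps are confirmed in sequence and that the list remembers where the operator left off after an interruption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Checklist:
    """An ordered checklist that remembers how many steps are confirmed."""
    title: str
    steps: list[str]
    completed: int = 0  # number of steps confirmed so far

    def current_step(self) -> Optional[str]:
        """Return the next unconfirmed step, or None when finished."""
        if self.completed < len(self.steps):
            return self.steps[self.completed]
        return None

    def confirm(self, step: str) -> None:
        """Confirm the current step; reject skipped or repeated steps."""
        expected = self.current_step()
        if expected is None:
            raise ValueError("checklist already complete")
        if step != expected:
            raise ValueError(
                f"out of sequence: expected {expected!r}, got {step!r}"
            )
        self.completed += 1

    @property
    def is_complete(self) -> bool:
        return self.completed == len(self.steps)


if __name__ == "__main__":
    # Hypothetical steps, for illustration only.
    pump_start = Checklist(
        "Pump start",
        ["Suction valve open", "Lube oil level checked", "Start motor"],
    )
    pump_start.confirm("Suction valve open")
    # After a distraction, the operator resumes from the recorded place.
    print(pump_start.current_step())  # Lube oil level checked
```

The checklist does not tell the operator how to open a valve. It only holds the sequence and the current position, which is the part that memory tends to lose under distraction.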

This brings us to a final point. In order to be successful, all of these elements (training, procedures and checklists) must exist in a culture that creates an understanding of their value to safe and reliable operations. These elements are mutually reinforcing and individually unsustainable. Ongoing training and systems knowledge build a culture of safe and reliable operations. In turn, that culture grows to value and continuously improve training and knowledge. Using good procedures and checklists will make operators see their value, which will in turn create a culture where using these tools becomes “the way we do things around here.” Taken together, these things will involve operators much more directly in understanding and protecting the health of their equipment.

In most minds, reliability is a maintenance thing, which is why most reliability efforts make little headway outside of the conduct of maintenance. This mindset is the single biggest obstacle to unlocking significant reliability gains in many organizations. Operations, the owners and operators of a plant’s equipment, have the most to gain from high reliability and the most to lose from the lack thereof. Systems training, good procedures and checklists, and a culture of operating for safety and reliability are the pillars that will support sustainable improvements in safe and reliable operations in the process industries.

Peter J. Munson is a Senior Manager and the Global Expertise Lead for Maintenance at T.A. Cook. Previously, he was a career KC-130 pilot, holding maintenance and operations leadership roles. He is a graduate of the Naval Aviation Maintenance Managers Course and the Safety Commanders Course. He is a Certified Reliability Engineer and Maintenance and Reliability Professional.