
9 principles of a modern preventive maintenance program

WRITTEN BY: Erik Hupje

Whether you are developing a new maintenance program or improving the maintenance program for an existing plant, all reliable maintenance programs should be based on the following Principles of Modern Maintenance:

Principle #1: Accept Failures

Principle #2: Most Failures Are Not Age-Related

Principle #3: Some Failures Matter More Than Others

Principle #4: Parts Might Wear Out, But Your Equipment Breaks Down

Principle #5: Hidden Failures Must Be Found

Principle #6: Identical Equipment Does Not Mean Identical Maintenance

Principle #7: “You Can’t Maintain Your Way To Reliability”

Principle #8: Good Maintenance Programs Don’t Waste Your Resources

Principle #9: Good Maintenance Programs Become Better Maintenance Programs

As a Maintenance & Reliability professional, you must understand these principles.

You must practice them.

Principle #1: Accept failures

Not all failures can be prevented by maintenance. Some failures are the result of events outside our control. Think lightning strikes or flooding. For events like these, more or better maintenance makes no difference. Instead, the consequences of events like these should be mitigated through design.

And maintenance can do little about failures that are the result of poor design, lousy construction or bad procurement decisions.

In other cases the impact of the failure is low so you simply accept a failure (think general area lighting).

So, good maintenance programs do not try to prevent all failures. Good maintenance plans and programs accept some level of failures and are prepared to deal with the failures they accept (and deem credible).

Principle #2: Most failures are not age-related

Research by the airline industry has shown that 70% – 90% of failure modes are not age-related. Instead, for most failure modes the likelihood of occurrence is random. Later research by the United States Navy and others found very similar results.

This research is summarised in the six different failure patterns shown below:

Apart from showing that most failure modes occur randomly, these failure patterns also highlight that infant mortality is common and that it typically persists. That means that the probability of failure only becomes constant after a significant amount of time in service.

Don’t interpret Curves D, E, and F to mean that (some) items never degrade or wear out. Everything degrades with time, that’s life. But many items degrade so slowly that wear-out is not a practical concern. These items do not reach the wear-out zone in normal operating life.

So what do these patterns tell us about our reliable maintenance programs?

Historically maintenance was done in the belief that the likelihood of failure increased over time (first generation maintenance thinking). It was thought that well-timed maintenance could reduce the likelihood of failure. Turns out that for at least 70% of equipment this simply is not the case.

For the 70% of equipment which has a constant probability of failure, there is no point in doing time-based life-renewal tasks like servicing or replacement.

It makes no sense to spend maintenance resources to service or replace an item whose reliability has not degraded. Or whose reliability cannot be improved by that maintenance task.

In practice, this means that 70% – 90% of equipment would benefit from some form of condition monitoring. And only 10% – 30% can be effectively managed by time-based replacement or overhaul.

Yet most of our PM programs are full of time-based replacements and overhauls.
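
To see why time-based replacement does nothing for an item with a constant probability of failure, here is a minimal Python sketch. It assumes a Weibull life model, which is an assumption made for illustration and not something the research above prescribes. The scale, ages and look-ahead window are hypothetical numbers. A shape of 1 gives a constant failure rate, while a shape of 3 stands in for an item that wears out.

```python
import math


def weibull_survival(t, shape, scale):
    """Probability that an item survives beyond age t (assumed Weibull model)."""
    return math.exp(-((t / scale) ** shape))


def prob_fail_next_interval(age, interval, shape, scale):
    """Probability that an item which has survived to `age` fails within the next `interval`."""
    survive_to_age = weibull_survival(age, shape, scale)
    survive_to_end = weibull_survival(age + interval, shape, scale)
    return 1.0 - survive_to_end / survive_to_age


# Hypothetical parameters, chosen only to illustrate the principle.
SCALE = 10_000.0     # characteristic life in operating hours
INTERVAL = 1_000.0   # look-ahead window in operating hours
OLD_AGE = 8_000.0    # age of the item we are thinking of replacing

for label, shape in [("random (shape = 1)", 1.0), ("wear-out (shape = 3)", 3.0)]:
    new_item = prob_fail_next_interval(0.0, INTERVAL, shape, SCALE)
    old_item = prob_fail_next_interval(OLD_AGE, INTERVAL, shape, SCALE)
    print(f"{label}: new item = {new_item:.3f}, aged item = {old_item:.3f}")
```

With a shape of 1, the new item and the aged item have the same chance of failing in the next 1,000 hours, so swapping old for new buys nothing. With a shape of 3, the aged item is far more likely to fail, which is the only case where time-based replacement can pay off. The sketch leaves out infant mortality, which would make the new item worse, not better.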

Principle #3: Some failures matter more than others

When deciding on whether to do a maintenance task consider the consequence of not doing it. What would be the consequence of letting that specific failure mode occur?

Avoiding that consequence is the benefit of your maintenance.

The return on your investment.

And that is exactly how maintenance should be seen: as an investment. You incur a maintenance cost in return for a benefit in sustained safety and reliability. And as with all good investments, the benefit should outweigh the original investment.

So, understanding the consequences of failures is key to developing a good maintenance program. One with a good return on investment.

Just as not all failures have the same probability, not all failures have the same consequence.

Even if it relates to the same type of equipment.

Consider a leaking tank. The consequence of a leaking tank is severe if the tank contains a highly flammable liquid. But if the tank is full of potable water the consequence might not be of great concern.

Easy, right?

But what if the water is required for fire fighting?

The same tank, the same failure but now we might be more concerned. We would not want to end up in a scenario of not being able to fight a fire because we had an empty tank due to a leak.

Apart from the consequence of a failure you also need to think about the likelihood of the failure actually occurring.

Maintenance tasks should be developed for dominant failure modes only: those failures that occur frequently, and those that are infrequent or even rare but have serious consequences. Avoid assigning maintenance to non-credible failure modes. And avoid analyzing non-credible failure modes. It eats up your scarce resources for no return.

A maintenance program should consider both the consequence and the likelihood of failures. And since Risk = Likelihood x Consequence we can conclude that good maintenance programs are risk-based.

Good maintenance programs use the concept of risk to assess where to use our scarce resources to get the greatest benefit. The biggest return on our investment.
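
As a rough illustration of Risk = Likelihood x Consequence, the sketch below ranks a handful of failure modes so the highest-risk ones get attention first. The failure modes echo the examples in this article, but every likelihood and cost figure is a made-up number, and the `FailureMode` class and `rank_by_risk` function are hypothetical names used only here.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    likelihood_per_year: float  # expected occurrences per year (hypothetical)
    consequence_cost: float     # cost per occurrence, arbitrary units (hypothetical)

    @property
    def risk(self) -> float:
        # Risk = Likelihood x Consequence
        return self.likelihood_per_year * self.consequence_cost


def rank_by_risk(modes):
    """Return the failure modes sorted from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


modes = [
    FailureMode("Potable water tank leak", 0.05, 2_000),
    FailureMode("Area lighting lamp failure", 2.0, 20),
    FailureMode("Flammable liquid tank leak", 0.01, 5_000_000),
    FailureMode("Fire water tank leak", 0.05, 500_000),
]

for mode in rank_by_risk(modes):
    print(f"{mode.name}: risk = {mode.risk:,.0f} per year")
```

Note how the lamp fails far more often than any tank leaks, yet ends up at the bottom of the list. Frequency alone does not tell you where to spend your resources.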

Principle #4: Parts might wear out, but your equipment breaks down

A ‘part’ is usually a simple component, something that has relatively few failure modes. Some examples are the timing belt in a car, the roller bearing on a drive shaft, the cable on a crane.

Simple items often provide early signals of potential failure, if you know where to look. And so we can often design a task to detect potential failure early on and take action prior to failure.

For those simple items which do “wear out” there will be a strong increase in the probability of failure past a certain age. If we know the typical wear-out age for a component, we can schedule a time-based task to replace it before failure.

When it comes to complex items made of many “simple” components, things are different.

All those simple components have their own failure modes, each with its own failure pattern. Because complex items have so many, varied failure modes, they typically do not exhibit a wear-out age. Their failures do not tend to be a function of age but occur randomly. Their probability of failure is generally constant as represented by curves E and F.

Most modern machinery consists of many components and should be treated as complex items. That means no clear wear-out age. And without a clear wear-out age, performing time-based overhauls is ineffective. And wasteful of our scarce resources.

Only where we can prove that an item has a wear-out age does performing time-based overhaul or component replacement make sense.

Principle #5: Hidden failures must be found

Hidden failures are failures that remain undetected during normal operation. They only become evident when you need the item to work (failure on demand). Or when you conduct a test to reveal the failure – a failure finding task.

Hidden failures are often associated with equipment with protective functions. Something like a high-high pressure trip. Protective functions like these are not normally active. They are only required to function by exception: to protect your people from injury or death, to protect the environment from a major impact, or to protect our assets from major damage. This means we pretty much always conduct failure finding tasks on equipment with protective functions.

To be clear, a failure finding task does not prevent failure. Instead, a failure finding task does exactly what its name implies. It seeks to find a failure. A failure that has already happened, but has not been revealed to us. It has remained hidden.

We must find hidden failures and fix them before the equipment is required to operate.
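
How often should you test? One common way to reason about it is sketched below. It is not taken from this article and rests on several assumptions: the hidden failure occurs at a constant rate, the test always reveals it, and the repair is immediate. Under those assumptions the average unavailability of the protective function is roughly the test interval divided by twice the mean time between failures. The function names, the 20-year MTBF and the 1% target are hypothetical.

```python
import math


def average_unavailability(test_interval, mtbf):
    """Average fraction of time a hidden function is in a failed state.

    Assumes a constant failure rate of 1 / mtbf, a perfect test every
    `test_interval`, and immediate repair when a failure is found.
    """
    if test_interval <= 0 or mtbf <= 0:
        raise ValueError("test_interval and mtbf must be positive")
    x = test_interval / mtbf
    # Mean of the failure probability 1 - exp(-t / mtbf) over one test interval.
    return 1.0 + math.expm1(-x) / x


def failure_finding_interval(target_unavailability, mtbf):
    """Approximate test interval for a target unavailability.

    Uses the approximation unavailability ~ interval / (2 * mtbf),
    which only holds when the interval is short compared with the MTBF.
    """
    return 2.0 * target_unavailability * mtbf


MTBF_YEARS = 20.0  # hypothetical mean time between hidden failures
TARGET = 0.01      # hypothetical tolerable unavailability (1%)

interval = failure_finding_interval(TARGET, MTBF_YEARS)
achieved = average_unavailability(interval, MTBF_YEARS)
print(f"Test every {interval:.2f} years -> average unavailability {achieved:.4f}")
```

The second function is only an approximation, so the first is there to check that the interval it suggests actually lands close to the target.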

Principle #6: Identical equipment does not mean identical maintenance

Just because two pieces of equipment are the same doesn’t mean they need the same maintenance. In fact, they may need completely different maintenance tasks.

The classic example is two identical pumps in a duty-standby setup. Same manufacturer, same model. Both pumps process the exact same fluid under the same operating conditions. But Pump A is the duty pump, and Pump B is the standby. Pump A normally runs and Pump B is only used when Pump A fails.

When it comes to failure modes Pump B has an important hidden failure mode: it might not start on demand. In other words, when Pump A fails or is under maintenance, you suddenly find that Pump B won’t start. Oops.

Pump B doesn’t normally run so you wouldn’t know it couldn’t start until you came to start it. That’s the classic definition of a hidden failure mode. And hidden failure modes like this require a failure finding task, i.e. you go and test to see if Pump B will start. But you don’t need to do this for Pump A because it’s always running (unless it’s off or failed).

So when building a maintenance program you must consider the operating context.
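
The pump example can be written down as a tiny sketch in which the task list depends on the role of the pump, not on its make or model. The `pump_tasks` function is a hypothetical name, and the condition monitoring task for the duty pump is an assumed example; only the start test for the standby pump comes from the discussion above.

```python
def pump_tasks(role):
    """Return an illustrative task list for a pump, given its operating role."""
    if role == "duty":
        # A running pump reveals a failure to run by itself, so no start test is needed.
        # The monitoring task below is an assumed example, not a recommendation.
        return ["Monitor condition while running"]
    if role == "standby":
        # Failure to start is hidden until the pump is called upon.
        return ["Failure-finding task: test start on demand"]
    raise ValueError(f"unknown role: {role!r}")


# Same manufacturer, same model, different operating context.
pumps = {"Pump A": "duty", "Pump B": "standby"}

for name, role in pumps.items():
    print(f"{name} ({role}): {pump_tasks(role)}")
```

If your task library is keyed on equipment type alone, both pumps get the same list and the start test for Pump B is easily missed.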

A difference in criticality can also lead to different maintenance needs. Safety or production critical equipment will need more monitoring and testing than the same equipment in low criticality service.

It’s important to reinforce that identical equipment may have different maintenance requirements. This is far too often forgotten or simply ignored for convenience. But you could find yourself facing critical failures by ignoring this basic concept. Especially if you use a library of preventive maintenance tasks.

Principle #7: “You can’t maintain your way to reliability”

I love this quote from Terrence O’Hanlon and it’s so very true. Maintenance can only preserve your equipment’s inherent design reliability and performance.

If the equipment’s inherent reliability or performance is poor, doing more maintenance will not help.

No amount of maintenance can raise the inherent reliability of a design.

To improve poor reliability or performance that’s due to poor design, you need to change the design. Simple.

When you encounter failures – defects – that relate to design issues you need to eliminate them.

Sure, the more proactive and more efficient approach is to ensure that the design is right to begin with. But all plants start up with design defects. Even proactive plants. And that’s why the most reliable plants in the world have an effective defect elimination program in place.

Principle #8: Good maintenance programs don’t waste your resources

This seems obvious, right? But when we review PM programs we often find tasks that add no value. Tasks that waste resources and actually reduce reliability and availability.

It’s so common for people to say “whilst we do this, let’s also check this. It only takes 5 minutes.”

But 5 minutes here and there, every week or every month and we’ve suddenly wasted a lot of time. And potentially introduced a lot of defects that can impact equipment reliability down the line.

Another source of waste in our PM programs is trying to maintain a level of performance and functionality that we don’t actually need.

Equipment is often designed to do more than what it is required to do in its actual operating conditions. As maintainers, we should be very careful about maintaining to design capabilities. Instead, in most cases, we should maintain our equipment to deliver to operating requirements. Maintenance done to ensure equipment capacity greater than actually needed is a waste of resources.

Similarly, avoid assigning multiple tasks to a single failure mode. It’s wasteful and it makes it hard to determine which task is actually effective. Stick to the rule of a single, effective task per failure mode as much as you can. Only for very high consequence failure modes should you consider having multiple, diverse tasks for a single failure mode.

Most organisations have more maintenance to do than resources to do it with. Use resources on unnecessary maintenance, and you risk not completing necessary maintenance. And not completing necessary maintenance, or completing it late, increases the risk of failures.

And when that unnecessary maintenance is intrusive it gets worse. Experience shows that intrusive maintenance leads to increased failures. This could be because of human error such as simple mistakes. Or because of defective materials or parts, or errors in technical documentation.

A lot of maintenance is done with the equipment off-line. So doing unnecessary maintenance can also increase production losses.

So make sure you remove unnecessary maintenance from your system. Make sure you have a clear and legitimate reason for every task in your maintenance program. Make sure you link all tasks to a dominant failure mode. And have clear priorities for all maintenance tasks, so you know which work comes first when resources run short. In the real world, we are all resource-constrained.
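
A quick way to start such a clean-up is to screen your task list for two warning signs from this section: tasks that are not linked to any failure mode, and failure modes with more than one task. The sketch below assumes the PM program can be exported as simple records with a `task` and a `failure_mode` field. That layout, the `audit_pm_tasks` function and the sample tasks are all hypothetical.

```python
from collections import defaultdict


def audit_pm_tasks(tasks):
    """Screen a PM task list for two warning signs.

    Returns a tuple (unlinked, duplicated):
      unlinked   - names of tasks with no failure mode recorded
      duplicated - failure modes mapped to their tasks, where more than one exists
    """
    unlinked = []
    tasks_by_mode = defaultdict(list)
    for record in tasks:
        mode = record.get("failure_mode")
        if mode:
            tasks_by_mode[mode].append(record["task"])
        else:
            unlinked.append(record["task"])
    duplicated = {mode: names for mode, names in tasks_by_mode.items() if len(names) > 1}
    return unlinked, duplicated


# Hypothetical extract from a PM program.
pm_tasks = [
    {"task": "Vibration check on pump bearing", "failure_mode": "Bearing wear"},
    {"task": "Replace pump bearing every 2 years", "failure_mode": "Bearing wear"},
    {"task": "Start test of standby pump", "failure_mode": "Fails to start on demand"},
    {"task": "General visual check", "failure_mode": None},
]

unlinked, duplicated = audit_pm_tasks(pm_tasks)
print("No failure mode linked:", unlinked)
print("More than one task per failure mode:", duplicated)
```

A flagged task is not automatically waste. Multiple tasks can be justified for very high consequence failure modes, so treat the output as a list of questions to ask, not a list of tasks to delete.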

Principle #9: Good maintenance programs become better maintenance programs

The most effective maintenance programs are dynamic. They are changing and improving continuously. Always making better use of our scarce resources. Always becoming more effective at preventing those failures that matter to our business.

When improving your maintenance program you need to understand that not all improvements have the same leverage:

First, focus on eliminating unnecessary maintenance tasks. This eliminates the direct maintenance cost of labour and materials. But it also removes the effort required to plan, schedule, manage, and report on this work.

Second, change time-based overhaul or replacement tasks into condition-based tasks. Instead of replacing a component every so many hours, use a condition monitoring technique to assess how much life the component has left. And only replace the component when actually required.

And third, extend task intervals. Do this based on data analysis, on operator and maintainer experience, or simply on good engineering judgment. Remember to observe the results.

The shorter the current interval, the greater the impact when extending that interval. For example, adjusting a daily task to weekly reduces the required PM workload for that task by more than 80%.

This is often the simplest and one of the most effective improvements you can make.
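
The daily-to-weekly example can be checked with a few lines of Python. The sketch assumes a 365-day year and a task that is due on every calendar day, not only on working days. The function names are hypothetical.

```python
DAYS_PER_YEAR = 365  # assumption: calendar days, not working days


def executions_per_year(interval_days):
    """Number of times a task is done per year at a given interval."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    return DAYS_PER_YEAR / interval_days


def interval_extension_savings(old_interval_days, new_interval_days):
    """Return (executions saved per year, fractional workload reduction)."""
    before = executions_per_year(old_interval_days)
    after = executions_per_year(new_interval_days)
    saved = before - after
    return saved, saved / before


for old, new in [(1, 7), (7, 14), (30, 60)]:
    saved, fraction = interval_extension_savings(old, new)
    print(f"{old:>2} -> {new:>2} days: {saved:6.1f} executions saved per year ({fraction:.0%})")
```

Going from daily to weekly removes six out of every seven executions. Doubling a monthly interval still halves the workload for that task, but it saves only a handful of executions per year. That is why short-interval tasks are the first place to look.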

Reliability Centered Maintenance references

I wrote this article based on a number of key sources. I strongly recommend getting yourself a copy of Moubray’s book if you don’t already own one. And I’d definitely get the NAVSEA RCM manual, as it’s well written and easy to understand.


