How to deal with kernel panics and oopses in HA clusters

Let’s face it. Software is never perfect. Even the most reliable of systems do produce kernel panics and oopses. It shouldn’t happen, but it does. As an admin responsible for the operation of a high availability cluster, you can keep your cluster service up and running on a healthy node — even if one of your cluster nodes runs into a panic or oops. These are hints about some simple measures that help you do just that.

Note: what I describe in this post is a measure that reduces system down time in the event of a panic or oops, and may prevent data corruption or otherwise erratic behavior in such a situation. It does not absolve you of your job to fix the cause of the panic or oops (which you should do as quickly as possible in any event).

The tricky part about oopses and panics is that they make your system completely unreliable. The system may still appear to be up from a network perspective: it may respond to pings, and it may even continue to send its Heartbeat multicast packets, leading the peer node to believe that everything is all right. The cluster manager then has no reason to oust the faulty node from the cluster, or even to take over some of its resources. But at this point the node is a ticking time bomb.

Thus, you need to make sure that a cluster node that runs into such a situation forcibly removes itself from the cluster. You can think of this as a type of self-fencing. You can achieve this by making two simple entries in your sysctl config file, which typically lives at /etc/sysctl.conf:

kernel.panic_on_oops = 1
kernel.panic = 1

You enable these settings by issuing sysctl -p as root. What they do for you is this:

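  • kernel.panic = 1 makes the kernel reboot the system automatically one second after a kernel panic. The value is the delay in seconds; with a value of 0 the kernel does not reboot on its own after a panic.
  • kernel.panic_on_oops = 1 extends that behavior to oopses (the kernel will then treat any oops just like it treats a panic).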

Thus, as soon as your system runs into an oops or panic state that renders it unreliable, it will reboot almost immediately. At that point, the peer Heartbeat node will detect that the node is down, and this healthy node will take over all cluster resources.
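After running sysctl -p, it can be worth confirming that the running kernel actually picked up both values. The following is a minimal sketch in Python, not part of the original setup: it assumes the usual Linux convention that a sysctl key such as kernel.panic_on_oops is exposed as the file /proc/sys/kernel/panic_on_oops. The names EXPECTED, read_sysctl and find_mismatches are made up for this illustration.

```python
from pathlib import Path

# The two settings from this post: sysctl key -> value we expect to see.
EXPECTED = {
    "kernel.panic": "1",
    "kernel.panic_on_oops": "1",
}


def read_sysctl(key, proc_root="/proc/sys"):
    """Return the live value of a sysctl key, read from the /proc/sys tree."""
    # Assumption: dots in the key map to directory separators, e.g.
    # kernel.panic_on_oops -> /proc/sys/kernel/panic_on_oops.
    # That holds for the kernel.* keys used here.
    path = Path(proc_root).joinpath(*key.split("."))
    return path.read_text().strip()


def find_mismatches(expected=EXPECTED, proc_root="/proc/sys"):
    """Return {key: (wanted, actual)} for every setting that differs."""
    mismatches = {}
    for key, wanted in expected.items():
        try:
            actual = read_sysctl(key, proc_root)
        except OSError:
            actual = None  # key missing or not readable
        if actual != wanted:
            mismatches[key] = (wanted, actual)
    return mismatches
```

Calling find_mismatches() on a cluster node returns an empty dictionary when both settings are in effect. Any entry it does return names a key that still needs attention, together with the wanted and the actual value. The proc_root parameter exists only so that the check can be pointed at a test directory instead of the real /proc/sys.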

I repeat: this configuration does not let you off the hook in terms of fixing the root cause of the panic or oops. You need to treat any such occurrence as a grave issue that requires a permanent fix.
