Checking your Secondary’s integrity

Suppose you run into the following situation: you’ve suffered the effects of a nasty bug hidden somewhere deep inside your network stack. Say your NIC screwed up checksum offloading, or some network driver caused a kernel panic. Either way, you have reason to believe that some portion of the data you sent over that wire can’t be trusted. Suppose further that you ran DRBD over the affected network link.

How can you check the integrity of the replicated data without taking down your Primary, and thus without interrupting your service? Here’s how.

With DRBD+ and DRBD post-8.2.3

With DRBD+ and DRBD in versions 8.2.3 and above, you simply issue drbdadm verify resource. No service interruption, no loss of redundancy in the process, no fiddling with the Secondary at all. See this post for details.
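A minimal sketch of how this might be scripted. The resource name r0 is hypothetical, and the oos: field is how newer DRBD releases report out-of-sync blocks in /proc/drbd — check against your own version; the sample line below stands in for the live file:

```shell
#!/bin/sh
# Run the online verify -- this requires a verify-alg setting in the
# resource's net section, and "r0" is a hypothetical resource name:
#   drbdadm verify r0
# When the verify finishes, a non-zero oos: counter in /proc/drbd means
# out-of-sync blocks were found. Parsing a sample statistics line here:
line='ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0'
oos=$(printf '%s\n' "$line" | grep -o 'oos:[0-9]*' | cut -d: -f2)
if [ "${oos:-0}" -eq 0 ]; then
    echo "verify clean: no out-of-sync blocks"
else
    echo "verify found $oos blocks out of sync"
fi
```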

With DRBD 0.7

If you’re still on DRBD 0.7, your course of action is this:

  1. Log on to your current Secondary. Stop all Heartbeat services; this ensures that Heartbeat doesn’t fail over while you are conducting the tests to follow. Henceforth, I’ll refer to the node where Heartbeat is shut down, and where you’ll be doing the steps described here, as the offline node. The one where your service continues to run is, naturally, the online node.
  2. Check /proc/drbd to make sure that all of your DRBDs are now in Secondary mode.
  3. Run drbdadm disconnect resource, where resource is the name of the DRBD resource whose integrity you are about to investigate.
  4. Check /proc/drbd again; that resource should now show cs:StandAlone as its connection state.
  5. Make the resource Primary. DRBD will allow this due to the resource’s disconnected state: drbdadm primary resource
  6. Now you can run your integrity check. Typically this would involve something like running fsck /dev/drbdnum followed by fsck -f /dev/drbdnum to check the integrity of the file system configured on the device. You may also mount the device and run some application-specific tests.
  7. When you have completed checking, stop all applications using the resource. Unmount the device.
  8. Make your resource Secondary again on the offline node: drbdadm secondary resource
  9. Double-check /proc/drbd again; you absolutely want to make sure that the device is Secondary.
  10. Repeat steps 3–9 for any other resources you want to check. 🙂
  11. Now you can connect the resource again: drbdadm connect resource
  12. DRBD will now recover from the split brain you deliberately created. You probably won’t notice unless you’re watching your kernel log, but that’s what happens behind the scenes. This behavior was changed for a reason in DRBD 8, by the way; see below for details.
  13. In /proc/drbd, you should now be able to observe that the device changes its connection state to SyncTarget, and that it is synchronizing with the online node again. Any changes you made on the offline node while it was Primary are discarded in the process, and any changes made to the online node during the connection interruption propagate.
  14. When everything is in sync, the connection status will change to Connected. Your online node will still be Primary, and the node you had taken offline will be Secondary, as before.
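The checks in steps 2, 4, 9, 13 and 14 all come down to reading the cs: (connection state) and st: (node roles) fields out of /proc/drbd. A sketch of that extraction, run against a sample line rather than the live file — the sample’s field layout is an assumption, so compare it with your own /proc/drbd:

```shell
#!/bin/sh
# Pull the cs: and st: fields from a /proc/drbd device line. Normally
# you would pipe in /proc/drbd itself; a sample line is used here,
# showing what step 4 expects after "drbdadm disconnect":
sample=' 0: cs:StandAlone st:Secondary/Unknown ld:Consistent'
printf '%s\n' "$sample" | grep -oE '(cs|st):[A-Za-z/]*'
# prints two lines: cs:StandAlone and st:Secondary/Unknown
```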

If your checks on the offline node detected any suspicious inconsistencies, you should now issue drbdadm invalidate resource on your Secondary, which will force a full sync of resource from the Primary (which has the reliable data). If you detected no inconsistencies, you have successfully validated your data and may retire for a coffee break.
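A sketch of what watching that forced full sync might look like, again against a sample /proc/drbd line rather than the live file (the resource name r0 and the sample’s field layout are assumptions):

```shell
#!/bin/sh
# After "drbdadm invalidate r0" on the Secondary, /proc/drbd shows the
# device as SyncTarget until the full sync from the Primary completes,
# at which point the connection state returns to Connected.
sample=' 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent'
case "$sample" in
    *cs:SyncTarget*) echo "full sync from Primary in progress" ;;
    *cs:Connected*)  echo "sync complete, device consistent again" ;;
esac
```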

Needless to say, while you are conducting this integrity check, your service is temporarily not redundant. You may need to inform your boss (or customer) of that fact and seek their approval. Temporarily not being redundant still beats the heck out of temporarily being out of service, I might add.

With DRBD 8 (pre-8.2.3)

In DRBD 8, automatic split brain recovery is disabled by default. When your DRBD cluster detects split brain, DRBD will disconnect. This is a deliberate discontinuity from DRBD 0.7 that many users asked for — after split brain, most people want manual control over the recovery. To manually restore connectivity after your deliberately-induced split brain, do:

  • on the offline node: drbdadm -- --discard-my-data connect resource
  • on the online node: drbdadm connect resource

This replaces step 12 above. After this, DRBD will reconnect and selectively copy those blocks that differ between both nodes, making sure the offline node’s data is consistent with the online node again.
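One way to spot the dropped connection before reconnecting is to look for DRBD’s split-brain complaint in the kernel log. A sketch, using a sample log line whose exact wording is an assumption — check dmesg on your own nodes:

```shell
#!/bin/sh
# Normally you would grep the output of dmesg; a sample kernel log line
# stands in here. The message text is an assumption based on typical
# DRBD 8 logs.
logline='drbd0: Split-Brain detected, dropping connection!'
if printf '%s\n' "$logline" | grep -q 'Split-Brain detected'; then
    echo "split brain: reconnect with --discard-my-data on the offline node"
fi
```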

If you really really really want to (hint: you don’t), you can emulate DRBD 0.7’s split brain recovery behavior. To do that, add the following lines to the net section in your resource configuration:

net {
    after-sb-0pri discard-younger-primary;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;
}

Did I mention you don’t want to do this? Anyway, you don’t want to do this. You run into split brain, you want to know. And fix it manually, really. Other options are available; make sure you understand their implications. Check your drbd.conf man page for details.

