Dual-primary DRBD, iSCSI, and multipath: Don’t Do That!

Excuse the deliberately Google-optimized, blunt and inelegant title, folks, but this is getting old. If you run dual-Primary DRBD, and then export an iSCSI target from both nodes, and then you want to do dm-multipath or some such for what you think constitutes failover, don’t do that. There. Bold and italics. Really and truly, don’t.

In a post to the drbd-user mailing list last night, DRBD core developer Lars Ellenberg has made yet another attempt at explaining why this won’t work. Here’s an excerpt:

“Dual-primary” iSCSI targets for multipath: does not work. iSCSI is a stateful protocol, there is more to it than just reads and writes. To run multipath (or multiple connections per session) against distinct targets on separate nodes you’d need to have cluster-aware iSCSI targets which coordinate with each other in some fashion. To my knowledge, this does not exist (not for Linux, anyways).

(Emphasis in original, I merely reformatted for HTML.)

He goes on to explain that these distinct targets just

happen to live on top of data that, due to replication, happens to be the same, most of the time, unless the replication link was lost for whatever reason; in which case you absolutely want to make sure that at least one box reboots hard before it even thinks about completing or even submitting another I/O request…

Please, folks, listen to Lars. This is all very much in line with the “DRBD doesn’t do magic” note we’ve had in the User’s Guide for several years now:

DRBD is, by definition and as mandated by the Linux kernel architecture, agnostic of the layers above it. Thus, it is impossible for DRBD to miraculously add features to upper layers that these do not possess.

So please, if you’re seeing Concurrent local write detected or the tell-all DRBD is not a random data generator! message in your logs, don’t come complaining. And even if you don’t see them yet, you will, eventually.

If you think you must run dual-Primary DRBD, then run a cluster-aware service on top of it: cLVM, OCFS2, GFS2, or even live-migration-capable KVM under Pacemaker management with fencing. But not iSCSI, unless you write a cluster-aware iSCSI target yourself.
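For illustration, a dual-Primary resource with proper fencing might look roughly like this in a DRBD 8.x configuration (a sketch only; the resource name is an example, and the handler paths may vary by distribution):

```
resource r0 {
  net {
    # permit both nodes to be Primary at the same time
    allow-two-primaries;
  }
  disk {
    # freeze I/O and fence the peer if the replication link is lost
    fencing resource-and-stonith;
  }
  handlers {
    # Pacemaker-based fence/unfence helper scripts shipped with DRBD
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

Note that none of this makes iSCSI on top of it safe; it is the baseline you need before even a genuinely cluster-aware service (cLVM, OCFS2, GFS2) can run on dual-Primary DRBD without eating your data on a replication link failure.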

Until then, run iSCSI on single-Primary DRBD under Pacemaker management. Configure your iSCSI initiator properly so it does not throw I/O errors on failover. If you don’t know how, ask someone who does. Find us on IRC, or on the mailing lists; we’re happy to help. Or give us a call.
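As a sketch of what “configure your initiator properly” can mean with the open-iscsi initiator (the value below is an example for a non-multipath Pacemaker failover setup, not a universal recommendation):

```
# /etc/iscsi/iscsid.conf (open-iscsi initiator)
#
# How long the initiator queues I/O after losing the session before
# failing it up the stack. The default of 120 seconds may be too short
# for a Pacemaker-driven target failover; raising it keeps applications
# from seeing I/O errors while the target address moves to the peer node.
node.session.timeo.replacement_timeout = 86400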

12 Responses to Dual-primary DRBD, iSCSI, and multipath: Don’t Do That!

  1. Joe Pruett says:

    one option you come close to talking about is dual primary with clvm and then run pacemaker controlled iscsi on top of that. it seems like that will avoid the problems you’re talking about. it also seems a little bit easier to think about, but that may just be me. with a dedicated ip for each iscsi target you could even split the load between the heads if that seemed to make sense.

    is this a valid approach to things?

    • Florian Haas says:

      No, in most cases that’s not a valid approach to things. Please re-read Lars’ original mailing list comment; adding cLVM to the mix doesn’t make iSCSI any less stateful as a protocol.

    • Florian Haas says:

      It’s a fair point to make that if they’re documenting this themselves, it should probably work. I will concede that of the 4 iSCSI implementations for Linux, SCST is the one I’m least familiar with, so I’ll take their word for it.

      What has me raise an eyebrow in that wiki page, though, is the passing remark that SCST seems to not store its PR information persistently on the exported device, thus PRs would be “forgotten” on failover in an active/passive cluster, and would only be effective on one of the nodes in an active/active one. That gives me the creeps a bit.

  2. Mikael says:

I’m guessing open-e is using DRBD in their solution, and that it’s being marketed as an active-active solution.

    • Florian Haas says:

      I can’t speak for open-e, but do note that active-active does not necessarily imply dual-Primary DRBD. You can have two separate DRBD resources, one of which is Primary on each node, and thus build an active-active solution that makes no use of dual-Primary at all.
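      A minimal sketch of that layout, with two independent, single-Primary resources (host names, devices, and addresses below are made up):

      ```
      resource r0 {               # normally Primary on node alpha
        device    /dev/drbd0;
        disk      /dev/vg0/lv_r0;
        meta-disk internal;
        on alpha { address 192.168.1.1:7788; }
        on bravo { address 192.168.1.2:7788; }
      }
      resource r1 {               # normally Primary on node bravo
        device    /dev/drbd1;
        disk      /dev/vg0/lv_r1;
        meta-disk internal;
        on alpha { address 192.168.1.1:7789; }
        on bravo { address 192.168.1.2:7789; }
      }
      ```

      Pacemaker would then prefer to promote r0 on one node and r1 on the other, and either node can take over both resources on failover. Neither resource ever needs allow-two-primaries.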

  3. Patrick says:

    thanks for sharing this information.
    so finally, what would you say is a good working solution for running failover storage that exports via iscsi, with clvm on top?
    if using a primary/secondary setup on drbd with failover handled by pacemaker or something, the iscsi session will get lost, won’t it? is there any “cluster alternative” to iscsi? aoe or something?
    what do you prefer?

  4. Roland says:

    It exists. Linux-based. Neither open source nor free – but AFAIK SvSAN from StorMagic does exactly that.
