Excuse that deliberately Google-optimized, blunt and inelegant title, folks, but this is getting old. If you run dual-Primary DRBD, and then export an iSCSI target from both nodes, and then you want to do dm-multipath or somesuch for what you think constitutes failover, don’t do that. There. Bold and italics. Really and truly, don’t.
DRBD core developer Lars Ellenberg has made yet another attempt at explaining why this won’t work in a post to the drbd-user mailing list last night. Here’s an excerpt:
“Dual-primary” iSCSI targets for multipath: does not work. iSCSI is a stateful protocol, there is more to it than just reads and writes. To run multipath (or multiple connections per session) against distinct targets on separate nodes you’d need to have cluster aware iSCSI targets which coordinate with each other in some fashion. To my knowledge, this does not exist (not for Linux, anyways).
(Emphasis in original, I merely reformatted for HTML.)
He goes on to explain that these distinct targets just
happen to live on top of data that, due to replication, happens to be the same, most of the time, unless the replication link was lost for whatever reason; in which case you absolutely want to make sure that at least one box reboots hard before it even thinks about completing or even submitting another I/O request…
Please, folks, listen to Lars. This is all very much in line with the “DRBD doesn’t do magic” note we’ve had in the User’s Guide for several years now:
DRBD is, by definition and as mandated by the Linux kernel architecture, agnostic of the layers above it. Thus, it is impossible for DRBD to miraculously add features to upper layers that these do not possess.
So please, if you’re seeing the “Concurrent local write detected” warning, or the tell-all “DRBD is not a random data generator!” message in your logs, don’t come complaining. And even if you don’t see them yet, you will, eventually.
If you think you must run on dual-Primary DRBD, then run a cluster-aware service on top of it. Such as cLVM, or OCFS2, or GFS2, or even live-migration-capable KVM under Pacemaker management with fencing, but not iSCSI. Unless you write a cluster-aware iSCSI target.
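For what it’s worth, the DRBD side of such a setup is the easy part to express; it’s everything above it that has to be cluster aware. Here is a minimal sketch, assuming DRBD 8.4-style syntax, with made-up host names, backing disks and addresses:

```
# Hypothetical dual-Primary resource sketch (DRBD 8.4-style syntax;
# hosts "alice"/"bob", disks and IPs are made up). Dual-Primary only
# makes sense with a cluster-aware user such as GFS2, OCFS2 or cLVM
# on top -- never a plain iSCSI target.
resource r0 {
  net {
    protocol C;                # synchronous replication, required for dual-Primary
    allow-two-primaries yes;   # both nodes may promote; only safe with fencing
  }
  disk {
    fencing resource-and-stonith;  # freeze I/O and fence the peer on link loss
  }
  on alice {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.1.1:7789;
    meta-disk internal;
  }
  on bob {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.1.2:7789;
    meta-disk internal;
  }
}
```

Note that allow-two-primaries without working fencing is a recipe for split brain the moment the replication link drops.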
Until then, run iSCSI on single-Primary DRBD under Pacemaker management. Configure your iSCSI initiator properly so it does not throw I/O errors on failover. If you don’t know how, ask someone who does. Find us on IRC, or on the mailing lists; we’re happy to help. Or give us a call.
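To illustrate the single-Primary approach, a hedged crm shell sketch follows; the resource names, IQN, device and IP address are all made up, and you would adapt them to your environment:

```
# Hypothetical Pacemaker (crm shell) sketch: one DRBD Master per
# cluster; the iSCSI target, LU and service IP follow the Master.
primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 master-node-max=1 \
         clone-max=2 clone-node-max=1 notify=true
primitive p_target ocf:heartbeat:iSCSITarget \
    params iqn="iqn.2011-01.com.example:r0"
primitive p_lu ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2011-01.com.example:r0" lun=1 path="/dev/drbd0"
primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.100 cidr_netmask=24
group g_iscsi p_target p_lu p_ip
colocation c_iscsi_on_drbd inf: g_iscsi ms_drbd_r0:Master
order o_drbd_before_iscsi inf: ms_drbd_r0:promote g_iscsi:start
```

On the initiator side, the main knob in open-iscsi is node.session.timeo.replacement_timeout in iscsid.conf: raise it above your worst-case failover time so I/O is queued, rather than failed, while the target moves between nodes.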
One option you come close to talking about is dual-Primary with cLVM, and then running Pacemaker-controlled iSCSI on top of that. It seems like that would avoid the problems you’re talking about, and it also seems a little easier to think about, but that may just be me. With a dedicated IP for each iSCSI target you could even split the load between the heads, if that seemed to make sense.
Is this a valid approach to things?
No, in most cases that’s not a valid approach to things. Please re-read Lars’ original mailing list comment; adding cLVM to the mix doesn’t make iSCSI any less stateful as a protocol.
SCST iSCSI seems to handle this OK.
http://sourceforge.net/apps/mediawiki/scst/index.php?title=SCST,_DRBD_and_Dual_Primary_Mode
It’s a fair point to make that if they’re documenting this themselves, it should probably work. I will concede that of the four iSCSI target implementations for Linux, SCST is the one I’m least familiar with, so I’ll take their word for it.
What makes me raise an eyebrow on that wiki page, though, is the passing remark that SCST apparently does not store its persistent reservation (PR) information on the exported device. Thus, PRs would be “forgotten” on failover in an active/passive cluster, and would only be effective on one of the nodes in an active/active one. That gives me the creeps a bit.
I’m guessing Open-E is using DRBD in their solution, and that it is being marketed as an active-active solution.
I can’t speak for open-e, but do note that active-active does not necessarily imply dual-Primary DRBD. You can have two separate DRBD resources, one of which is Primary on each node, and thus build an active-active solution that makes no use of dual-Primary at all.
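As a hedged sketch of what that could look like in Pacemaker’s crm shell (assuming two single-Primary DRBD resources r0 and r1 with corresponding primitives already defined; the node names are made up):

```
# Hypothetical crm shell fragment: two single-Primary DRBD resources,
# each preferring a different node, give active-active at the cluster
# level without any dual-Primary DRBD.
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_drbd_r1 p_drbd_r1 \
    meta master-max=1 clone-max=2 clone-node-max=1 notify=true
location l_r0_on_alice ms_drbd_r0 rule $role=Master 100: #uname eq alice
location l_r1_on_bob   ms_drbd_r1 rule $role=Master 100: #uname eq bob
```

If either node fails, the surviving one simply becomes Primary for both resources, so you keep failover without ever running both nodes as Primary for the same data.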
That actually seems to be what Open-E’s doing!
Thanks for sharing this information.
So finally, what would you say is a good working solution for running failover storage which exports via iSCSI, with cLVM on top?
If using a Primary/Secondary setup on DRBD with failover handled by Pacemaker or something, the iSCSI session will get lost, won’t it? Is there any “cluster alternative” to iSCSI? AoE or something?
What do you prefer?
What I prefer, today? Ceph RBD.
It exists. Linux based. Neither open source nor free, but AFAIK SvSAN from StorMagic does exactly that.
You do realize that you made that comment on a blog post that is almost 6 years old.
Never mind, SvSAN existed at that time, so the comment would still be valid if we were six years in the past… 😉