Dual-primary DRBD, iSCSI, and multipath: Don’t Do That!

Excuse the deliberately Google-optimized, blunt and inelegant title, folks, but this is getting old. If you run dual-Primary DRBD, and then export an iSCSI target from both nodes, and then you want to do dm-multipath or some such for what you think constitutes failover, don’t do that. There. Bold and italics. Really and truly, don’t.

In a post to the drbd-user mailing list last night, DRBD core developer Lars Ellenberg has made yet another attempt at explaining why this won’t work. Here’s an excerpt:

“Dual-primary” iSCSI targets for multipath: does not work. iSCSI is a stateful protocol, there is more to it than just reads and writes. To run multipath (or multiple connections per session) against distinct targets on separate nodes you’d need to have cluster-aware iSCSI targets which coordinate with each other in some fashion. To my knowledge, this does not exist (not for Linux, anyways).

(Emphasis in original, I merely reformatted for HTML.)

He goes on to explain that these distinct targets just

happen to live on top of data that, due to replication, happens to be the same, most of the time, unless the replication link was lost for whatever reason; in which case you absolutely want to make sure that at least one box reboots hard before it even thinks about completing or even submitting another I/O request…

Please, folks, listen to Lars. This is all very much in line with the “DRBD doesn’t do magic” note we’ve had in the User’s Guide for several years now:

DRBD is, by definition and as mandated by the Linux kernel architecture, agnostic of the layers above it. Thus, it is impossible for DRBD to miraculously add features to upper layers that these do not possess.

So please, if you’re seeing Concurrent local write detected or the tell-all DRBD is not a random data generator! message in your logs, don’t come complaining. And even if you don’t see them yet, you will, eventually.

If you think you must run dual-Primary DRBD, then run a cluster-aware service on top of it: cLVM, OCFS2, GFS2, or even live-migration-capable KVM under Pacemaker management with fencing. But not iSCSI, unless you write a cluster-aware iSCSI target yourself.
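For illustration, a dual-Primary resource with proper fencing might look roughly like this in a DRBD 8.x configuration (a sketch only; the resource name is an example, and the handler paths may vary by distribution):

```
resource r0 {
  net {
    # permit both nodes to be Primary at the same time
    allow-two-primaries;
  }
  disk {
    # freeze I/O and fence the peer if the replication link is lost
    fencing resource-and-stonith;
  }
  handlers {
    # Pacemaker-based fence/unfence helper scripts shipped with DRBD
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

Note that none of this makes iSCSI on top of it safe; it is the baseline you need before even a genuinely cluster-aware service (cLVM, OCFS2, GFS2) can run on dual-Primary DRBD without eating your data on a replication link failure.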

Until then, run iSCSI on single-Primary DRBD under Pacemaker management. Configure your iSCSI initiator properly so it does not throw I/O errors on failover. If you don’t know how, ask someone who does. Find us on IRC, or on the mailing lists; we’re happy to help. Or give us a call.
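As a sketch of what “configure your initiator properly” can mean with the open-iscsi initiator (the value below is an example for a non-multipath Pacemaker failover setup, not a universal recommendation):

```
# /etc/iscsi/iscsid.conf (open-iscsi initiator)
#
# How long the initiator queues I/O after losing the session before
# failing it up the stack. The default of 120 seconds may be too short
# for a Pacemaker-driven target failover; raising it keeps applications
# from seeing I/O errors while the target address moves to the peer node.
node.session.timeo.replacement_timeout = 86400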

12 Responses to Dual-primary DRBD, iSCSI, and multipath: Don’t Do That!

  1. Joe Pruett says:

    one option you come close to talking about is dual primary with clvm and then run pacemaker controlled iscsi on top of that. it seems like that will avoid the problems you’re talking about. it also seems a little bit easier to think about, but that may just be me. with a dedicated ip for each iscsi target you could even split the load between the heads if that seemed to make sense.

    is this a valid approach to things?

    • Florian Haas says:

      No, in most cases that’s not a valid approach to things. Please re-read Lars’ original mailing list comment; adding cLVM to the mix doesn’t make iSCSI any less stateful as a protocol.

    • Florian Haas says:

      It’s a fair point to make that if they’re documenting this themselves, it should probably work. I will concede that of the 4 iSCSI implementations for Linux, SCST is the one I’m least familiar with, so I’ll take their word for it.

      What has me raise an eyebrow in that wiki page, though, is the passing remark that SCST seems to not store its PR information persistently on the exported device, thus PRs would be “forgotten” on failover in an active/passive cluster, and would only be effective on one of the nodes in an active/active one. That gives me the creeps a bit.

  2. Mikael says:

I’m guessing open-e is using DRBD in their solution, and that it’s being marketed as an active-active solution.

    • Florian Haas says:

      I can’t speak for open-e, but do note that active-active does not necessarily imply dual-Primary DRBD. You can have two separate DRBD resources, one of which is Primary on each node, and thus build an active-active solution that makes no use of dual-Primary at all.
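      A minimal sketch of that layout, with two independent, single-Primary resources (host names, devices, and addresses below are made up):

      ```
      resource r0 {               # normally Primary on node alpha
        device    /dev/drbd0;
        disk      /dev/vg0/lv_r0;
        meta-disk internal;
        on alpha { address 192.168.1.1:7788; }
        on bravo { address 192.168.1.2:7788; }
      }
      resource r1 {               # normally Primary on node bravo
        device    /dev/drbd1;
        disk      /dev/vg0/lv_r1;
        meta-disk internal;
        on alpha { address 192.168.1.1:7789; }
        on bravo { address 192.168.1.2:7789; }
      }
      ```

      Pacemaker would then prefer to promote r0 on one node and r1 on the other, and either node can take over both resources on failover. Neither resource ever needs allow-two-primaries.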

  3. Patrick says:

    thanks for sharing this information.
    so finally, what would you say is a good working solution for running failover storage that exports via iscsi, with clvm on top?
    if using a primary/secondary setup on drbd with failover handled by pacemaker or something, the iscsi session will get lost, won’t it? is there any “cluster alternative” to iscsi? aoe or something?
    what do you prefer?

  4. Roland says:

    It exists. Linux-based. Neither open source nor free – but AFAIK SvSAN from StorMagic does exactly that.
