Error in vxdg configuration copies

A colleague and I ran into a pretty non-trivial issue with Veritas Volume Manager (VxVM) today. The host was an HP ProLiant DL380 G7 running Red Hat Enterprise Linux 5.8, hooked up to some IBM XIV SAN storage via Fibre Channel.

It was a pretty old host and had been up for about 600 days. vxdisk -eo alldgs list was reporting that one of my standard volumes was actually a snapshot!

What the heck? Had someone added a read/write snapshot to the diskgroup and extended onto it? Surely not!

I checked the array. Both disks were standard volumes. Hmm. And whew!

My next thought went to VxVM. The host had been up a while. Chances were that the storage configuration had changed quite frequently without sysadmins performing the correct disk removal / reassignment commands typically associated with that sort of work.

My next thought was, “Let’s restart VxVM! That’s impactless, right?” After freezing my cluster service groups, I performed the following:
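The captured session isn’t reproduced here, but the sequence was roughly along these lines (the service group name is a placeholder, and the -persistent freeze is an assumption):

    # Freeze the VCS service groups so the cluster doesn't react while VxVM is down
    # ("myapp_sg" is a placeholder group name)
    hagrp -freeze myapp_sg -persistent

    # Kill and restart the VxVM configuration daemon
    vxconfigd -k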

Uh oh. That doesn’t look too hot: vxconfigd wouldn’t come back, and there wasn’t a lot of info as to why. Let’s check /var/log/messages.

Good ol’ Veritas data corruption protection. Somewhere along the line, one of the LUNs presented to this system had been unpresented and then re-presented with a new LUN ID, causing all kinds of havoc.

It’s always a pain to resolve, but I’m in a particular pickle here, as VxVM wants me to run vxdisk rm to remove the badly behaved multipath devices from its configuration. I can’t issue this command until vxconfigd is running and in enabled mode, but I also can’t get vxconfigd running until the devices are removed! A true chicken-and-egg scenario…
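To make the loop concrete, this is roughly its shape (sdcy stands in for any of the problem devices):

    # vxconfigd has to be running and in enabled mode before vxdisk will talk to it
    vxdctl mode       # reports not-running / disabled / enabled

    # ...which means the removal VxVM is asking for can't actually be issued yet
    vxdisk rm sdcy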

Looking more closely at the log output, only three block devices were causing the problems: sdcy, sdcn and sdy. What disk group do they belong to?

A quick egrep of my disk group configuration backups in /etc/vx/cbr/bk (treat these with caution; they can be out of date, depending on how often administrative disk operations take place and how vxconfigbackupd is configured) showed that all three were members of the disk group that had been causing me problems: the one with the standard volume and the apparent ‘snapshot’.
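The check itself is just a recursive search of the backup tree; something like:

    # Which disk group backup mentions the problem block devices?
    egrep -rl 'sdcy|sdcn|sdy' /etc/vx/cbr/bk/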

I knew for a fact that:

  1. No I/O was passing through these disks (no filesystems were mounted).
  2. Disks in this disk group had six redundant paths to storage each. This meant that each SAN volume had six block devices that the OS could use to route I/O requests.
  3. I only wanted to get rid of three block devices.

Therefore, I was relatively happy performing the following:
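The exact commands aren’t shown in this write-up, but with vxconfigd down the work has to happen below VxVM. On RHEL 5, the usual way to drop a stale block device at the SCSI layer is the sysfs delete interface, so the operation would have looked something like this:

    # Remove the three stale block devices from the SCSI layer
    # (each LUN still has plenty of healthy paths, and nothing is doing I/O)
    echo 1 > /sys/block/sdcy/device/delete
    echo 1 > /sys/block/sdcn/device/delete
    echo 1 > /sys/block/sdy/device/delete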

After that, I gingerly started VxVM again and ran vxdisk -eo alldgs list once more…
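Again, a sketch of that sequence rather than the captured session:

    # Start the configuration daemon, ask it to rescan, then check the disk groups
    vxconfigd -k
    vxdctl enable
    vxdisk -eo alldgs list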

It’s back! And VxVM agrees that our ‘snapshot’ is now a standard volume! If we compared the output of vxdisk list xiv0_001 from before and after our action, we’d probably see that the names of the block devices representing our storage devices had changed slightly, too.

Clearly something had been done at the storage layer with this particular disk. Exactly what will probably always remain a mystery. But if nothing else, this proves that performing sanity reboots before starting a migration is a pretty good idea.