VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely

Tuesday, March 3, 2009

VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely
KB Article 1008130
Updated Jan. 30, 2009
Products
VMware ESX
Product Versions
VMware ESX 3.5.x
VMware ESXi 3.5.x Embedded
VMware ESXi 3.5.x Installable
Symptoms

One or more of the following may be present:

  • VMware ESX or ESXi host might get disconnected from VirtualCenter.
     
  • All paths to the LUNs are in standby state.
     
  • esxcfg-rescan might take a long time to complete or never completes (hung).
     
  • Error messages matching this pattern are repeated continually in vmkernel:
    <date and time> <hostname> vmkernel: <server uptime> cpu6:1177)SCSI: 675: Queue for device vml.<Vol. Dev. ID> has been blocked for 7 seconds.
    <date and time> <hostname> vmkernel: <server uptime> cpu7:1184)SCSI: 675: Queue for device vml.<Vol. Dev. ID> has been blocked for 6399 seconds.
     
    If you look at log entries previous to the first blocked message, you will see storage events and a failover attempt.
    Example:
    <date and time> <hostname> vmkernel: 31:19:32:26.199 cpu3:3824)Fil3: 5004:  READ error 0xbad00e5
    <date and time> <hostname> vmkernel: 31:19:32:29.224 cpu1:3961)StorageMonitor: 196: vmhba0:0:0:0 status = 0/5 0x0 0x0 0x0
    <date and time> <hostname> vmkernel: 31:19:32:29.382 cpu2:1144)FS3: 5034: Waiting for timed-out heartbeat [HB state abcdef02 offset 3736576 gen 26 stamp 2748610023852 uuid 4939b0cf-c85aa695-158d-00144f021dd4 jrnl <FB 383397> drv 4.31]
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)<6>qla2xxx_eh_device_reset(1): device reset failed
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: Reset during HBA failover on vmhba1:2:1 returns Failure
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 3746: Could not switchover to vmhba1:2:1. Check Unit Ready Command returned an error instead of NOT READY for standby controller .
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4622: Manual switchover to vmhba1:2:1 completed unsuccessfully.
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)StorageMonitor: 196: vmhba0:2:1:0 status = 0/1 0x0 0x0 0x0
    <date and time> <hostname> vmkernel: 31:19:32:29.640 cpu2:1067)scsi(1): Waiting for LIP to complete...
    <date and time> <hostname> vmkernel: 31:19:32:29.640 cpu2:1067)<6>qla2x00_fw_ready ha_dev_f=0xc
    <date and time> <hostname> vmkernel: 31:19:32:30.532 cpu2:1026)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
    <date and time> <hostname> last message repeated 31 times
    <date and time> <hostname> vmkernel: 31:19:32:31.535 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x81
    <date and time> <hostname> vmkernel: 31:19:32:31.541 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x82
    <date and time> <hostname> vmkernel: 31:19:32:31.547 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x83
    <date and time> <hostname> vmkernel: 31:19:32:31.568 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x84
    <date and time> <hostname> vmkernel: 31:19:32:31.573 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x85
    <date and time> <hostname> vmkernel: 31:19:32:31.576 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x86
    <date and time> <hostname> vmkernel: 31:19:32:32.531 cpu2:4267)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
    <date and time> <hostname> last message repeated 31 times
    <date and time> <hostname> vmkernel: 31:19:32:32.532 cpu1:3973)StorageMonitor: 196: vmhba0:0:0:0 status = 2/0 0x6 0x29 0x0


Resolution
This issue can occur on VMware ESX servers under the following conditions:
  • Hypervisor version: VMware ESX 3.5 U3.
  • SAN hardware: Active/Passive and Active/Active arrays (Fibre Channel and iSCSI).
  • Trigger: This occurs when VMFS3 metadata updates are being done at the same time failover to an alternate path occurs for the LUN on which the VMFS3 volume resides .
A reboot is required to clear this condition.
 
Note: This issue is resolved in the following patches, released Jan. 30, 2009:

 
 
 

Popular Posts