lpfc RAID1 device panics when one device goes away
2004-02-04 - By Bruen, Mark
The configuration I'm using is:
LUN A -- Storage Processor A -- Fibre Channel Switch A -- HBA A -- /dev/sdc
LUN B -- Storage Processor B -- Fibre Channel Switch B -- HBA B -- /dev/sde
Two complete SCSI paths, one to each LUN (disk). Regardless of which component
fails in either path, the system "sees" the SCSI disk as failed and continues
I/O to the other device specified in /etc/raidtab and/or /etc/mdadm.conf.
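
For readers who have not seen one, a mirror like this is usually described in /etc/raidtab by a short stanza. The sketch below only illustrates that layout: the device names come from this thread, while the chunk-size and persistent-superblock values are assumptions borrowed from typical raidtools examples, not my actual file. The helper name raidtab_stanza is made up for the example.

```python
def raidtab_stanza(md="/dev/md10", members=("/dev/sdc", "/dev/sde")):
    """Render a raidtools-style RAID1 stanza for the given member devices.

    Assumptions (not taken from the thread): chunk-size 4,
    persistent-superblock 1, and no spare disks.
    """
    lines = [
        f"raiddev {md}",
        "    raid-level            1",
        f"    nr-raid-disks         {len(members)}",
        "    nr-spare-disks        0",
        "    persistent-superblock 1",
        "    chunk-size            4",
    ]
    # Each member is listed as a device line followed by its raid-disk index.
    for index, device in enumerate(members):
        lines.append(f"    device                {device}")
        lines.append(f"    raid-disk             {index}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(raidtab_stanza(), end="")
```

Each mirror half sits on its own path (sdc via HBA A, sde via HBA B), so losing a path should look to the md layer like losing one raid-disk.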
I seem to have fixed the panic problem by adjusting some of the lpfc driver
tunables to reduce the number of outstanding I/O requests and to zero the delay
before errors are reported upward to the SCSI driver. I'm convinced that when I
disabled one switch port to simulate a path failure, the timeout and/or the
number of outstanding I/O requests eventually caused a flood of failures to the
SCSI driver, which overflowed somewhere and caused the panic. I'm still working
on the tuning to make sure I haven't inadvertently caused poor I/O performance
while fixing the problem.
-Mark
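
A side note on reading the console output quoted further down: the "dev 08:40" in the I/O error lines is the device number in hex (major:minor), as 2.4-era kernels print it. The sketch below is one way to turn that into a disk name and to pull the sector numbers out of the lines that survived intact. It assumes the standard SCSI disk numbering (major 8, 16 minors per disk); the function names are made up for the example.

```python
import re

# Assumption: first SCSI disk major is 8 and every disk owns 16 minors
# (minor 0 of each group is the whole disk, 1-15 are partitions).
SD_MAJOR = 8
MINORS_PER_DISK = 16

# Matches only intact lines such as "I/O error: dev 08:40, sector 69792".
ERROR_RE = re.compile(
    r"I/O error: dev ([0-9a-fA-F]{2}:[0-9a-fA-F]{2}), sector (\d+)"
)


def sd_name(dev):
    """Map a hex 'major:minor' string such as '08:40' to a name like 'sde'."""
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != SD_MAJOR:
        raise ValueError(f"major {major} is not the first SCSI disk major")
    disk, part = divmod(minor, MINORS_PER_DISK)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else f"{name}{part}"


def failed_sectors(log):
    """Group sector numbers by device name, skipping garbled lines."""
    result = {}
    for dev, sector in ERROR_RE.findall(log):
        result.setdefault(sd_name(dev), []).append(int(sector))
    return result


if __name__ == "__main__":
    sample = (
        "I/O error: dev 08:40, sector 69792\n"
        "I/O error: dev 08`I/O sector 70536\n"
        "I/O error: dev 08:40, sector 70784\n"
    )
    print(failed_sectors(sample))
```

Minor 0x40 is 64, and 64 / 16 = 4, the fifth disk, so 08:40 is sde. That agrees with the "raid1: Disk failure on sde" message in the same output.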
Hamilton Andrew wrote:
> Are we talking about a failure of one of the HBAs or a failure of a
> drive? I thought we were talking about the HBA failing, which is far
> different from a drive failing.
>
> I agree with the LUNs part. That is exactly the way I see it as well.
> However, in your case you have 2 connections to the 2 LUNs. In a locally
> attached setup you typically have one SCSI connection to the raid array, not
> two. Granted, I would think that it wouldn't work any differently from
> the raid point of view if you had 1 SCSI connection or 2, but I assume
> that if you were using two connections to run your array and one of them
> went down, how would the system know how to handle a lost SCSI card?
> Panic, in my experience. I know there are hardware raid solutions that
> will fail over if one of the raid controllers fails. I also know that
> there are software solutions. But I think you have to have an external
> software piece to do it. The kernel/OS isn't going to know by default
> how to "fail over" the connection. If you had a drive fail, that's
> different. The software raid knows how to handle a drive failure.
> Handling a drive failure is fairly standard. If you have a backup then
> it just moves to that. But I think handling a SCSI failure would be
> very configuration dependent and would under normal circumstances cause
> a panic, unless you had something intervening to catch those kinds of
> failures.
>
> I have 1 internal SCSI controller, 1 SCSI card, and 1 HBA. All of them
> act as SCSI controllers. I'm also running software raid on the local
> drives and have a raid on the SAN. If I had a SCSI card failure and the
> SCSI card and the local SCSI controller were talking to the local raid,
> how would the machine react? I wouldn't even know how to tell it to
> fail over to the SAN raid or use the other SCSI card to talk to the raid
> and ignore the failure without some sort of intervening software.
>
> Drew
>
> -----Original Message-----
> From: Bruen, Mark [mailto:mbruen@(protected)]
> Sent: Friday, January 30, 2004 9:52 AM
> To: redhat-list@(protected)
> Subject: Re: lpfc RAID1 device panics when one device goes away
>
>
> Actually I view the configuration as identical to having two locally
> attached SCSI disks which are mirrored via software RAID1. The only
> difference is that the two "drives" (LUNs) are located on a storage array
> on a SAN. As far as the OS is concerned the two LUNs are just two
> separate SCSI drives. I'm speculating that the lpfc driver either does not
> handle this case or requires tuning parameters to be set in order to return
> the failed path information back up to the SCSI driver in a manner which
> won't cause a panic.
> -Mark
>
> Hamilton Andrew wrote:
> > Mark,
> >
> > I may be wrong here and maybe someone out there knows better, but I
> > don't think this will work without PowerPath. That allows your OS to
> > treat both your HBAs as one. And it load balances across the two
> > HBAs. Without that you have two independent connections to two LUNs
> > and that is what is causing the panic. You need something that will
> > treat both your connections as one connection. Even if both your HBAs
> > can talk to both LUNs the OS is not going to fail over to the one that
> > is working without some sort of go-between, and the kernel does not know
> > it can talk to both LUNs via either HBA. It just knows that it had 2
> > connections to the raid and one of them is gone so the raid is no longer
> > available. At least that is the way it would seem to work to me.
> >
> > My 2 cents. Let me know if you find out something different though.
> >
> > Drew
> >
> > -----Original Message-----
> > From: Bruen, Mark [mailto:mbruen@(protected)]
> > Sent: Friday, January 30, 2004 8:54 AM
> > To: redhat-list@(protected)
> > Subject: Re: lpfc RAID1 device panics when one device goes away
> >
> >
> > No, it worked once but then on the next test it panicked again. I'll keep
> > looking.
> > -Mark
> >
> > Hamilton Andrew wrote:
> > > Did that fix it? I have an EMC CX600 configured much the same way, but
> > > I'm using RHEL 2.1AS instead of 3.0. I'm sure there are a ton of
> > > differences between the two distros.
> > >
> > > -----Original Message-----
> > > From: Bruen, Mark [mailto:mbruen@(protected)]
> > > Sent: Wednesday, January 28, 2004 7:09 PM
> > > To: redhat-list@(protected)
> > > Subject: Re: lpfc RAID1 device panics when one device goes away
> > >
> > >
> > > I think I have fixed this by changing the partition type of each LUN's
> > > (disk) partition to "fd" (Linux raid auto).
> > >
> > > Bruen, Mark wrote:
> > > > That will be the config once Veritas and/or EMC support HBA path
> > > > failover on RedHat AS 3.0. Veritas will support it with DMP in version 4,
> > > > due in Q2/04; EMC has not committed to a date yet with PowerPath. In the
> > > > interim I'm trying to provide path failover using software RAID1 of two
> > > > hardware RAID5 LUNs, one on each path (two switches connected to two
> > > > storage processors connected to two HBAs per server).
> > > > -Mark
> > > >
> > > > Hamilton Andrew wrote:
> > > >
> > > > > What's your SAN? Why don't you configure your raid1 on the SAN and
> > > > > let it publish that raid group as 1 LUN? Are you using any kind of
> > > > > fibre switch between your cards and your SAN?
> > > > >
> > > > > Drew
> > > > >
> > > > > -----Original Message-----
> > > > > From: Bruen, Mark [mailto:mbruen@(protected)]
> > > > > Sent: Wednesday, January 28, 2004 3:28 PM
> > > > > To: redhat-list@(protected)
> > > > > Subject: lpfc RAID1 device panics when one device goes away
> > > > >
> > > > >
> > > > > I'm running RedHat AS 3.0 kernel 2.4.21-4.ELsmp on a Dell 1750 with 2
> > > > > Emulex LP9002DC-E HBAs. I've configured a RAID1 device called /dev/md10
> > > > > from 2 SAN-based LUNs, /dev/sdc and /dev/sde. Everything works fine until I
> > > > > disable one of the HBA paths to the disk. Here's the console output:
> > > > > [root@(protected) root]# !lpfc1:1306:LKe:Link Down Event received Data: x2 x2 x0 x20
> > > > > I/O error: dev 08:40, sector 69792
> > > > > raid1: Disk failure on sde, disabling device.
> > > > > Operation continuing on 1 devices
> > > > > md10: vno@ pspar2e! d?i@
> > > > > s@(protected) tAo rec@(protected)`rIu/Oc
> > > > > t
> AaqArra@(protected)!@
> > > > > -v-@ cpont
> > > > > inI/uOinhgr oihn de_g_r_a_m@(protected)@`@ 70288
> > > > > I/O error: dev 08`I/O sector 70536
> > > > > I/O error: dev 08:40, sector 70784
> > > > > I/O error: dev 08:40, sector 71032
> > > > > I/O error: dev 08:40, sector 71280
> > > > > I/O error@(protected)@(protected)@(protected)!?@
> > > > > AqA@(protected)`I/O
> > > > > BqA@(protected)@(protected)@(protected) I/Oh 7h____mv@`dev
> 08:40,
> > > > > sector 72024
> > > > > `I/Oerror: dev 08:40, sector 72272
> > > > > I/O error: dev 08:40, sector 72520
> > > > > I/O error: dev 08:40, sector 72768
> > > > > I/O error: dev 08:40, sector 73@(protected)@(protected)@(protected)!?@
> > > > > BqA@(protected)`I/O
> > > > > CqA@(protected)@(protected)@(protected)
> > > > > I/Ohdeh____mv@`2
> > > > > I/O error: dev 08:40, `I/Oor 73760
> > > > > I/O error: dev 08:40, sector 74008
> > > > > I/O error: dev 08:40, sector 74256
> > > > > I/O error: dev 08:40, sector 74504
> > > > > I/O error: dev@(protected)@(protected)@(protected)!?@
> > > > > CqA@(protected)`I/O
> > > > > DqA@(protected)@(protected)@(protected) I/Oh0
> > > > > h____mv@`8:40, sector 75248
> > > > > I/O e`I/O: dev 08:40, sector 75496
> > > > > I/O error: dev 08:40, sector 75744
> > > > > I/O error: dev 08:40, sector 75992
> > > > > I/O error: dev 08:40, sector 76240
> > > > > <@(protected)@(protected)@(protected)!?@
> > > > > DqA@(protected)`I/O
> > > > > EqA@(protected)@(protected)@(protected) I/Oh8:h____mv@` I/O error: dev
> > 08:40,
> > > > > secto`I/O984
> > > > > I/O error: dev 08:40, sector 77232
> > > > > I/O error: dev 08:40, sector 77480
> > > > > I/O error: dev 08:40, sector 77728
> > > > > I/O error: dev 08:4@(protected)@(protected)@(protected)!?@
> > > > > EqA@(protected)`I/O
> > > > > FqA@(protected)@(protected)@(protected) I/Oh
> Ih____mv@`
> > > > > sector 78352
> > > > > I/O error:`I/O 08:40, sector 78600
> > > > > I/O error: dev 08:40, sector 78848
> > > > > I/O error: dev 08:40, sector 79096
> > > > > I/O error: dev 08:40, sector 79344
> > > > > I/@(protected)@(protected)@(protected)!?@
> > > > > FqA@(protected)`I/O
> > > > > GqA@(protected)@(protected)@(protected) I/Oh sh____mv@`error: dev
> 08:40,
> > > > > sector
> > > > > 800`I/O4 > I/O error: dev 08:40, sector 80336
> > > > > I/O error: dev 08:40, sector 80584
> > > > > I/O error: dev 08:40, sector 80832
> > > > > I/O error: dev 08:40, se@(protected)@(protected)@(protected)!?@
> > > > > GqA@(protected)`I/O
> > > > > HqA@(protected)@(protected)@(protected)
> > > > > I/Oherh____mv@`or 81576
> > > > > I/O error: dev `I/O0, sector 81824
> > > > > I/O error: dev 08:40, sector 82072
> > > > > I/O error: dev 08:40, sector 82320
> > > > > I/O error: dev 08:40, sector 82568
> > > > > I/O err@(protected)@(protected)@(protected)!?@
> > > > > HqA@(protected)`I/O
> > > > > IqA@(protected)@(protected)@(protected) I/Ohorh____mv@`: dev
> 08:40,
> > > > > sector 83312
> > > > > <4`I/OO error: dev 08:40, sector 83560
> > > > > I/O error: dev 08:40, sector 83808
> > > > > I/O error: dev 08:40, sector 84056
> > > > > Unable to handle kernel paging request at virtual address a0fb8488
> > > > > printing eip:
> > > > > c011f694
> > > > > *pde = 00000000
> > > > > Oops: 0000
> > > > > lp parport autofs tg3 floppy microcode keybdev mousedev hid input usb-ohci
> > > > > usbcore ext3 jbd raid1 raid0 lpfcdd mptscsih mptbase sd_mod scsi_mod
> > > > > CPU: -1041286984
> > > > > EIP: 0060:[ <c011f694 >] Not tainted
> > > > > EFLAGS: 00010087
> > > > >
> > > > > EIP is at do_page_fault [kernel] 0x54 (2.4.21-4.ELsmp)
> > > > > eax: f55ac544 ebx: f55ac544 ecx: a0fb8488 edx: e0b3c000
> > > > > esi: c1ef4000 edi: c011f640 ebp: 000000f0 esp: c1ef40c0
> > > > > ds: 0068 es: 0068 ss: 0068
> > > > > Process Dmu (pid: 0, stackpage=c1ef3000)
> > > > > Stack: 00000000 00000002 022c1008 c1eeee4c c1eff274 00000000 00000000 a0fb8488
> > > > > c17c4520 f58903f4 00000000 c1efd764 c1eee5fc f7fe53c4 00030001 00000000
> > > > > 00000002 022c100c c1efd780 c1eeba44 00000000 00000000 00000003 c1b968ec
> > > > > Call Trace: [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4178)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef419c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef41b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4278)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef429c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef42b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4378)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef439c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef43b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4478)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef449c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef44b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4578)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef459c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef45b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4678)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef469c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef46b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4778)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef479c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef47b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4878)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef489c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef48b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4978)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef499c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef49b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4a78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4a9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef4ab4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4b78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4b9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef4bb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4c78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4c9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef4cb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4d78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4d9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef4db4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4e78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4e9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef4eb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4f78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef4f9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef4fb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5078)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef509c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef50b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5178)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef519c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef51b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5278)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef529c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef52b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5378)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef539c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef53b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5478)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef549c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef54b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5578)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef559c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef55b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5678)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef569c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef56b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5778)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef579c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef57b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5878)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef589c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef58b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5978)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef599c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef59b4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5a78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5a9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef5ab4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5b78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5b9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef5bb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5c78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5c9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef5cb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5d78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5d9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef5db4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5e78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5e9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef5eb4)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5f78)
> > > > > [ <c011f640 >] do_page_fault [kernel] 0x0 (0xc1ef5f9c)
> > > > > [ <c011f694 >] do_page_fault [kernel] 0x54 (0xc1ef5fb4)
> > > > >
> > > > > Code: 8b 82 88 c4 47 c0 8b ba 84 c4 47 c0 01 f8 85 c0 0f 85 46 01
> > > > >
> > > > > Kernel panic: Fatal exception
> > > > >
> > > > > Any Ideas?
> > > > > Thanks.
> > > > > -Mark
> > > > >
> > > > >
> > > > > --
> > > > > redhat-list mailing list
> > > > > unsubscribe mailto:redhat-list-request@(protected)?subject=unsubscribe
> > > > > https://www.redhat.com/mailman/listinfo/redhat-list