Consistent software RAID disk errors on nVidia MCP51

My brand-new pair of disks, in a setup that would otherwise work perfectly fine, decided to keep giving me headaches. There are only two disks in the machine, each partitioned in three, with the first two pairs of partitions mirrored as RAID1. Every now and then one of the disks would disappear from the system, then reappear under a different device name (e.g. sda would become sdc). Obviously the disk would get kicked out of the RAID, but I could then add it back and wait for the 1.5TB partition to resync. The errors look like this:


Oct 29 03:44:59 lion kernel: ata1.00: n_sectors mismatch 3907029168 != 268435455
Oct 29 03:44:59 lion kernel: ata1.00: revalidation failed (errno=-19)
Oct 29 03:44:59 lion kernel: ata1.00: limiting speed to UDMA/133:PIO2
Oct 29 03:44:59 lion kernel: ata1: hard resetting link
Oct 29 03:44:59 lion kernel: ata1: nv: skipping hardreset on occupied port
Oct 29 03:45:00 lion kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 29 03:45:00 lion kernel: ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Oct 29 03:45:00 lion kernel: ata1.00: revalidation failed (errno=-5)
Oct 29 03:45:00 lion kernel: ata1.00: disabled
Oct 29 03:45:00 lion kernel: sd 0:0:0:0: rejecting I/O to offline device
Oct 29 03:45:00 lion kernel: sd 0:0:0:0: rejecting I/O to offline device
Oct 29 03:45:00 lion kernel: sd 0:0:0:0: rejecting I/O to offline device
Oct 29 03:45:00 lion kernel: end_request: I/O error, dev sda, sector 206856
Oct 29 03:45:00 lion kernel: md: super_written gets error=-5, uptodate=0
Oct 29 03:45:00 lion kernel: md/raid1:md127: Disk failure on sda2, disabling device.
Oct 29 03:45:00 lion kernel: <1>md/raid1:md127: Operation continuing on 1 devices.
Oct 29 03:45:00 lion kernel: ata1: hard resetting link
Oct 29 03:45:00 lion kernel: RAID1 conf printout:
Oct 29 03:45:00 lion kernel: --- wd:1 rd:2
Oct 29 03:45:00 lion kernel: disk 0, wo:1, o:0, dev:sda2
Oct 29 03:45:00 lion kernel: disk 1, wo:0, o:1, dev:sdb2
Oct 29 03:45:00 lion kernel: RAID1 conf printout:
Oct 29 03:45:00 lion kernel: --- wd:1 rd:2
Oct 29 03:45:00 lion kernel: disk 1, wo:0, o:1, dev:sdb2
Oct 29 03:45:01 lion kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 29 03:45:01 lion kernel: ata1.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
Oct 29 03:45:01 lion kernel: ata1.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 0/32)
Oct 29 03:45:01 lion kernel: ata1.00: configured for UDMA/133
Oct 29 03:45:01 lion kernel: ata1: EH complete
Oct 29 03:45:01 lion kernel: ata1.00: detaching (SCSI 0:0:0:0)
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: [sda] Stopping disk
Oct 29 03:45:01 lion kernel: ata2.00: configured for UDMA/133
Oct 29 03:45:01 lion kernel: ata2: EH complete
Oct 29 03:45:01 lion kernel: scsi 0:0:0:0: Direct-Access ATA WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: [sdd] Write Protect is off
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Oct 29 03:45:01 lion kernel: sd 0:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 29 03:45:02 lion kernel: ata3.00: configured for UDMA/133
Oct 29 03:45:02 lion kernel: ata3: EH complete
Oct 29 03:45:02 lion kernel: EXT4-fs (dm-4): re-mounted. Opts: acl,user_xattr,commit=0
Oct 29 03:45:02 lion kernel: sdd: sdd1 sdd2 sdd3
Oct 29 03:45:02 lion kernel: sd 0:0:0:0: [sdd] Attached SCSI disk
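
For reference, here is a rough sketch of the "add it back" step after an event like the one logged above. The array and partition names (md127, sdd2) are taken from this particular log and will differ from one incident to the next. The two helper functions are purely illustrative names of my own, not part of mdadm.

```shell
#!/bin/sh
# Print the name of every md array that mdstat reports as degraded,
# i.e. whose member map (such as [_U]) contains an underscore.
# Reads /proc/mdstat unless another file is given as the first argument.
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}

# Drop the vanished member and add the disk back under its new name.
# Must be run as root; nothing is executed until the function is called.
readd_member() {
    array=$1
    member=$2
    # The old device node is gone, so ask mdadm to remove detached members.
    mdadm "$array" --remove detached
    # Add the partition under the name the kernel gave it after the reset.
    mdadm "$array" --add "$member"
}
```

With these defined, `degraded_arrays` lists the arrays that lost a member, `readd_member /dev/md127 /dev/sdd2` puts the disk back, and `cat /proc/mdstat` shows the resync progress.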


I found a few similar issues in Red Hat's Bugzilla and in the Ubuntu forums.

I even tried their advice. Adding sata_nv.swncq=0 to my kernel command line did not help at all. Replacing the SATA cables with quality ones didn't help either. Then I forced the SATA link speed to 1.5G (together with noncq, via libata.force), and I have now been running for more than a week with no disk problems. My kernel command line currently looks like this, and everything is happy:
$ cat /proc/cmdline
BOOT_IMAGE=/kernel-genkernel-x86_64-2.6.36-y4 root=/dev/mapper/lionvg-root ro dolvm domdadm nmi_watchdog=0 max_loop=32 sata_nv.swncq=0 libata.force=noncq,1.5G
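
To confirm that the forced speed actually took effect after a reboot, one option is to look at the "SATA link up" lines in the kernel log, which report the negotiated speed per port (3.0 Gbps in the log above). The small filter below is only a sketch; the function name is my own invention, and it assumes the message format shown in the log from this machine.

```shell
#!/bin/sh
# Read kernel log text on stdin and print "<port> <speed in Gbps>"
# for every "SATA link up" message found.
sata_link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]*\) Gbps.*/\1 \2/p'
}
```

Running `dmesg | sata_link_speeds` should then show 1.5 for each port once the option is active. As far as I understand the libata.force syntax, a value can also be restricted to one port by prefixing its number (for example `libata.force=1:1.5G`), but I have not tried that here.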

Comments

  1. If this resolves my issue with my mcp51 raid6 issues on Ubuntu 12.04 server I am going to owe you a beer. I almost ordered a replacement mobo today, fingers crossed after this change.
