Handle Boot Failure due to Failed Disk Device
Failure of a disk device is common in large-scale clusters. When a disk device holding mounted partitions fails, the Linux system will halt on boot.

1. Boot to single mode:
a) at the grub console
# 'ESC'
Select the proper kernel
# 'e'
Append 'single' to the kernel parameters
# 'b'
b) enter the repair mode and make the root filesystem writable (single mode often mounts it read-only)
# mount
# mount -o remount,rw /
c) edit /etc/fstab so the failed partitions are not mounted on boot
Comment out all entries except /, /boot and swap
d) reboot
# 'Ctrl' + 'Alt' + 'Del'
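Step 1c can be sketched as a one-liner. The snippet below works on a sample file for illustration (the /tmp paths and UUIDs are placeholders, not real values); on a live system, back up /etc/fstab and edit it in place instead.

```shell
# Sample fstab standing in for /etc/fstab (hypothetical UUIDs)
cat > /tmp/fstab.sample <<'EOF'
UUID=aaaa-1111  /       ext4  defaults  1 1
UUID=bbbb-2222  /boot   ext4  defaults  1 2
UUID=cccc-3333  swap    swap  defaults  0 0
UUID=dddd-4444  /data1  ext4  defaults  1 2
UUID=eeee-5555  /data2  ext4  defaults  1 2
EOF

# Comment out every entry whose mount point is not /, /boot or swap
awk '$2 == "/" || $2 == "/boot" || $2 == "swap" { print; next } { print "#" $0 }' \
    /tmp/fstab.sample > /tmp/fstab.edited
cat /tmp/fstab.edited
```

After this, only /, /boot and swap are mounted on the next boot, so the missing disk no longer blocks startup.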

2. Debug which disk device fails:
a) show which disk devices still work
# ls /dev/sd*
b) for each disk device that still appears
# badblocks -s -v 'disk-device'
While the check runs, watch which disk devices' LEDs keep blinking
c) pinpoint the failed disk device
Disk devices whose LEDs do not blink during the check have failed
d) shut down to replace the failed disk device
# poweroff
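The scan in step 2b can be scripted as a loop over the visible devices. The sketch below is a dry run that only echoes each command, since badblocks(8) needs root and real hardware; drop the echo to actually scan. The device names are hypothetical.

```shell
# Dry-run sketch: print the read-only badblocks scan for each device.
# -s shows progress, -v is verbose; both are real badblocks flags.
for dev in /dev/sda /dev/sdb /dev/sdc; do
    echo badblocks -s -v "$dev"
done
```

Run the scans one device at a time so that exactly one drive's LED is active, which makes the non-blinking (failed) drive easy to spot.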

3. Replace the failed disk device:
a) add the new disk device to the MegaRAID controller
Go to the physical device view and mark the new disk device as 'Unconfigured Good'
Go to the Configuration Wizard and add the new disk device to RAID
b) check the new disk device
# ls /dev/sd*
c) update /etc/fstab
Update the UUIDs of the entries that referenced the lost storage (blkid reports the new UUIDs)
Uncomment the entries commented out in step 1
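Step 3c can be sketched as follows, again on a sample file; the /tmp path, device name and UUIDs are placeholders. On a real system, read the new UUID with `blkid -s UUID -o value /dev/sdX` and edit /etc/fstab itself.

```shell
# One commented-out entry left over from step 1c (hypothetical UUID)
cat > /tmp/fstab.repair <<'EOF'
#UUID=dddd-4444  /data1  ext4  defaults  1 2
EOF

# On a real system: NEW_UUID=$(blkid -s UUID -o value /dev/sdX)
NEW_UUID=ffff-6666

# Substitute the new UUID and strip the leading comment marker
sed -i "s/dddd-4444/$NEW_UUID/; s/^#//" /tmp/fstab.repair
cat /tmp/fstab.repair
```

After a final reboot, the replacement disk mounts at the old mount point under its new UUID.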