Nov 14

Well, the upgrade to Intrepid went smoothly during the install process itself.  However, after a reboot the system hung partway through boot and dropped to an initramfs shell claiming “cannot find root device /dev/disk/by-uuid/50128bb8…” and “Gave up waiting for root device.”  Wonderful.  I tinkered around a bit and tried mounting the drives manually, only to find they were not listed in /dev.  Booting the old 2.6.24-18 kernel worked fine.  Aha, so it’s something related to the new kernel.  A quick search revealed the following bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/290153

Apparently on certain hardware the kernel has a bug which causes a long timeout on the SCSI/SATA bus.  It took a good 2-3 minutes on my system, but when I left it idle while reading the bug report on my desktop system, a bunch more lines flew by at the initramfs prompt about an ata bus reset and new drives being detected.  After that, a simple ‘exit’ from the prompt continued a normal boot.

It’s a fairly important bug, but at least a workaround exists.  I’ve tinkered with adding the ‘rootdelay’ option to my menu.lst but have not found the best value yet.  Maybe I’ll just leave it as is; my server almost never gets rebooted.  You’re instilling me with a lot of confidence doc, I mean Intrepid.  Definitely going to make a full backup of my desktop machine before attempting the upgrade on that one.
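For reference, a rough sketch of the workaround.  The commented kernel line shows where ‘rootdelay’ would go in menu.lst; the kernel version is left elided, the UUID is truncated as in the error message above, and the value of 180 seconds is only an assumption based on the 2-3 minute wait.  The helper name boot_rootdelay is made up for illustration; it just reports which value the running kernel was actually booted with.

```shell
# Illustrative menu.lst kernel line (version elided, 180 is an assumed value):
#   kernel /boot/vmlinuz-... root=UUID=50128bb8… ro rootdelay=180

# Hypothetical helper: print the rootdelay value from the kernel command line.
# Reads /proc/cmdline by default; another file can be passed in for testing.
boot_rootdelay() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each boot parameter on its own line, then keep only rootdelay's value.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^rootdelay=//p'
}
```

After a reboot, running boot_rootdelay with no arguments should print the configured number of seconds, or nothing at all if the option did not make it onto the command line.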

Nov 1

Finally!  After waiting some time for a solution to be released, Ibex will have kernel 2.6.26, which supports read-only bind mounts.  I ran into this as a rather serious security hole in my proposed system design some time ago, when I was trying to implement remote access to my data storage for myself and friends.  I figured some form of SSH-based access would be a good start, but I didn’t want to have any accounts directly open on my server or desktop machine.  Since building separate physical hardware just for this would be a waste of resources, I thought the best solution would be a virtual machine.  Since configuring NFS on it could potentially be another security hole (not to mention more overhead than needed), I knew a bind mount would be perfect: a read-only one, of course.  However, as I was testing it I quickly realized that read-only bind mounts weren’t actually read-only.  Thus, the problem.  I suppose since I keep most of my multimedia files marked immutable it wouldn’t be a real problem unless someone got root.  Still, I’d rather be safe than sorry.  More to follow about this later.

I read over a few postings to the kernel mailing list which addressed this last year, but it was still in the development phase then.  The solution the kernel architects created involved updating the VFS code, since all bind mounts are implemented in the VFS layer.  You can read more of the technical aspects over at lwn.net.
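For anyone wanting to try it, a minimal sketch of how a read-only bind mount is typically set up on a 2.6.26 or newer kernel: bind first, then remount that bind read-only.  The helper name ro_bind, the RUN dry-run variable and the example paths are all made up for illustration, and the real commands need root.

```shell
# Hypothetical helper: bind-mount SRC at DST, then remount the bind read-only.
# Set RUN=echo to print the commands instead of executing them (dry run).
ro_bind() {
    src="$1"
    dst="$2"
    # Step 1: plain bind mount (still read-write at this point).
    ${RUN:-} mount --bind "$src" "$dst" &&
    # Step 2: flip only this bind mount to read-only.
    ${RUN:-} mount -o remount,ro,bind "$dst"
}
```

A dry run such as RUN=echo ro_bind /srv/media /srv/vm-share (both paths are placeholders) shows the two mount commands without touching the system.  The original filesystem stays writable through its own mount point; only the bind is read-only.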

Oct 5

I was taking a mid-afternoon nap (yes, at 3 am, I work nights) and came back to my PC to see CPU usage on my server hovering around 15% instead of idling like usual.  A quick check revealed md0_raid5 and md0_resync running, which is normally not a good sign.

mdadm --detail /dev/md0 showed the following:

    Update Time : Sun Oct  5 03:22:19 2008
          State : clean, recovering
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
 Rebuild Status : 85% complete

Uh oh.  Why was the array rebuilding itself?  All drives were listed as active and working …  but did we experience a drive momentarily dropping from the array or a SATA device reset?  Was this a sign of impending hardware failure?  Tailing /var/log/messages displayed this useful piece of information:

Oct  5 01:06:01 rigel md: data-check of RAID array md0

Ok, so “data-check” doesn’t sound so worrisome.  A quick Google search revealed this nice gem:

root@rigel:~# tail /etc/cron.d/mdadm
# By default, run at 01:06 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
6 1 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet

Ah, so this is the first Sunday of the month and the check kicked off at 1:06 AM.  You tricksy Ubuntu.  Apparently a bug has been filed about the check causing performance issues on some boxes.  It’s a good idea to verify data integrity, although a slightly more obvious notice would be nice.
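The “first Sunday” trick in that cron line boils down to two tests: the day of the week is Sunday and the day of the month is 7 or less.  A small sketch of the same logic as a shell function, assuming GNU date (for the -d option); the name is_first_sunday is made up for illustration.

```shell
# Hypothetical helper: succeed if the given date (default: today) is the
# first Sunday of its month. Relies on GNU date's -d option.
is_first_sunday() {
    when="${1:-today}"
    dow=$(date -d "$when" +%u)   # ISO weekday: 1 = Monday ... 7 = Sunday
    dom=$(date -d "$when" +%d)   # day of month, zero-padded (01..31)
    # Strip one leading zero, then require Sunday within the first seven days.
    [ "$dow" -eq 7 ] && [ "${dom#0}" -le 7 ]
}
```

Something like is_first_sunday && echo "array check expected today" would give a heads-up that any md0_resync activity is probably just the scheduled check.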