Update 18-Dec-2007 01:50 UTC
Finally. After what seemed like an eternity, the filesystem repair process completed successfully. It was tempting to take a chance and try to put things back online immediately, but fortunately the engineers at NetApp insisted that we carefully check and repair everything first to avoid data loss.
Everything should be back online and normal with two kinds of exceptions.
1. Anyone who was uploading at the time the problem occurred on Saturday afternoon might have some broken images. If you could see your photos after uploading, then they should be fine. If they seemed broken immediately, then I can't fix them since they were never even written to the disk here. (If there are any instances of this problem, I think there will be very few.)
2. Anyone who used the regenerate thumbnail feature during the last day will have no thumbnails. I should have just disabled the tool during that time. The effect is that you won't have thumbnails right now and you'll see the original where you would expect a thumbnail. I'm running a script now to go through and find all these cases and fix them.
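For the curious, here is a minimal sketch of the "find" half of a cleanup pass like the one described in item 2. It is not the actual script. It assumes a hypothetical layout where each original has a thumbnail with the same filename in a separate directory; the function name and both directory arguments are made up for illustration. It only lists images whose thumbnail is missing or empty, and leaves the regeneration step out.

```python
import os


def find_missing_thumbnails(originals_dir, thumbs_dir):
    """Return sorted filenames of originals whose thumbnail is missing or empty.

    Assumes (hypothetically) that a thumbnail shares its original's filename
    and lives in thumbs_dir.
    """
    missing = []
    for name in sorted(os.listdir(originals_dir)):
        original_path = os.path.join(originals_dir, name)
        if not os.path.isfile(original_path):
            continue  # skip subdirectories and anything that is not a regular file
        thumb_path = os.path.join(thumbs_dir, name)
        # A thumbnail counts as missing if the file is absent or zero bytes long.
        if not os.path.isfile(thumb_path) or os.path.getsize(thumb_path) == 0:
            missing.append(name)
    return missing
```

The returned list could then be fed to whatever tool regenerates the thumbnails.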
If tomorrow you're still seeing any broken images or gigantic thumbnails that the automated process I'm running now should have fixed, send me an email at slug@pbase.com and I'll take care of them.
I have new equipment here that I'm finishing configuring. These machines will allow us to keep serving all the photos immediately in the event that the main storage needs a repair like this again. Some images during the last two days were in fact coming from these machines, but I didn't have them completely ready yet. Soon.
Really sorry for all the trouble. I know it's painful to have visitors to your site find broken images.
-Chuck Neel
slug@pbase.com
==========
Original post from earlier before the fix.
==========
Here's an update on the progress of fixing the problem of broken images.
I've been working non-stop with the engineers at NetApp to resolve this problem.
Until now, I haven't had a good idea of how long the repair would take, since the NetApp guys haven't been able to predict it due to the large size of the system. Now I'm hoping the process will be complete within 8 hours, but I can't be sure since I haven't had to go through this before.
On Saturday, we lost 3 disks simultaneously in our main storage system, which runs on NetApp hardware. This caused an 8 Terabyte volume to have some inconsistencies that have to be analyzed and repaired before we can put the volume back online. Fortunately, the cause of the problem is something NetApp understands, and they've provided updated firmware for the disk shelves to correct the bug responsible.
Right now we're just waiting for the filesystem analysis program to complete. At that time, we'll be able to bring the volume back online and all of your photos will display properly.
While this has been a painful process, and I apologize for the disruption, the NetApp system is designed to recover from such failures. Also, their support team, while not cheap, is amazingly responsive and doesn't hesitate to send parts immediately, or spend hours of time on the phone walking us through all the steps to recover.
Fortunately, with the exception of a couple of hours on Saturday, new uploads, direct linking, and the majority of the site have been working as usual.
I wish the recovery process could have gone faster, but after a problem with the filesystem, it's important to analyze it carefully so we can be sure everything is healthy.
Chuck Neel
slug@pbase.com