for the last couple of months, there were many broken images. a few hundred thousand images were affected.
these should all be fixed now except for maybe 1,000 (out of 25 million), which i'm still cleaning up one by one.
i really am sorry for the inconvenience, and admit the biggest problem was our lack of communication on the subject. our top priority, though, was fixing the problem.
the main reason we took so long to get them back online was cautiousness. perhaps overly so, but i would rather we took too long than make a mistake.
now everything looks good, we've finished moving to the new ISP, and we can move forward again.
we just ordered another 16 terabytes of storage to replace our current primary copy of images. this new system will be far more reliable and should take much less time to manage.
-slug
for those interested, here's a long rambling explanation of what happened. it was a combination of a mistake in our backup process, an extremely rare hardware failure, and the fact that we were in the middle of moving from one ISP to another.
we keep two copies of all images on live servers: the primary storage and a secondary mirror. we also do backups to LTO-2 tape.
if something goes wrong with the primary storage, we can immediately start serving the images from the secondary servers until we get the primary back online.
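for the curious, the sync between primary and secondary is conceptually just an hourly mirror job. here's a simplified sketch of that idea; the hostnames, paths, and tooling below are made up for illustration, not our actual script:

import subprocess

PRIMARY_PATH = "/mnt/primary/images/"           # hypothetical path
SECONDARY = "mirror01:/mnt/secondary/images/"   # hypothetical host:path

def sync_images():
    # -a preserves timestamps/permissions; --partial lets an interrupted
    # transfer resume on the next hourly run instead of starting over
    result = subprocess.run(
        ["rsync", "-a", "--partial", PRIMARY_PATH, SECONDARY],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # the real job would page someone rather than just print
        print("sync failed:", result.stderr)

if __name__ == "__main__":
    sync_images()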
both primary and secondary storage systems are RAID-5. this is a group of hard drives set up in such a way that one disk can fail with no loss of service or data. we just replace the failed disk and it's back to normal.
our primary storage also has 2 hot spare disks in a group of 28 disks.
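if you haven't seen how RAID-5 pulls that off: each stripe carries a parity block that is the XOR of the data blocks, so any single missing block can be rebuilt from the rest. a toy python sketch of the idea (nothing to do with the actual controller firmware):

from functools import reduce

def parity(blocks):
    # XOR all the data blocks together byte by byte to get the parity block
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity_block):
    # XOR the survivors with the parity block to recover the one lost block
    return parity(surviving_blocks + [parity_block])

stripe = [b"imagedata1", b"imagedata2", b"imagedata3"]  # pretend disk blocks
p = parity(stripe)

lost = stripe.pop(1)               # one disk dies
assert rebuild(stripe, p) == lost  # its block comes back from the others + parity

lose a second block before the rebuild finishes, though, and that stripe is gone, which is why the hot spares and the secondary mirror matter.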
on the failed volume, somehow 4 disks failed in a short period of time.
the first 2 failures used up the hot spares, after the 3rd the array was degraded but still okay, but when the 4th disk failed that section of storage went offline.
theoretically, even this wouldn't be a big problem, because the secondary mirror is supposed to be up to date within an hour of the primary.
but unfortunately this was not the case: we had already moved the secondary system to the new facility, and the two weren't as synchronized as they should have been.
even though the RAID-5 had 2 failed disks and was offline, i didn't believe the data was really gone. but on the same RAID controller there was another volume that was still active. before i could work on the bad volume, i had to make sure there were two copies of the good volume elsewhere.
fortunately i had 3 terabytes of fresh storage to copy this to, but after weeks of debugging, it turned out to have a bad RAID controller card, which Dell eventually replaced. even with all the hardware working, it still literally takes days to copy terabytes to multiple places and verify that all the files really copied properly. right now we have multiple copies of about 90 million files which all need to be accounted for.
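the "verify" part is the slow bit. it boils down to building a checksum manifest of each copy and diffing them, roughly like this (simplified; the real thing is split up and run in parallel, and the paths are invented):

import hashlib
import os

def manifest(root):
    # walk one copy of the image store and record an md5 for every file
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums

def compare(primary_root, copy_root):
    # files missing from the copy, and files whose contents don't match
    a, b = manifest(primary_root), manifest(copy_root)
    missing = sorted(set(a) - set(b))
    mismatched = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return missing, mismatched

# missing, bad = compare("/mnt/primary/images", "/mnt/copy1/images")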
once i had good copies of everything else, i could work on the failed RAID.
this went smoothly, but there were still file system errors which also took days to resolve.
this situation, while painful, turned out well in the end, and has cleaned up and strengthened our processes.