explanation for the previously broken images.

PostPosted: Wed Mar 16, 2005 11:47 am
by slug
for the last couple of months, there were many broken images. a few hundred thousand images were affected.
these should all be fixed now, except for maybe 1000 (out of 25 million) which i'm still cleaning up one by one.

i really am sorry for the inconvenience, and admit the biggest problem was our lack of communication on the subject. our top priority though was fixing the problem.

the main reason we took so long to get them back online was caution. perhaps too much caution, but i would rather we took too long than made a mistake.

now everything looks good, we've finished moving to the new ISP, and now we can move forward again.

we just ordered another 16 terabytes of storage to replace our current primary copy of images. this new system will be far more reliable and should take much less time to manage.

-slug

for those interested, here's a long rambling explanation of what happened. it was a combination of a mistake in our backup process, an extremely rare hardware failure, and the fact that we were in the process of moving from one ISP to another.

we keep two copies of all images on live servers: the primary storage and a secondary mirror. we also do tape backups to LTO-2 tapes.
if something goes wrong with the primary storage, we can immediately start serving the images from the secondary servers until we get the primary back online.
both primary and secondary storage systems are RAID-5. this is a group of hard drives set up in such a way that one disk can fail with no loss of service or data. we just replace the failed disk and it's back to normal.
our primary storage also has 2 hot spare disks in a group of 28 disks.
on the failed volume, somehow 4 disks failed in a short period of time.
the first 2 used up the hot spares, the 3rd failed and we were still okay, but when the 4th disk failed that section of storage went offline.
theoretically, even this wouldn't be a big problem, because the secondary mirror is supposed to be up to date within an hour of the primary.
but unfortunately this was not the case, since we had already moved the secondary system to the new facility, and the two copies weren't as synchronized as they should have been.
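
for anyone who wants the RAID-5 part made concrete, here is a minimal sketch in Python. it is only an illustration, not how our storage controllers actually work: the names `xor_blocks` and `Raid5Group` are made up for the example, and the model assumes each rebuild onto a hot spare finishes before the next disk fails. the first half shows how a parity block lets one lost block be rebuilt, the second half replays the four failures described above.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# three data blocks plus one parity block, as if striped across four disks
data = [b"img1", b"img2", b"img3"]
parity = xor_blocks(data)

# lose the middle block: XOR of the survivors and the parity rebuilds it
rebuilt = xor_blocks([data[0], data[2], parity])


class Raid5Group:
    """Toy model of one RAID-5 group with hot spares (hypothetical names)."""

    def __init__(self, hot_spares):
        self.hot_spares = hot_spares
        self.degraded = False  # True while running with one disk missing
        self.online = True

    def fail_disk(self):
        if not self.online:
            return "offline"
        if self.hot_spares > 0:
            # assumption: the rebuild completes before the next failure
            self.hot_spares -= 1
            return "rebuilt onto hot spare"
        if not self.degraded:
            # parity covers exactly one missing disk
            self.degraded = True
            return "degraded, still online"
        # a second missing disk is more than parity can cover
        self.online = False
        return "offline"


group = Raid5Group(hot_spares=2)
events = [group.fail_disk() for _ in range(4)]
for number, status in enumerate(events, start=1):
    print(f"failure {number}: {status}")
```

the point of the toy model is that hot spares only buy extra failures if each rebuild has time to finish; once the spares are gone, the group can survive one more missing disk and no more.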

even though the raid5 had 2 failed disks (the two not covered by hot spares) and was offline, i didn't believe the data was really gone. but there was another volume on the same RAID controller that was still active. before i could work on the bad volume, i had to make sure there were two copies of the good volume elsewhere.

fortunately i had 3 terabytes of fresh storage to copy this to, but after weeks of debugging, the new storage turned out to have a bad raid controller card, which Dell eventually replaced. even with all the hardware working, it still literally takes days to copy terabytes to multiple places and verify that all the files really copied properly. right now we have multiple copies of about 90 million files which all need to be accounted for.
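
to give an idea of what "verify that all the files really copied properly" can involve, here is a rough sketch in Python. it is not the tooling we actually used; `file_digest` and `find_bad_copies` are hypothetical helpers, and SHA-256 is just one reasonable choice of checksum. it walks the source tree and reports every file that is missing from the copy or whose contents differ.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_copies(src_root, dst_root):
    """List relative paths that are missing or different under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    bad = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        # a copy is bad if it does not exist or its checksum differs
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            bad.append(rel.as_posix())
    return bad
```

calling `find_bad_copies("/primary", "/copy")` (both paths are placeholders) returns an empty list when every file checks out. every byte has to be read on both sides, which is a big part of why verifying terabytes takes days.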

once i had good copies of everything else, i could work on the failed RAID.
this went smoothly, but there were still file system errors which also took days to resolve.

this situation, while painful, turned out well in the end, and has cleaned up and strengthened our processes.

Re: explanation for the previously broken images.

PostPosted: Wed Mar 16, 2005 11:52 am
by srijith
slug wrote: i really am sorry for the inconvenience, and admit the biggest problem was our lack of communication on the subject. our top priority though was fixing the problem.

Thanks for the explanation Slug. I really hope the "biggest problem" can be properly handled once things settle down.

PostPosted: Wed Mar 16, 2005 1:02 pm
by clickaway
thanks for the explanation, slug.

this is very much appreciated.

Ray

PostPosted: Wed Mar 16, 2005 3:00 pm
by matiasasun
Thanks so much for all the details Slug.

Matias, Chile

PostPosted: Wed Mar 16, 2005 3:23 pm
by arjunrc
Slug,
This is all it takes to win back the loyalty of many many people who were getting frustrated about being kept in the dark.
I am very glad you decided to post this message - better late than never.

Now onto your technical explanation:
Your explanation for the broken images makes sense to me, but I am not sure if it relates to broken 'direct-links' (image is still there, direct-link is broken). In this situation, the url to the real image http://www.pbase.com/u13/mikaska.kokass ... a/1234.jpg (okay, it looked something like this ;-) ) worked fine but the http://www.pbase.com/arjunrc/image/1234.jpg did not. Was this a load balancer fault or something else? Just curious

regds
arjun

PostPosted: Wed Mar 16, 2005 5:10 pm
by ilanphoto
Murphy is the first thing that jumps to mind :lol:

what can go wrong will, and at the most inappropriate time

Thanks for the details, get some sleep and keep on going

PostPosted: Thu Mar 17, 2005 12:37 am
by sartobr
Slug .............. thanks for this post!!

PostPosted: Thu Mar 17, 2005 3:42 am
by sheila
Thanks Slug and Emily for your efforts. As one amongst the many old timers of PBase who waited patiently, I can say our wait paid off. Personally, I never had a problem (maybe luck) beyond a few "lost" thumbnails (which I reloaded - no worries). I think the biggest problem was the lack of communication. For a long time, there was no information forthcoming and a feeling that some folk felt they were abandoned. Maybe it's the world we live in of the "instant fix". Anyway, thanks for this site. It's been a boon to folk such as me who sell images online without going through the angst of their own website.

Cheers
Sheila

PostPosted: Thu Mar 17, 2005 9:29 pm
by masrawy
Simply... thank you for your effort.

Sami

PostPosted: Fri Mar 18, 2005 7:11 am
by gillettecraig
What hypocrisy. Are we supposed to believe you really want to thank them or just want to get your tag out there?

My daughter came home from school crying again because her teacher belittles her beliefs and encourages ridiculing the President. You people make me sick. The party of Kennedy, Kerry and the Klansman.

PostPosted: Fri Mar 18, 2005 7:26 pm
by jeanb
Thanks for the explanation Slug.

When things were looking icky and we didn't know if you were going/going/gone under, I must admit to looking around with a view to finding another site that I liked as well as this one. You know what? There isn't one.

Oh yes, just ignore the previous 2 entries, some people just don't get it.

PostPosted: Fri Mar 18, 2005 11:02 pm
by stormseye
Cool! The pics are back! Thanks!

I think Craig is reacting to Sami's "signature" which includes several lines that are political in nature and will show up whenever and wherever he posts in the forums.

Hey Sami - you might want to take note of this and keep that sort of thing in the Politics, Law & Religion forum. Obviously there will be those who take offense.

PostPosted: Sat Mar 19, 2005 3:33 pm
by athiker95
The message Slug, that hopefully you are getting, is not an explanation of the technical things behind the scenes but rather communication with your customers. I've been a Pbase user for a long time and have had relatively few problems, but I feel the personal touch has been lost over time. I realize that with so many customers, that can occur easily, but steps should be taken to rectify it. People don't give a whit about terabytes - they only care about being able to see their pictures. If the site is down or having problems, then a prompt message on the home page informs everybody - I'm not seeing that happen and that can't take more than about 10 seconds of your time. That said, you've done a great job with this web site and I commend your efforts, especially since I know very little of the actual demands that go on behind the scenes. However, none of that makes a bit of difference, if your customers are aggravated and left in the dark. Customer Service sucks big time in America and "sucks" is an understatement - you can be different than that with little effort.

Mark

PostPosted: Mon Mar 21, 2005 5:27 am
by duane_bolland
Slug, I appreciate your message and I'm glad to hear things are turning for the better. And as a techie, I'm always curious about the real nuts-n-bolts issues.

Now only if we could get you to upgrade your hardware. Dell is about as bottom-of-the-barrel as you can get. :shock:

PostPosted: Thu Mar 24, 2005 9:48 pm
by redtop
Thanks Slug for the hard and diligent work in getting PBase and the thumbnails "whole" again. As mentioned by many users, communication is vital, and I was glad to finally see a description of the many unexpected problems which combined to almost bring your system to its knees.

PBase seems to be running great again, and I hope that will continue for a long time. I feel you have the best photo site on the Web.

Keep it up.