October « 2009 « NinerNet Communications™ (System Status)

Description of issues surrounding hardware failure on server NC018

21 October 2009 10:25:16 +0000

(This post has been edited to correct the date of the outage from September to October. Apologies for any confusion.)

Description of the problem

On 20 October at approximately 19:37 UTC, server NC018 — on which most clients and services are hosted — became unresponsive. This was immediately noticed by automated monitoring systems in the data centre, and staff at the data centre went to physically check the server. As a result, the server was rebooted. However, the server again became unresponsive a few minutes later, and was again rebooted. After the third attempt at rebooting, it became clear that the cause of the problem was likely a hardware issue. (Malicious activity was ruled out fairly quickly based on network activity.)

What was done to fix the problem?

The server was then taken offline and examined with the intent of replacing the processor and memory as likely causes of the problem. During the examination and replacement it became clear that two memory slots on the motherboard had failed. As a result, the motherboard, the memory and the power supply were replaced with new parts. The server was then placed back on the network, powered up, and has been up since 21:57 UTC.

What was done to inform clients?

We have a number of mechanisms in place to communicate with clients, specifically on operational issues such as this one:

This status page,
Email,
External “microblogging” services Twitter and Identi.ca,
Prominent notices which can be posted on the NinerNet website, and
Telephone.

As server NC018 is our primary server, options 1 (status page), 2 (email) and 4 (website notices) were not available; option 3 (Twitter and Identi.ca) was not used soon enough; option 5 (phoning clients) would only be used as a last resort in the event of a significant and extended outage of some sort. However, we did, of course, answer or return phone calls that we received.

Please see below for what we intend to do to improve communication during events like this.

What we already do to prevent this from happening

We already take a number of proactive steps to do our best to prevent this sort of outage from occurring, but it’s simply inevitable that these sorts of events occur. However, that doesn’t mean we can’t do more, and it doesn’t mean we don’t learn from any event that does happen. This is what we already do:

Data centre: Perhaps the most important thing we do is physically locate our servers within high-quality data centres — high quality in terms of the physical infrastructure, the network infrastructure, and the staff operating it. In the case of server NC018, we use a data centre operated by a company called Rackspace. (You can read about Rackspace and their data centres on their website.) Rackspace has been in business since 1998 and is the gold standard by which other companies operating data centres are judged. This kind of reputation comes at a price — measured in dollars — but at every turn they have proven that they earn this premium. Put another way, NinerNet does not entrust any part of the infrastructure supporting your online business or identity to the lowest bidder. With some other data centres, we’d be making this post a week from now, long after there was no longer any point in doing so.
Real time service monitoring: The server and the services running on it are monitored 24/7/365, by staff at the data centre, by NinerNet, and by you. In the event that a service goes down for any reason, we have found that we and the data centre are aware of the problem almost simultaneously. Sometimes we are able to solve the issue ourselves within minutes, and other times we need someone at the data centre to solve the issue. The latter is obviously the case when the entire server becomes unresponsive, as happened this time.
Maintenance and pre-emptive monitoring: Servers generate voluminous amounts of data letting a server administrator know what is going on on the server and whether or not there may be any problems that need attention. We monitor this information every day and, if necessary, we take steps to deal with issues before they become a problem. Such monitoring leads to maintenance, all of which (at the server level) is documented on this status page. Monitoring of security alerts put out by security organisations and software vendors also leads to maintenance.
Back-ups: We back up your databases locally and to remote servers every day. Databases usually contain information that would be difficult for you to keep locally, as they are usually being updated throughout the day and night. If something happens, you’ll need a copy of your database to set everything up again. We also provide self-managed back-up tools in the control panel so that you can also back up other items, such as the files that constitute your website and mail spools.

This is only an overview of a few of the many activities we undertake to ensure that you shouldn’t have to worry about your hosting. Within and in addition to each of these brief summaries are other more minor activities and considerations that all dovetail to provide you with the best hosting experience and value for your money.

Could this particular problem have been prevented?

It’s possible, yes. Almost a month ago we had a similar problem with this server, but rebooting it once brought the server back online with no apparent problems. We and staff at the data centre tried to find the cause, but there was nothing to point to one. Had it occurred to someone at the time that the problem might have been caused by hardware, we’d have had to take the server offline anyway just to investigate, never mind fix the problem. With no indication of a hardware problem — even had the thought occurred to someone — there was no reason to take the server offline just because it might have led to a eureka moment. Doing so might have resulted in the same amount of downtime — albeit at a more convenient time — but with the chance that nothing would have been found thereby making the downtime unnecessary.

What more will we do to prevent this from happening?

All of that said, there is more we can do:

Hardware: We were already consulting with the data centre last Friday to phase in a replacement for this server. (Please note that, although the data centre is not ours, the server itself is, and is 100% under the control of NinerNet.) We don’t have an exact date for this replacement yet because it depends on the next release of the software that drives the web-based control panel that we use. However, Rackspace anticipate completing their extensive testing of this software within about a month, and we anticipate migrating to the new hardware about a month after that. (This is another reason we choose to use Rackspace. While we also do our own testing of the control panel in advance of implementing it, the resources available to Rackspace exceed our own.) In this case, while the faulty hardware has now been replaced, a new server would obviously address ageing hardware issues.
Communication: We can do a better job of communicating:
- We will mirror this status page on a secondary server to ensure that it is available even if one server goes down.
- We will set up a secondary mass-email system so that we can send out notices of serious incidents like this one.
- We will ensure that we post notices on Twitter and Identi.ca in a more timely fashion, and we suggest that you subscribe to one of these feeds.
- We will also mirror the main NinerNet website on a secondary server so that you are able to see any notices posted there.
Setting up these items will take some time, especially the last item, but we will report on our progress. Communication is something we’ve given considerable thought recently, but it’s important for us to find a balance between overwhelming you with “spam” and being so quiet that you forget who we are. Right now I think we’re on the latter end of the scale, but we definitely don’t want to be on the other end.
Redundancy: While we do have some redundancy built into our domain name system, we will improve this to ensure that other servers that rely on the primary nameservers on NC018 continue to operate even if there are problems on NC018.

How do we compare?

We’re in good company when it comes to downtime. Even the biggest operations that spend the equivalent of the GDP of small countries on infrastructure have their problems. Causes include mechanical failure, human error, human malice, weather, software bugs and even real live bugs! (In fact, an old urban legend says that the term “bug”, as applied to software, came from the discovery of an actual bug — a moth — in the workings of a computer that was generating errors.) Twitter has become one of the most well-known online brands in the last couple of years, while being notorious for downtime — ironic, considering we use it, but we use a second service for a reason. (Of course, there you get what you pay for.) Fifty-five million people in the eastern United States and Canada suffered between several hours and two days of electrical downtime in 2003. Google, Amazon, YouTube, Barclays Bank, MySpace, Facebook, PayPal, Microsoft, eBay — the list goes on — are all among a list of big-name companies that have all experienced news-breaking downtime measured in hours, not just minutes. (Just do a web search for the word “outage” and the name of any big company.) Have a Blackberry? Do you realise that all Blackberry emails in the whole world go through one data centre in central Canada, and if that data centre has a problem, you can still use your Blackberry for a paperweight? Nobody is immune; nobody gets away unscathed. (For some relevant articles and company names, see the links at the end of this post… if you’re curious and have the time.)

In all seriousness, the point is not to deflect attention from our own problem yesterday or to say that there’s no point in trying to prevent downtime because it will happen anyway, even to those who spend lots of time and money trying to outrun it. The point is that, despite the best of intentions, these things happen, and they happen whether you’re hosting with a small hosting company or a large hosting company. I think what differentiates NinerNet from our competition is that we realise you’ve trusted us with something important to you, we know you pay us a little more than the bargain basement hosting companies, and we treat you accordingly.

This is our first major service interruption since January 2005. Nobody ever wants to be down for even a few seconds, but the reality is… “stuff” happens. We do our very best to make sure it doesn’t, but it does, and we appreciate your patience and understanding when it does.

For the record, here is a log of our uptime for August, September and October (the latter extrapolated to a full month):

August: 100%
September: 99.98%
October: 99.69%

September falls within acceptable guidelines; October does not.

You can read more about uptime on the HostMySite.ca website.

Conclusion

I hope I’ve given you enough information to fully understand what went wrong yesterday, why it went wrong, how we’ll work diligently to try to ensure it doesn’t happen again, and what we’ll do to keep you informed when the inevitable happens. We know how you feel when things don’t work; after all, we were down too.

If you have absolutely any questions, concerns, comments or even brickbats, please contact us to let us know. Thank-you.

Craig Hartnett

Appendix: Articles chronicling major outages

Here are some articles to back up the statements above about high-profile outages (and to put things in perspective), in case you’re curious, starting with the most recent (just over a week ago) and going back to 2008. The quotes (to which I’ve added emphasis of a few points) give a good overview of the thrust of the articles:

From Sidekick to Gmail: A short history of cloud computing outages (Network World, 2009-10-12)
- “This past week’s Microsoft-T-Mobile-Sidekick data loss mess is the latest in a string of high profile cloud computing outages that have grabbed headlines over the past couple of years.”
Hitachi implicated in Sidekick outage (IT Knowledge Exchange, 2009-10-12)
- “News broke this morning of an outage for users of the Sidekick mobile smartphone, in which T-Mobile warned users of the device not to power down their phones, or personal data would be irretrievably lost thanks to a server outage at Danger, a Microsoft subsidiary that supports the Sidekick.”
Rethinking Gmail: Reliability Matters (Linux Magazine, 2009-09-25)
- “Having worked in a data center, I have some sympathy for a company when there’s some downtime. Murphy’s Law can strike with a vengance [sic], and I’m convinced that there is no such thing as enough redundancy to fully prevent service outages completely. What’s important is that those outages are minimal, managed well, and learned from.”
Outages Happen, Support Matters! (No More Servers, 2009-09-15)
- “While this was our first [outage] in years, we are very disappointed to have inconvenienced our customers. We fixed the problems … and are working hard to return to our decade long history of consistent uptime. In the past year, there have been outages at colocation provider Equinix, CRM provider Salesforce.com, online retailer eBay, cloud hosting provider Amazon, and yet the most recent outage at Gmail last week got more media attention than all combined. … While it is important to know the uptime history of a company, the more important point and question should be how each company responds to outages. Because, just like number two, ‘outages happen.’ During an outage, customers are rightly concerned, frustrated, and upset. The fact that outages happen is of no consolation while the outage is current. What does matter in the moment, however, is how much information is supplied to explain what is happening and when it will be fixed. After resolution, what matters is explaining what caused the problem, what will be done to prevent the same problem in the future, and how will they be compensated for their troubles.”
Google mail suffers widespread outage (Canada.com, 2009-09-01)
- “By around 5:20 p.m. edt (9:20 GMT), many U.S. users said they were able to use their Gmail service again, but the outage lasted for over an hour. It appeared to affect users around the world, with people from England, Italy, Singapore and South Africa writing to the company’s support site to report problems.”
Storage reliability questioned after high profile outages (Periscope Information Technologies, 2009-08-11)
- “Internet service provider Primus, which is based in Australia, suffered several hours of downtime as a result of a sub-station fault which prevented a back-up generator from starting.”
PayPal Users Hit by Global Service Outage (The Wall Street Journal, 2009-08-04)
- “EBay Inc.’s PayPal unit experienced a world-wide system outage on Monday, leaving millions of customers of the online payments service temporarily unable to complete transactions. The outage began around 1:30 p.m. EDT, and affected all PayPal customers for about an hour, said spokesman Anuj Nayar. By about 6:30 p.m. EDT, service had been restored to all customers, the company said.”
The Day After: A Brutal Week for Uptime (Data Center Knowledge, 2009-07-06)
- “Last week was a brutal one for the data center industry, with high-profile outages for several companies with strong reputations for uptime, and a fire at a data center complex that raised tough questions about redundancy and responsiveness of a number of high-profile sites. … Failover to Backup: Not Always Simple: Seattle real estate site Redfin emerged from the Fisher Plaza incident as the poster child for thinking ahead. ‘We … basically instituted a disaster avoidance plan where we had redundant-everything for our mission-critical databases, servers and networks in separate buildings.’ … Redfin was back online by 4 a.m., about 5 hours after the fire. Not so for payment gateway Authorize.net, arguably a far more critical service in light of its impact on e-commerce. Authorize.net said that it had a backup facility, but it was not able to switch over in a timely fashion due to a ‘perfect storm’ of challenges.”
Disc array fault blamed for Barclays online breakdown (Finextra Research, 2009-06-17)
- “The hardware failure left customers unable to withdraw cash at about 1500 ATMs in the South of England between around 1:00pm and 4:30 pm. Online and phone bankers in the South were also unable to conduct transactions, while a small number of customers experienced difficulty using their cards to make payments in shops. … one disgruntled user contacted the BBC to vent his anger after his card was rejected as he was trying to buy a Ferrari.”
Frustration, distress over Google outage (AlertProfs.com, 2009-05-19)
- “Millions of people got a taste of life without Google on Thursday after its search engine, e-mail and other products slowed or became inaccessible because of a glitch. The outage, which lasted nearly two hours, prompted a wave of distress and frustration among users, highlighting just how dependent they are on the Internet giant’s ubiquitous services. It was also an embarrassing failure for Google, which casts itself as a technological dynamo and markets many of its products for their dependability.”
When Gmail Fails, Users Adapt (The GigaOM Network, 2009-02-24)
- “… we can face a failure and move on to the next tool in our arsenal with only a few minutes of complaining. … Think about what you will accept from a cell phone in terms of lost connections and dropped calls. Reliability is not keeping us tethered to our landlines by any stretch of the imagination.”
Headaches Continue at Gmail, New Outages Reported (Webmonkey [Wired], 2008-10-20)
- “Though Monday’s problems are causing headaches, they don’t yet add up to last week’s trauma. Last Thursday night, Google suffered a high-profile outage to its Google Apps Premiere Edition (GAPE) service, which unlike free Gmail, is a paid service aimed at businesses. For some, the outage lasted over 24 hours.“
Why Amazon Went Down, and Why It Matters (The GigaOM Network, 2008-06-06)
- “Amazon.com’s U.S. retail site became unavailable around 10:25 AM PST, and now appears to be back up. Amazon’s not naming names — all that director of strategic communications Craig Berman would say was that: ‘Amazon’s systems are very complex and on rare occasions, despite our best efforts, they may experience problems.'”
Breaking: Amazon Down! Anyone Know What’s Up? (The GigaOM Network, 2008-06-06)
- “‘The Amazon retail site was down for approximately 2 hours earlier today (beginning around 10:25) — and we’re bringing the site back up.'”
Amazon computing Web service suffers glitch (CNET News, 2008-04-07)
- “Amazon.com’s Elastic Compute Cloud Web service was knocked offline earlier Monday, but the company appeared to get it back online within a few hours. … It’s a far milder outage than the one that occurred in February, when Amazon’s Simple Storage Service went down, which appeared to affect hundreds of Web sites.”
Amazon storage ‘cloud’ service goes dark, ruffles Web 2.0 feathers (CNET News, 2008-02-15)
- “The service was restored a few hours later, according to an Amazon technician. The first forum posting was timed at 5 a.m. PT, and the service was back up at just past 9 a.m.”
BlackBerry suffers massive outage (The Globe and Mail, 2008-02-12)
- “Millions of BlackBerry users were cut off from their wireless lifelines Monday when a massive server outage caused the popular handheld devices to fail across North America. Research in Motion, the maker of the BlackBerry, issued a brief statement confirming the problem began at approximately 3:30 p.m. and apologizing to customers. But it was not immediately clear what caused the outage, and service was restored approximately three hours later. … Concerns were raised by analysts that it could happen again, but RIM co-chief executive officer Jim Balsillie told Reuters news agency at the time that such outages were ‘rare.’“

Here is a general article covering outages and downtime:

Five Nines on the Net is a Pipe Dream (The GigaOM Network, 2008-07-06)
- “We’ve written about how hard it is to create a 99.999 percent up time championed by the telecommunications industry, but suffice to say there are a ton of moving parts involved in keeping a site visible to the end users. … Along the way there are software upgrades, server shortages, DNS issues, cut cables, corporate firewalls, carriers throttling traffic and infected machines. … But it will never be possible to keep all sites across the entire web up 99.999 percent of the time.“

Tags: hardware
Categories: Administration, Incidents, NC018
Comments Off

System	Status	Ping
NC023	Internal	Up?
NC028	Internal	Up?
NC031	Internal	Up?
NC033	Operational	Up?
NC034	Internal	Up?
NC035	Operational	Up?
NC036	Operational	Up?
NC040	Internal	Up?
NC041	Operational	Up?
NC042	Operational	Up?

Server and System Status