Server NC020 was rebooted at 14:34 UTC after it was reconfigured with more RAM.
(This post has been edited to correct the date of the outage from September to October. Apologies for any confusion.)
On 20 October at approximately 19:37 UTC, server NC018, on which most clients and services are hosted, became unresponsive. Automated monitoring systems at the data centre flagged the problem immediately, and data-centre staff went to check the server physically and rebooted it. However, the server became unresponsive again a few minutes later and was rebooted a second time. After the third reboot it became clear that the cause was most likely a hardware issue. (Malicious activity was ruled out fairly quickly based on network activity.)
The server was then taken offline and examined with the intent of replacing the processor and memory as likely causes of the problem. During the examination and replacement it became clear that two memory slots on the motherboard had failed. As a result, the motherboard, the memory and the power supply were replaced with new parts. The server was then placed back on the network, powered up, and has been up since 21:57 UTC.
We have a number of mechanisms in place to communicate with clients, specifically on operational issues such as this one:

1. The status page.
2. Email.
3. Twitter and Identi.ca.
4. Notices on our website.
5. Phoning clients directly.
As server NC018 is our primary server, options 1 (status page), 2 (email) and 4 (website notices) were not available; option 3 (Twitter and Identi.ca) was not used soon enough; option 5 (phoning clients) would only be used as a last resort in the event of a significant and extended outage of some sort. However, we did, of course, answer or return phone calls that we received.
Please see below for what we intend to do to improve communication during events like this.
We already take a number of proactive steps to prevent this sort of outage from occurring, but events like this are, to some degree, inevitable. That doesn’t mean we can’t do more, and it doesn’t mean we don’t learn from every event that does happen. This is what we already do:
This is only an overview of a few of the many activities we undertake to ensure that you shouldn’t have to worry about your hosting. Within and alongside each of these brief summaries are other, smaller activities and considerations that all dovetail to provide you with the best hosting experience and value for your money.
It’s possible, yes. Almost a month ago we had a similar problem with this server, but a single reboot brought it back online with no apparent problems. We and staff at the data centre tried to find the cause, but nothing pointed to one. Even if someone had suspected hardware at the time, we would have had to take the server offline just to investigate, never mind fix anything. With no indication of a hardware problem, there was no justification for taking the server down on the off chance that doing so would lead to a eureka moment. It might have resulted in the same amount of downtime, albeit at a more convenient time, with a real chance that nothing would have been found and the downtime would have been for nothing.
All of that said, there is more we can do:
Setting up these items will take some time, especially the last item, but we will report on our progress. Communication is something we’ve given considerable thought recently, but it’s important for us to find a balance between overwhelming you with “spam” and being so quiet that you forget who we are. Right now I think we’re on the latter end of the scale, but we definitely don’t want to be on the other end.
We’re in good company when it comes to downtime. Even the biggest operations, the ones that spend the equivalent of the GDP of small countries on infrastructure, have their problems. Causes include mechanical failure, human error, human malice, weather, software bugs and even real live bugs! (In fact, an old urban legend says that the term “bug”, as applied to software, came from the discovery of an actual bug, a moth, in the workings of a computer that was generating errors.)

Twitter has become one of the most well-known online brands in the last couple of years while remaining notorious for downtime. That’s ironic, considering we use it, but it’s also why we use a second service. (Of course, there you get what you pay for.) Fifty-five million people in the eastern United States and Canada suffered between several hours and two days of electrical downtime in 2003. Google, Amazon, YouTube, Barclays Bank, MySpace, Facebook, PayPal, Microsoft, eBay, and the list goes on: all are big-name companies that have experienced news-breaking downtime measured in hours, not just minutes. (Just do a web search for the word “outage” and the name of any big company.)

Have a Blackberry? Do you realise that all Blackberry emails in the whole world go through one data centre in central Canada, and that if that data centre has a problem, you can still use your Blackberry as a paperweight? Nobody is immune; nobody gets away unscathed. (For some relevant articles and company names, see the links at the end of this post… if you’re curious and have the time.)
In all seriousness, the point is not to deflect attention from our own problem yesterday or to say that there’s no point in trying to prevent downtime because it will happen anyway, even to those who spend lots of time and money trying to outrun it. The point is that, despite the best of intentions, these things happen, and they happen whether you’re hosting with a small hosting company or a large hosting company. I think what differentiates NinerNet from our competition is that we realise you’ve trusted us with something important to you, we know you pay us a little more than the bargain basement hosting companies, and we treat you accordingly.
This is our first major service interruption since January 2005. Nobody ever wants to be down for even a few seconds, but the reality is… “stuff” happens. We do our very best to make sure it doesn’t, but it does, and we appreciate your patience and understanding when it does.
For the record, here is a log of our uptime for August, September and October (the latter extrapolated to a full month):
September falls within acceptable guidelines; October does not.
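To illustrate how an outage like this translates into a monthly uptime figure, here is a rough, back-of-the-envelope sketch; it assumes a single outage of 2 hours and 20 minutes (19:37 to 21:57 UTC) in a 31-day month, which is a simplification rather than the actual monitoring log:

```python
# Back-of-the-envelope uptime calculation for a single outage in one month.
# Assumes one outage of 2 hours 20 minutes (19:37 to 21:57 UTC on 20 October)
# and a 31-day month; the real log would aggregate every incident.

outage_minutes = 2 * 60 + 20        # 19:37 to 21:57 UTC
minutes_in_month = 31 * 24 * 60     # October has 31 days

uptime_percent = 100 * (1 - outage_minutes / minutes_in_month)
print(f"Uptime for the month: {uptime_percent:.3f}%")  # roughly 99.69%
```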
You can read more about uptime on the HostMySite.ca website.
I hope I’ve given you enough information to fully understand what went wrong yesterday, why it went wrong, how we’ll work diligently to try to ensure it doesn’t happen again, and what we’ll do to keep you informed when the inevitable happens. We know how you feel when things don’t work; after all, we were down too.
If you have absolutely any questions, concerns, comments or even brickbats, please contact us to let us know. Thank-you.
Craig Hartnett
Here are some articles to back up the statements above about high-profile outages (and to put things in perspective), in case you’re curious, starting with the most recent (just over a week ago) and going back to 2008. The quotes (in which I’ve emphasised a few points) give a good overview of the thrust of the articles:
Here is a general article covering outages and downtime:
Server NC018 was offline between approximately 19:37 and 21:57 UTC on 20 October. This was due to an issue with the memory slots on the motherboard. The motherboard, the memory and the power supply were all replaced with new parts, and the server is back online.
We will be posting a full accounting of this incident within the next few hours. We sincerely apologise for any inconvenience this has caused.
As always, if you have any questions or concerns, please contact support.
Our investigation into why this server rebooted was inconclusive. We have, however, determined that it is currently running properly and not experiencing any problems. We’ll be monitoring key indicators to make sure this doesn’t repeat itself, as well as revising the way we monitor certain services.
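As one example of the kind of monitoring revision we have in mind, here is a minimal sketch of a check that confirms the control panel actually answers over HTTPS rather than only responding to ping; the URL, port and timeout are placeholders for illustration, not our actual monitoring configuration:

```python
# Hypothetical health check: confirm the control panel answers over HTTPS,
# not just that the server responds to ping. The URL and timeout are
# illustrative placeholders, not our real monitoring configuration.
import urllib.error
import urllib.request

CONTROL_PANEL_URL = "https://nc018.example.com:8443/login"  # placeholder URL

def control_panel_up(url: str = CONTROL_PANEL_URL, timeout: float = 10.0) -> bool:
    """Return True if the control panel responds with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("control panel up" if control_panel_up() else "control panel DOWN")
```

A check like this, run every few minutes from a machine outside the affected server, would have flagged the control-panel problem described below even while the server itself was answering pings.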
Server NC018 was rebooted at 14:10 UTC on July 10th. After the reboot, the control panel was unavailable, but this was not immediately apparent. The server was again rebooted at 23:17 UTC and the control panel is once again available.
There are a couple of unresolved issues surrounding this incident, and we are working to determine the cause or causes, as these were not scheduled events. When we have further information we will post it here.
In the meantime, if you run into any issues, with the control panel or anything else, please contact support. Thanks for your patience.