Server NC020 (widely known as the relay server) has been taken offline while we investigate reports of spam being sent through the server. We will post an update within the hour.
The IP address of server NC020 has been removed from the Spamhaus Policy Block List. All email issues have been resolved, and server configurations will be returned to normal.
If you have any questions concerning this incident, please contact support. We appreciate your patience.
While the IP address of server NC020 is still in the Spamhaus PBL, we have made a number of configuration changes to all of our mail servers that address three of the four issues mentioned in our original post on this matter:
Server NC020 is also used for some custom mailing lists. Email sent to these lists will also be handled as detailed in the first bullet point above.
We still anticipate that the IP address of server NC020 will eventually be removed from the PBL, at which point the configurations of all mail servers will revert to what they were before this incident. Further details will be posted here as necessary.
Server NC020’s IP address is still in the Spamhaus PBL, and we have not yet been able to make any progress on having it removed. This may be because it’s the weekend, or it may be that we’re just one of thousands of companies affected by this action, and so we simply have to be patient. In the meantime, we are trying to have our case looked at sooner rather than later.
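For those wondering how a listing like this can be verified independently: the PBL is published as a DNS zone, so an address’s status can be checked with an ordinary DNS query using the IP’s octets in reverse order. A minimal sketch in Python (192.0.2.1 is a documentation placeholder, not the actual address of NC020):

```python
import socket

def pbl_query_name(ip: str, zone: str = "pbl.spamhaus.org") -> str:
    """Build the reversed-octet DNS name used to query a DNS blocklist."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip: str) -> bool:
    """Return True if the address is listed in the PBL.

    A DNSBL answers a query for a listed address with an A record
    (conventionally in 127.0.0.0/8); an unlisted address returns NXDOMAIN.
    """
    try:
        socket.gethostbyname(pbl_query_name(ip))
        return True
    except socket.gaierror:
        return False
```

Calling `is_listed()` requires network access, of course; building the query name does not, e.g. `pbl_query_name("192.0.2.1")` yields `1.2.0.192.pbl.spamhaus.org`.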
We’re also going to take steps to have mail from server NC020 routed through NC018 as a temporary measure. We will post here again when this has been put into place.
Within the last few hours, the IP address of server NC020 has been added to the Spamhaus Policy Block List (PBL). This is a well-known, major and useful anti-spam block list operated by a reputable anti-spam organisation, and we ourselves use it to protect our clients from spam. However, this listing affects our clients in four ways:
We will, of course, post further updates here as we learn more and as the situation progresses. We appreciate your patience.
(This post has been edited to correct the date of the outage from September to October. Apologies for any confusion.)
On 20 October at approximately 19:37 UTC, server NC018, on which most clients and services are hosted, became unresponsive. This was immediately noticed by automated monitoring systems in the data centre, and data-centre staff went to check the server physically, after which it was rebooted. However, the server became unresponsive again a few minutes later and was rebooted a second time. After the third reboot attempt, it became clear that the cause of the problem was likely a hardware issue. (Malicious activity was ruled out fairly quickly based on network activity.)
The server was then taken offline and examined with the intent of replacing the processor and memory as likely causes of the problem. During the examination and replacement it became clear that two memory slots on the motherboard had failed. As a result, the motherboard, the memory and the power supply were replaced with new parts. The server was then placed back on the network, powered up, and has been up since 21:57 UTC.
We have a number of mechanisms in place to communicate with clients, specifically on operational issues such as this one:
As server NC018 is our primary server, options 1 (status page), 2 (email) and 4 (website notices) were not available; option 3 (Twitter and Identi.ca) was not used soon enough; and option 5 (phoning clients) would only be used as a last resort in the event of a significant and extended outage. However, we did, of course, answer or return the phone calls we received.
Please see below for what we intend to do to improve communication during events like this.
We already take a number of proactive steps to prevent this sort of outage from occurring, but it’s simply inevitable that such events will occasionally happen. However, that doesn’t mean we can’t do more, and it doesn’t mean we don’t learn from every event that does occur. This is what we already do:
This is only an overview of a few of the many activities we undertake to ensure that you shouldn’t have to worry about your hosting. Behind each of these brief summaries, and in addition to them, are other smaller activities and considerations that all dovetail to provide you with the best hosting experience and value for your money.
It’s possible, yes. Almost a month ago we had a similar problem with this server, but a single reboot brought it back online with no apparent problems. We and staff at the data centre tried to find the cause, but there was nothing to point to one. Had it occurred to someone at the time that the problem might have been caused by hardware, we would have had to take the server offline anyway just to investigate, never mind fix, the problem. With no indication of a hardware problem, even had the thought occurred to someone, there was no reason to take the server offline just because doing so might have led to a eureka moment. It might have resulted in the same amount of downtime, albeit at a more convenient time, but with the chance that nothing would have been found, making the downtime unnecessary.
All of that said, there is more we can do:
Setting up these items will take some time, especially the last item, but we will report on our progress. Communication is something we’ve given considerable thought to recently, but it’s important for us to find a balance between overwhelming you with “spam” and being so quiet that you forget who we are. Right now I think we’re on the latter end of the scale, but we definitely don’t want to end up at the other end.
We’re in good company when it comes to downtime. Even the biggest operations, which spend the equivalent of the GDP of small countries on infrastructure, have their problems. Causes include mechanical failure, human error, human malice, weather, software bugs and even real live bugs! (In fact, an old urban legend says that the term “bug”, as applied to software, came from the discovery of an actual bug, a moth, in the workings of a computer that was generating errors.)

Twitter has become one of the best-known online brands of the last couple of years while being notorious for downtime; ironic, considering we use it, but we use a second service for a reason. (Of course, there you get what you pay for.) Fifty-five million people in the eastern United States and Canada suffered between several hours and two days of electrical downtime in 2003. Google, Amazon, YouTube, Barclays Bank, MySpace, Facebook, PayPal, Microsoft, eBay, the list goes on: all are big-name companies that have experienced news-breaking downtime measured in hours, not just minutes. (Just do a web search for the word “outage” and the name of any big company.) Have a BlackBerry? Do you realise that all BlackBerry emails in the whole world go through one data centre in central Canada, and that if that data centre has a problem, you can still use your BlackBerry as a paperweight?

Nobody is immune; nobody gets away unscathed. (For some relevant articles and company names, see the links at the end of this post… if you’re curious and have the time.)
In all seriousness, the point is not to deflect attention from our own problem yesterday or to say that there’s no point in trying to prevent downtime because it will happen anyway, even to those who spend lots of time and money trying to outrun it. The point is that, despite the best of intentions, these things happen, and they happen whether you’re hosting with a small hosting company or a large hosting company. I think what differentiates NinerNet from our competition is that we realise you’ve trusted us with something important to you, we know you pay us a little more than the bargain basement hosting companies, and we treat you accordingly.
This is our first major service interruption since January 2005. Nobody ever wants to be down for even a few seconds, but the reality is… “stuff” happens. We do our very best to make sure it doesn’t, but it does, and we appreciate your patience and understanding when it does.
For the record, here is a log of our uptime for August, September and October (the latter extrapolated to a full month):
September falls within acceptable guidelines; October does not.
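For context on how figures like these are derived (a simplified sketch; the exact method and thresholds used to compile the log above may differ): monthly uptime is simply the proportion of minutes the server was reachable. Assuming yesterday’s 140-minute outage (19:37 to 21:57 UTC) was the only downtime in a 31-day month, that works out to roughly 99.69%:

```python
def uptime_percent(downtime_minutes: float, days_in_month: int) -> float:
    """Percentage of the month the service was up."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# Yesterday's outage, 19:37 to 21:57 UTC, is 140 minutes of downtime.
october = uptime_percent(140, 31)  # roughly 99.69
```

A common industry guideline is “three nines” (99.9%), which a single 140-minute outage in one month falls short of; the post above does not state the exact guideline we use.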
You can read more about uptime on the HostMySite.ca website.
I hope I’ve given you enough information to fully understand what went wrong yesterday, why it went wrong, how we’ll work diligently to try to ensure it doesn’t happen again, and what we’ll do to keep you informed when the inevitable happens. We know how you feel when things don’t work; after all, we were down too.
If you have absolutely any questions, concerns, comments or even brickbats, please contact us to let us know. Thank you.
Craig Hartnett
Here are some articles to back up the statements above about high-profile outages (and to put things in perspective), in case you’re curious, starting with the most recent (just over a week ago) and going back to 2008. The quotes (in which I’ve added emphasis to a few points) give a good overview of the thrust of the articles:
Here is a general article covering outages and downtime:
Server NC018 was offline between approximately 19:37 and 21:57 UTC on 20 October. This was due to failed memory slots on the motherboard. The motherboard, the memory and the power supply were all replaced with new parts, and the server is back online.
We will be posting a full accounting of this incident within the next few hours. We sincerely apologise for any inconvenience this has caused.
As always, if you have any questions or concerns, please contact support.
Server NC018 became unresponsive at approximately 18:53 UTC. Monitoring systems alerted us to this fact, but remote access to the server was not possible. The data centre was contacted and the server rebooted. It was back online at 19:01 UTC. We’re looking into the causes of this problem.
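For the curious, the essence of a monitoring check like the one that alerted us can be sketched as a simple TCP connection attempt with a timeout (a simplified illustration, not our actual monitoring system):

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Attempt a TCP connection; True if it completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False
```

A real monitor runs checks like this on a schedule against several services (HTTP, SMTP, SSH) and alerts only after consecutive failures, to avoid false alarms from transient network blips.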
The busy website on server NC020 has been moved to its own server, so the server is back to normal loads and responds much more quickly. We apologise for the temporary inconvenience.
The performance of server NC020 is degraded at the moment due to a very busy website. We will be moving this website to a new server within the next 24 hours. We appreciate your understanding while we work with this client to keep their website up.
The primary service affected on this server is the mail relay. If you are having significant problems sending email through the relay server, please contact support.