Server NC020 was rebooted at 14:34 UTC after it was reconfigured with more RAM.
While we have not heard from AOL yet, in recent experiments all email from AOL got through. However, these experiments were on a very small scale, so they may not accurately reflect reality.
Domains hosted by Realtime appear to be back up again after being down for about five days. That email backlog has been cleared.
Email to the iwayafrica.com domain appears to be having problems. There isn’t much of this backlogged on the mail servers, but there is some. Same with a couple of domains hosted by Coppernet.
If you’re trying to communicate with people with unreliable email, why not refer them to NinerNet? We pay referral fees!
We have had some reports of incoming email from AOL users being bounced back to the senders. After extensive research over the last few days, we believe we have addressed this problem by adjusting the configuration of one of our anti-spam measures (greylisting) to allow for the fact that AOL’s handling of outbound email doesn’t play well with greylisting.
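For the technically curious, the kind of adjustment described above can be sketched as follows. This is a simplified illustration only, not our actual configuration; the exemption entries, names and retry window are assumptions for the sake of the example.

```python
# Hypothetical sketch of a greylisting check with an exemption list.
# Greylisting temporarily rejects mail from unknown senders and accepts
# it on retry; large mail farms that retry from many different IP
# addresses (as AOL's reportedly do) can be exempted from the check.
import time

RETRY_WINDOW = 300  # seconds a sender must wait before retrying
seen = {}           # (ip, sender, recipient) -> first-seen timestamp

# Example network prefixes to exempt from greylisting (assumed values).
EXEMPT_PREFIXES = ("64.12.", "205.188.")

def accept(ip, sender, recipient, now=None):
    """Return True if the message should be accepted now."""
    now = time.time() if now is None else now
    if ip.startswith(EXEMPT_PREFIXES):
        return True  # skip greylisting entirely for exempt networks
    key = (ip, sender, recipient)
    first = seen.setdefault(key, now)
    # Accept only if the sender has waited out the retry window.
    return (now - first) >= RETRY_WINDOW
```

In a real mail server this logic lives inside the greylisting policy service; the point of the sketch is simply that an exemption list lets known-good but greylisting-unfriendly networks bypass the delay.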
If you are a NinerNet client and have any reports from correspondents with AOL email addresses indicating that email from them to you has been returned as undeliverable, please contact support.
Update 2010-06-15 07:28 UTC: Apparently this issue has not yet been resolved. We contacted AOL yesterday to find out why some of their mail servers are bouncing email prematurely and without reason, but their auto-response says it could take seven days or more for them to respond, if at all. We will post updates as they become available.
Clients sending email to domains hosted by Realtime are advised that messages to these domains have been backing up on our mail servers over the weekend, and are still not being delivered. A cursory analysis would seem to indicate that there is a routing problem at or near the Realtime mail servers. If you have concerns about this, please contact the intended recipients via a medium other than email and have them contact Realtime for an explanation.
Update 2010-06-15 07:26 UTC: This issue seems to be ongoing. Domains hosted by Realtime appear to be down still, and email to them continues to be queued on our mail servers.
We’ve had increasing numbers of reports from Windows users that they keep being disconnected from the server when trying to upload files via FTP. We’ve verified that this is a problem on a Windows XP machine, using both the native command-line FTP client and third-party GUI clients. However, when we’ve tested FTP from Mac OS X and Linux machines, there are no problems uploading files. This indicates that the problem is not on the server, and narrows it down to Windows XP at least, and perhaps other versions of Windows too. The sudden appearance of the problem suggests that a recent Windows Update may have changed something.
We were able to overcome this problem on the Windows XP machine we used for testing by adding the FTP program to the exception list in Windows Firewall. Here’s how you do it:
Below is a picture of what the “Exceptions” tab looks like on the Windows XP test machine, showing two FTP programs: the native “File Transfer Program” that comes with Windows, and FileZilla:

Windows Firewall "Exceptions" tab.
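For those comfortable with the command line, the same exception can reportedly also be added on Windows XP via `netsh`. This is a sketch only; the program path below is illustrative, so substitute the path to your own FTP client:

```
rem Add an FTP client to the Windows Firewall exception list (Windows XP).
rem The path below is an example only; use your own FTP program's path.
netsh firewall add allowedprogram program="C:\Program Files\FileZilla\filezilla.exe" name="FileZilla" mode=ENABLE
```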
If you have problems uploading files to your website via FTP, please try adding your FTP program to the Windows Firewall exceptions list as described above. If you have any questions, please contact support. Thanks.
The IP address of server NC020 has been removed from the Spamhaus Policy Block List. All email issues are resolved, and server configurations will return to normal.
If you have any questions concerning this incident, please contact support. We appreciate your patience.
While the IP address of server NC020 is still in the Spamhaus PBL, we have made a number of configuration changes to all of our mail servers which address three of the four issues mentioned in our original post on this matter:
Server NC020 is also used for some custom mailing lists. Email sent to these lists will also be handled as detailed in the first bullet point above.
We still anticipate that the IP address of server NC020 will be removed from the PBL at some point, at which point the configurations of all mail servers will revert to what they were before this incident. Further details will be posted here as necessary.
Server NC020’s IP address is still in the Spamhaus PBL. We have been unable to make any progress on having it removed yet. This may be because it’s the weekend, or it may be that we’re just one of thousands of companies that have been affected by this action, and so we just have to be patient. In the meantime, we are trying to have our case looked at sooner rather than later.
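Incidentally, anyone can check whether an IP address is on a Spamhaus list with an ordinary DNS query: the IP’s octets are reversed and prepended to the zone name, and a lookup that resolves (typically to a 127.0.0.x address) means the IP is listed, while NXDOMAIN means it is not. A small sketch of building that query name (the IP below is just a documentation example, not one of our servers):

```python
# Build the DNS name used to query a DNSBL such as the Spamhaus lists.
# Resolving the returned name (e.g. with dig or nslookup) performs the
# actual check: an answer means "listed", NXDOMAIN means "not listed".

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Reverse the IPv4 octets and append the block-list zone name."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone

# Checking 192.0.2.10 against the combined Spamhaus zone means
# resolving the name "10.2.0.192.zen.spamhaus.org".
```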
We’re also going to take steps to have mail from server NC020 routed through NC018 as a temporary measure. We will post here again when this has been put into place.
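As a rough sketch of what such temporary re-routing can look like, assuming a Postfix-style mail server (which may not match our actual setup, and the hostname below is illustrative), the change amounts to a single relay setting:

```
# /etc/postfix/main.cf (hypothetical example; hostname is illustrative)
# Relay all outbound mail from this server through NC018 instead of
# delivering directly, so recipients see NC018's unlisted IP address.
relayhost = [nc018.example.com]
```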
Within the last few hours, the IP address of server NC020 has been added to the Spamhaus Policy Block List (PBL). This is a well-known, major and useful anti-spam block list operated by a reputable anti-spam organisation, and we ourselves use it to protect our clients from spam. However, this listing affects our clients in four ways:
We will, of course, post further updates here as we learn more and as the situation progresses. We appreciate your patience.
(This post has been edited to correct the date of the outage from September to October. Apologies for any confusion.)
On 20 October at approximately 19:37 UTC, server NC018 — on which most clients and services are hosted — became unresponsive. This was immediately noticed by automated monitoring systems in the data centre, and staff at the data centre went to physically check the server. As a result, the server was rebooted. However, the server again became unresponsive a few minutes later, and was again rebooted. After the third attempt at rebooting, it became clear that the cause of the problem was likely a hardware issue. (Malicious activity was ruled out fairly quickly based on network activity.)
The server was then taken offline and examined with the intent of replacing the processor and memory as likely causes of the problem. During the examination and replacement it became clear that two memory slots on the motherboard had failed. As a result, the motherboard, the memory and the power supply were replaced with new parts. The server was then placed back on the network, powered up, and has been up since 21:57 UTC.
We have a number of mechanisms in place to communicate with clients, specifically on operational issues such as this one:
As server NC018 is our primary server, options 1 (status page), 2 (email) and 4 (website notices) were not available; option 3 (Twitter and Identi.ca) was not used soon enough; option 5 (phoning clients) would only be used as a last resort in the event of a significant and extended outage of some sort. However, we did, of course, answer or return phone calls that we received.
Please see below for what we intend to do to improve communication during events like this.
We already take a number of proactive steps to do our best to prevent this sort of outage from occurring, but it’s simply inevitable that these sorts of events occur. However, that doesn’t mean we can’t do more, and it doesn’t mean we don’t learn from any event that does happen. This is what we already do:
This is only an overview of a few of the many activities we undertake to ensure that you shouldn’t have to worry about your hosting. Within and in addition to each of these brief summaries are other more minor activities and considerations that all dovetail to provide you with the best hosting experience and value for your money.
It’s possible, yes. Almost a month ago we had a similar problem with this server, but rebooting it once brought the server back online with no apparent problems. We and staff at the data centre tried to find the cause, but there was nothing to point to one. Had it occurred to someone at the time that the problem might have been caused by hardware, we’d have had to take the server offline anyway just to investigate, never mind fix the problem. With no indication of a hardware problem — even had the thought occurred to someone — there was no reason to take the server offline just because it might have led to a eureka moment. Doing so might have resulted in the same amount of downtime — albeit at a more convenient time — but with the chance that nothing would have been found thereby making the downtime unnecessary.
All of that said, there is more we can do:
Setting up these items will take some time, especially the last item, but we will report on our progress. Communication is something we’ve given considerable thought to recently, but it’s important for us to find a balance between overwhelming you with “spam” and being so quiet that you forget who we are. Right now I think we’re on the latter end of the scale, but we definitely don’t want to be on the other end.
We’re in good company when it comes to downtime. Even the biggest operations that spend the equivalent of the GDP of small countries on infrastructure have their problems. Causes include mechanical failure, human error, human malice, weather, software bugs and even real live bugs! (In fact, an old urban legend says that the term “bug”, as applied to software, came from the discovery of an actual bug — a moth — in the workings of a computer that was generating errors.)

Twitter has become one of the most well-known online brands in the last couple of years, while being notorious for downtime — ironic, considering we use it, but we use a second service for a reason. (Of course, there you get what you pay for.) Fifty-five million people in the eastern United States and Canada suffered between several hours and two days of electrical downtime in 2003. Google, Amazon, YouTube, Barclays Bank, MySpace, Facebook, PayPal, Microsoft, eBay — the list goes on — are all among a list of big-name companies that have all experienced news-breaking downtime measured in hours, not just minutes. (Just do a web search for the word “outage” and the name of any big company.)

Have a Blackberry? Do you realise that all Blackberry emails in the whole world go through one data centre in central Canada, and if that data centre has a problem, you can still use your Blackberry for a paperweight? Nobody is immune; nobody gets away unscathed. (For some relevant articles and company names, see the links at the end of this post… if you’re curious and have the time.)
In all seriousness, the point is not to deflect attention from our own problem yesterday or to say that there’s no point in trying to prevent downtime because it will happen anyway, even to those who spend lots of time and money trying to outrun it. The point is that, despite the best of intentions, these things happen, and they happen whether you’re hosting with a small hosting company or a large hosting company. I think what differentiates NinerNet from our competition is that we realise you’ve trusted us with something important to you, we know you pay us a little more than the bargain basement hosting companies, and we treat you accordingly.
This is our first major service interruption since January 2005. Nobody ever wants to be down for even a few seconds, but the reality is… “stuff” happens. We do our very best to make sure it doesn’t, but it does, and we appreciate your patience and understanding when it does.
For the record, here is a log of our uptime for August, September and October (the latter extrapolated to a full month):
September falls within acceptable guidelines; October does not.
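For readers who want to put uptime percentages in context, the arithmetic behind them is simple. Here is a small sketch of converting an uptime percentage into allowable downtime; the figures are generic illustrations, not taken from our log:

```python
# Convert an uptime percentage into the downtime it allows per period,
# assuming a 30-day month by default. Figures are illustrative only.

def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime permitted per period at a given uptime %."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0

# "Three nines" (99.9%) allows about 43.2 minutes per 30-day month,
# while 99.5% allows 216 minutes, i.e. about 3.6 hours.
```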
You can read more about uptime on the HostMySite.ca website.
I hope I’ve given you enough information to fully understand what went wrong yesterday, why it went wrong, how we’ll work diligently to try to ensure it doesn’t happen again, and what we’ll do to keep you informed when the inevitable happens. We know how you feel when things don’t work; after all, we were down too.
If you have absolutely any questions, concerns, comments or even brickbats, please contact us to let us know. Thank you.
Craig Hartnett
Here are some articles to back up the statements above about high-profile outages (and to put things in perspective), in case you’re curious, starting with the most recent (just over a week ago) and going back to 2008. The quotes (to which I’ve added emphasis of a few points) give a good overview of the thrust of the articles:
Here is a general article covering outages and downtime: