NinerNet Communications™
System Status

Server and System Status

NC036: Post-mortem following mail server issue in early June, and explanation of late invoicing

7 July 2025 07:12:58 +0000

As predicted immediately after the June mail server issue (which started on 11 June UTC), problems continued and new problems cropped up, delaying this post-mortem. There were two primary results. First, a relatively simple issue on the mail server that would normally have been addressed before anyone even noticed it was not addressed when it should have been. Second, our invoices that should have been sent on 15 June have still not gone out in early July! (It’s not unusual for our invoices to go out a couple of days late, but over half a month late is extreme.)

The primary issue on the mail server was that the disk drive that stores our clients’ email was about to fill up. This is a relatively routine occurrence that is addressed with the data centre and on the server in literally minutes in a two-step process: We buy new disk space in our data centre control panel, and then we configure the mail server to use that disk space.
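For illustration only, the kind of check that flags a filling drive ahead of time can be sketched as follows. This is a minimal sketch and not our actual monitoring: the 90% threshold, the function names and the path passed to `check_path` are all assumptions made for the example.

```python
import shutil


def usage_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the fraction of the drive that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def needs_more_space(total_bytes: int, used_bytes: int, threshold: float = 0.90) -> bool:
    """True when usage has reached the warning threshold (0.90 is an assumed value)."""
    return usage_fraction(total_bytes, used_bytes) >= threshold


def check_path(path: str, threshold: float = 0.90) -> bool:
    """Check a real directory; `path` is a placeholder for wherever the mail is stored."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free, in bytes
    return needs_more_space(usage.total, usage.used, threshold)
```

Run on a schedule, a check like this would prompt the two-step process above (buy the space, then configure the server to use it) before the drive actually fills.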

Concurrent with the mail server issue, following fairly routine maintenance on my desktop machine, I could no longer log into it. This was traced to a configuration choice I had made in the maintenance that resulted in the main drive on my machine filling up; instead of the free space on the drive being overwritten with 1’s and 0’s or random data and being classed as free space, it was overwritten with data that looked like real data that could not be overwritten, deleted or re-classed as empty free space. The result was that the installation drive was full and I could not log in. This denied me access to data on my computer, namely a key that I need to log into the mail server to complete the second step listed in the previous paragraph. Very early on 13 June (UTC) I was able to access the encrypted drive on which the key was stored, log into the mail server and reconfigure it to use the additional space, and the problem on the mail server was fully resolved literally seconds later.

That addressed the problem on the mail server; however, it was also the last time I was able to access the encrypted drive.

While I had access to the encrypted drive I stupidly saved files to it that I had been saving on flash drives. This wasn’t really “stupid”; it was a completely reasonable decision as the hard drive has far more space than the growing collection of flash drives I was using temporarily, and I had access to the encrypted drive and didn’t expect to lose access. As it turns out, since I no longer have access to the encrypted hard drive and the files that I saved to it were not included in a daily back-up that is run when I log into the machine, they now seem to be lost forever. Those files, while important at the time, do not include any vital business files.

A little earlier than planned I started the replacement of the now-just-outdated operating system on my work machine. For reasons I still can’t explain, the new operating system was so incredibly slow that I could make a sandwich and a cup of tea between clicks. (That’s a lot of sandwiches in an eight-sixteen-hour work day!) Several days were spent troubleshooting that issue when it suddenly, for absolutely no reason and without any action on my part, started working properly. The next priority was, since I could no longer access the encrypted drive, recovering backed-up files from the most recent daily back-up. I started with recovering vital business files and was able to immediately contact delinquent clients who apparently don’t pay their invoices until they receive a reminder. Then I started restoring all of the remaining files, meaning I could move forward with June’s invoicing. However, the restoration failed part way through, so I have had to give up and start our billing before the middle of July comes!

Our invoices will be dated 30 June 2025, which I realise is a bit disingenuous, but it keeps them dated in June. The more important dates for invoices are the dates on which your services expire; you can pay your invoice as late as you want (keeping in mind what we have said often in the recent past about waiting until the last minute), but you just need to pay it before your service, domain or certificate expires.

As always, we do sincerely apologise for the disruption that has been caused. What we have learned from this is the following:

  • Keep back-up copies of certain data — i.e., server keys and invoicing records — in places that are more instantly accessible than where all of our other data is backed up en masse,
  • Implement alternative ways of logging into servers where they are available,
  • Implement a data-recovery process that is far quicker than the standard data-recovery process that our current back-up system employs, and
  • Figure out why and how the LUKS encryption with which our hard drive was encrypted failed, and ensure it never happens again.

The third item is already in progress, as we make a second attempt to recover our backed-up data; the fourth will have to happen over an extended period in the future with no goal date and no guarantee of success, but in the meantime the data we recover from our back-ups — that are intact and in place — will be saved in unencrypted form. (Technically this goes against the first point in the “data storage and transmission” section of our privacy policy, but if we cannot access our data, there’s no point in it being encrypted!) The first item will be implemented as part of getting our daily back-ups up and running again, and the second will be implemented where it can be at our earliest convenience.
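As a rough illustration of the first item, a small set of critical files could be copied to a second, more quickly accessible location and verified after the copy. This is only a sketch under stated assumptions, not the process we will necessarily implement: the function names and the checksum verification are our own choices for the example, and the file list and destination are left to the caller.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_critical_files(files: list[Path], destination: Path) -> dict[str, str]:
    """Copy each file into `destination` and confirm the copy matches the original.

    Returns a mapping of file name to checksum. Files that share a name would
    overwrite one another; this sketch assumes the names are unique.
    """
    destination.mkdir(parents=True, exist_ok=True)
    checksums = {}
    for source in files:
        target = destination / source.name
        shutil.copy2(source, target)  # copies contents and timestamps
        source_sum = sha256_of(source)
        if sha256_of(target) != source_sum:
            raise IOError(f"Copy of {source} does not match the original")
        checksums[source.name] = source_sum
    return checksums
```

Verifying the checksum at copy time is a deliberate choice here: a back-up that cannot be read back is discovered when it is made, rather than when it is needed.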

Thank-you again for noting this information about the steps that we take to ensure that we learn from our experiences where our existing systems have failed. Please advise if you have any questions or suggestions.


Update, 2025-07-09: Contrary to what was stated above about June invoices, we have decided not to send invoices in June (if that wasn’t blatantly obvious, now that it is July) and we’ll be sending June and July’s invoices in July. Please see our post about this on our corporate blog, especially if you have any products or services scheduled to expire soon in July. Thank-you.

NC036: Update 4

13 June 2025 01:23:56 +0000

The issue on our primary mail server has finally been resolved, and all messages in the queue have been delivered. As expected, once we had access it only took a few seconds.

We will post a post-mortem in the next couple of days … hopefully. I can’t exaggerate the extent to which numerous unrelated events have piled on top of one another — even in the last few minutes! — to prevent an earlier resolution of this problem, and at this point I can’t predict whether or not more issues will prevent the posting of the post-mortem. However, I’m finally taking a breath, as this issue (amongst other things) is finally resolved.

I do once again extend my heartfelt apology for this incident, and I will do everything in my power to review the cascading failures — all not even related to the mail server itself! — that led to this not being resolved much, much sooner.

NC036: Update 3

12 June 2025 13:45:11 +0000

Words cannot express my frustration at this point. 🙁

It will be another few hours again before this situation can be resolved. It just cannot go beyond tonight, UTC. By that time my computer will be completely reset with a fully updated operating system installed.

Sorry.

NC036: Update 2

12 June 2025 09:53:21 +0000

Let me explain the situation we’re in. It’s an illustration of the fact that sometimes too much is, in fact, too much.

My primary workstation stopped working late Wednesday afternoon (UTC). It stopped working because I could not log in after performing a maintenance/security operation that I routinely run, but I ran it in a certain way that was slightly different to how I usually run it with no problems.

At about the same time I received a report from a client about a problem with the mail server. I received it by email (of course) which I read on my phone. I hadn’t seen anything similar before, so I asked him for screenshots. In the meantime I had an idea of what the cause of the problem could be based on monitoring I had done the day before, but without access to my workstation I could not log in and check and fix the problem … which would (and will) take all of about 60 seconds if I am correct. Reports and my experience since have almost confirmed my suspicions.

So, given the fact that it is the middle of the night where I am, I cannot do anything until business hours, which will be about 06:00 local, 13:00 UTC.

My local workstation is, of course, fully backed up, so it’s not a problem of a loss of data. The “problem” is with the additional security on logging into the server which we have purposely put into place in order to protect our infrastructure and your email. Because of that I cannot log into the mail server from the machine I am currently using, and will only have access to the resources I require in the morning, local time.

I cannot apologise enough for this situation that we have caused. We will calculate a credit that will be applied to all invoices of clients who host their email with us.

In the meantime, we apologise but this issue will continue until about 13:00 UTC. At that time I should have access to the server to fully and permanently address the problem. I will post an update here, on the status blog, when this issue is resolved. My humble and sincere apologies once again.

NC036: Update 1

12 June 2025 07:54:49 +0000

We continue to work on resolving this issue. The problem we’re having has nothing to do with the server itself, but our access to it.

One thing we can tell you for now is that one of the issues you may encounter is that incoming messages to your accounts may be duplicated, which is something I’m certainly experiencing. It’s frustrating and annoying for me, so I assume it is for you as well. Again, we apologise.

NC036: Emergency maintenance explanation

2 September 2021 09:23:31 +0000

The emergency maintenance on server NC036 earlier today was to add additional disk space for email storage. Normally this maintenance is scheduled in advance, but today a number of factors combined to require us to take action immediately. We will work to avoid this situation in the future, but the cause of the situation was one of the aforementioned factors.

Thank-you for your patience. If you have any questions or concerns, please contact NinerNet support. Thank-you.

NC036: Server back online

2 September 2021 02:59:34 +0000

Server NC036 is back online. It was down for 6 minutes between 02:31 and 02:37 UTC. We’ll be posting additional information about this shortly.

NC036: Emergency maintenance

2 September 2021 02:33:05 +0000

Server NC036 (our primary mail server) will be undergoing emergency maintenance within the next few minutes. The server will be down for approximately half an hour.

We will post an update when we are done.

NC036: Post-mortem

21 September 2020 08:42:47 +0000

As noted in the previous two posts, there was a virus outbreak on server NC036 (the primary mail server) this morning. Apparently the machines associated with five email accounts on three domains were compromised, allowing criminals to use those accounts to send thousands of viruses. These were intercepted by our anti-virus scanner, but due to the volume of activity on the server we had to shut down the SMTP side of the mail server while we determined which email accounts were compromised, suspended them and removed their messages from the mail queue.
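One simple way to narrow down which accounts are responsible for a flood like this is to count queued messages per sending account and flag the outliers. The sketch below is illustrative only and is not a record of the procedure we followed: the function name, the list of sender addresses and the limit of 100 messages are all assumptions made for the example.

```python
from collections import Counter


def suspicious_senders(queued_senders: list[str], limit: int = 100) -> list[str]:
    """Return the sender addresses with more than `limit` messages in the queue.

    `queued_senders` holds one sender address per queued message. Addresses are
    compared case-insensitively and the result is sorted alphabetically.
    """
    counts = Counter(address.lower() for address in queued_senders)
    return sorted(address for address, count in counts.items() if count > limit)
```

For example, a queue holding 150 messages from one account and three from another would flag only the first account at the default limit.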

Please note that what happens in almost all cases when email accounts are compromised is that the computer (or one of the machines or devices on which those accounts are configured) is what is actually compromised; it is not the server. The account owner’s machine is usually infected with a virus or other malware, and the account’s password is then transmitted to the criminals behind the virus. They then launch an attack via the legitimate and correct password. It’s as if your car was stolen and the thief used it to commit a crime; the car behaved as it was told by the guy with the key, but is not responsible for the crime. On the other hand, the owner of the car may have left the key in their car and the door unlocked, contributing to the compromise. This is why it is vitally important that you have anti-virus software installed on your computer, and kept up-to-date.

If you have any questions about this, please feel free to contact NinerNet support, and we’ll be happy to answer your questions and address your concerns. Our apologies for the interruption.

NC036: Mail server is back online

21 September 2020 07:26:54 +0000

Our apologies. The sending side of the mail server (NC036) is back up. It was down for 21 minutes between 06:56 and 07:17 UTC. The ability to check your email account was not down.

We will post additional information and contact the affected clients shortly.


Systems at a Glance:


Loc.                                                  System  Status       Ping
London, United Kingdom (Relay server)                 NC023   Internal     Up?
Vancouver, Canada (Monitoring server)                 NC028   Internal     Up?
New York, United States of America (Web server)       NC031   Internal     Up?
Toronto, Canada (Primary nameserver)                  NC033   Operational  Up?
Lusaka, Zambia (Phone server)                         NC034   Internal     Up?
Sydney, Australia (Secondary nameserver)              NC035   Operational  Up?
Amsterdam, Netherlands (Mail server)                  NC036   Operational  Up?
Toronto, Canada (Web server)                          NC040   Internal     Up?
New York, United States of America (Web server)       NC041   Operational  Up?
Seattle, United States of America (Status website)    NC042   Operational  Up?
