Dear Customer,
As you will be aware, we suffered a serious failure of our mass storage device. This device holds the email accounts and databases for many of our customers. It is built across two separate pieces of hardware and two different controllers to provide a redundant system.
At around 8.50pm on Wednesday the primary controller failed, and the secondary controller took over as planned. Later that evening we noticed that the mail storage was not accessible, even though the SAN was operational for other services. It transpired that the email storage had not transitioned across from the primary controller to the secondary. All attempts to make this change failed, and at 11.15am on Thursday we decided to reboot the entire SAN to see whether that would clear the issue.
What we did not realise was that, after the full reboot, the secondary controller would not come back online correctly either. This failure took our Hosted Microsoft Exchange and MS SQL services offline.
Within a couple of hours we decided to make alternative arrangements by putting our disaster recovery program in place. This was first put into action for MS SQL, and we began restoring customer data as requested and required. At the same time, we started building an alternative email system to get email flowing as quickly as possible.
We have been working closely with SUN/Oracle, the vendor of the hardware, who have had an engineer working on our system.
At around 11.50am on Friday, all Hosted Exchange and SQL services were functional and working correctly.
The SUN/Oracle engineer continued to work on our storage device and by 2pm had managed to re-establish the connection to the email store. Email is now flowing correctly but, as you can imagine, service is slightly slower than normal due to the huge amount of email travelling through the system.
Throughout the incident we continued to post updates on our support news site at http://sointernetnews.blogspot.com and, for those using Twitter, via @sointernet.
I would like to apologise for the problem and the inconvenience it has caused. This is the first incident of this size in five years. We believed we had a redundant system that could cope with a failure like this. We will be reviewing our implementation fully over the coming days and adjusting our configuration as required.
Thank you again for your patience.
Friday, 10 June 2011