2-11-08
3:30 pm
The e-mail problem was first discovered at approximately 3:30pm on 2-11-08, when the central helpdesk staff found that LDAP was inaccessible. LDAP (Lightweight Directory Access Protocol) provides the user directory and authentication for a number of services, including e-mail and wireless access. After testing the system and troubleshooting various methods of connecting to the LDAP server, we were able to confirm that the system was not operating properly.
At 4:30pm the problem was called in to Sun Service. After several conversations with them regarding the error messages we were seeing on the system, it was determined that the unit was experiencing hardware failures. Sun Service scheduled a field service call, with a replacement disk drive and a RAID controller, for the following day at noon. (The college’s service agreement calls for a 24-hour response time.)
12 noon
The Sun technician arrived the following day at approximately 12 noon. After performing several pre-boot diagnostic tests and examining the remaining drives, the technician found that the error messages were coming from the primary drive, which holds the OS (Operating System) and is required in order to boot the server.
After reconfiguration and reinstallation of the OS, the RAID configuration was tested to ensure that a failure of one of the drives would not disrupt the server operations in the future.
The impact of the LDAP failure was not immediately noticed by users with locally-assigned accounts because of the password caching feature employed by the Mirapoint mail server. This was not the case for alias accounts assigned to users in the School of Engineering and the Division of Science, because all alias accounts are routed via LDAP to e-mail servers in those two areas. Once the LDAP server was taken completely offline for servicing later that day, messages sent to users in these two areas began to bounce, and those users were unable to receive e-mail.
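The difference between the two groups of users can be illustrated with a small model. The Python sketch below is only a toy simulation written for this report, not the Mirapoint or LDAP software: the class names, the cache behavior, and the example addresses are all hypothetical. It assumes that a login falls back to a locally cached password when the directory is unreachable, while an alias lookup has no such fallback.

```python
from dataclasses import dataclass, field


class DirectoryUnavailable(Exception):
    """Raised when the simulated LDAP directory cannot be reached."""


@dataclass
class Directory:
    """Toy stand-in for the LDAP server (hypothetical, not a real LDAP client)."""
    passwords: dict
    aliases: dict
    online: bool = True

    def check_password(self, user, password):
        if not self.online:
            raise DirectoryUnavailable("LDAP offline")
        return self.passwords.get(user) == password

    def resolve_alias(self, address):
        if not self.online:
            raise DirectoryUnavailable("LDAP offline")
        return self.aliases.get(address)


@dataclass
class MailServer:
    """Toy mail server with an assumed password cache (plain text only for brevity)."""
    directory: Directory
    cache: dict = field(default_factory=dict)

    def login(self, user, password):
        try:
            ok = self.directory.check_password(user, password)
        except DirectoryUnavailable:
            # Assumption: fall back to the last password that was verified.
            return self.cache.get(user) == password
        if ok:
            self.cache[user] = password
        return ok

    def deliver(self, address):
        try:
            target = self.directory.resolve_alias(address)
        except DirectoryUnavailable:
            # No fallback exists for alias routing, so the message bounces.
            return "bounced"
        return f"forwarded to {target}" if target else "delivered locally"
```

In this model, a user who logged in before the outage can still log in while the directory is offline, but any message addressed to an alias is returned as "bounced" until the directory is back.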
The OS was reinstalled and the LDAP server (Sun) was brought back online later that evening, 2-12-08. The backup LDAP data was then restored to the system. After noticing performance issues with users’ ability to log in, we performed several diagnostic tests into the late evening and early morning hours, which indicated that the system was still not functioning properly. After conferring with Sun technical support about the replacement drive, and getting confirmation from them that the drive was not producing any error messages, the system hardware itself was ruled out as a possible source of the problem.
Sun technical support recommended that we contact our e-mail hardware provider as the possible source of the problem. Diagnostic tests on the Mirapoint server reported no hardware errors, so Mirapoint technical support recommended that we re-index the data because large blocks of data might have been corrupted. Once re-indexing was completed, the system’s performance improved immediately, and all users were able to log in without any further problems.
2-13-08
4:00pm
The system was completely restored at around 4:00pm on 2-13-08, and the LDAP data was checked to ensure there was no further corruption.
In summary, this failure of our e-mail service disrupted the college community’s ability to communicate for an entire day. The following plans and recommendations are being implemented now, and this work will continue until we have a fully redundant system in place that prevents a total system outage in the future.
The following has been implemented since this incident:
•Two (2) LDAP servers have been set up and are already online and in production.
•The college has a second Mirapoint e-mail server that is currently offline but is scheduled to be set up in mid-March. Once configured, the unit will function as a hot spare in the event of a failure of the primary e-mail server.
•The college is also in possession of a second NetApp storage appliance that will keep a live copy of users’ e-mail data for locally-assigned user accounts. This device will store replicated copies of the e-mails that are stored on the production system. Implementation is scheduled for March 8, 2008.
The following recommendations are being evaluated with the possibility of implementation in the near future:
•An e-mail queuing appliance that will queue e-mail in the event of an e-mail hardware failure or delivery problems that go beyond the normal delivery time frame
•Reviewing our e-mail environment with an e-mail consultant who is familiar with disaster planning and implementation
•Hiring an external service company to monitor critical servers and the network during the evenings, weekends and holidays
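As an illustration of the kind of check an after-hours monitoring service might run, the following Python sketch tests whether critical services accept TCP connections. It is a minimal example under stated assumptions, not a description of any vendor's product: the host names are hypothetical placeholders, and an open port only shows that the service is listening, not that it is answering requests correctly. Ports 389 (LDAP) and 25 (SMTP) are the standard ports for those protocols.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(services, probe=port_is_open):
    """Return the names of services whose port could not be reached."""
    return [name for name, (host, port) in services.items()
            if not probe(host, port)]


if __name__ == "__main__":
    # Hypothetical host names; replace with the real servers to be watched.
    critical = {
        "ldap-primary": ("ldap1.example.edu", 389),
        "ldap-secondary": ("ldap2.example.edu", 389),
        "mail": ("mail.example.edu", 25),
    }
    for name in check_services(critical):
        print(f"ALERT: {name} is not reachable")
```

A script like this could be run on a schedule during evenings, weekends and holidays, with the alert line replaced by a page or a phone call to the on-call staff.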
