Agile Availability Issues Update
You have experienced several Agile outages over the last couple of weeks. Our team is continuing to work on resolving the problem, which is happening within our core network infrastructure. Agile crashed on January 18th, January 23rd, twice on January 24th, and again this morning (January 31st). In each instance, we contained the issue. We now have a playbook for containing the issue if it happens again, including a monitoring tool that provides “early warning” for this type of event. This morning’s outage was contained to 30 minutes of downtime. We apologize for any inconvenience this may have caused you.
Please find a full explanation of the issues and the corrective action below.
On Thursday, January 18th, at approximately 10 AM EST, the network connection between our application servers and data storage became unstable, causing a full Agile systems outage. It took us four hours to stabilize the network and provide access to Agile. All associates were back on by 2:30 PM. We executed our root cause countermeasure (RCCM) process, which included evaluating data logged from our network switches and reviewing recent changes in our network topology. We thought a recent change in our network topology was not fully “understood” by our core switches, and that their memory needed to be refreshed by recycling each of the switches that make up the core network. The switches have run uninterrupted with 100% uptime since January 2014, so we waited until Sunday to do the reboot to minimize the impact if any switches failed during the process. The Sunday recycle was successful. We felt we had corrected the problem.
On Tuesday, January 23rd, it happened again! This time at 7 AM EST, and the outage lasted 90 minutes. On Tuesday night, we worked with our vendors to evaluate logs and update switch configurations.
On Wednesday, January 24th, it happened again around 7 AM EST. By then, we had gotten better at containment and were back up in about 30 minutes. Later in the day, around 6 PM EST, the Agile database crashed. It was down for an hour. This crash was due to instability caused by the earlier network outage, so I’m blaming the network again. Our switch vendor identified a memory leak prior to each instability and recommended we install new firmware on our switches. The installation was completed on Wednesday evening, and we ran stable for 6 days until this morning.
This morning we experienced another outage around 8:30 AM EST and had Agile back online by 9:00 AM EST. We captured an enormous amount of data before and during the outage that we are now analyzing with our data storage and network switch vendors. I will keep you posted through this blog on further corrective actions as we learn more.
Regards,
Pat Quinn
SVP, Information Systems and Technology
Is there an approximate timeframe in which Agile will work on browsers other than IE?
Thanks for all the info, Pat. This must be a little frustrating for the team. Good work! I have complete confidence that it will all be correct and stable very soon.