August 10, 2020 Outage Update

 

Dear Cidi Labs Customer Community,


We know many of you experienced slow performance and intermittent availability yesterday in DesignPLUS and, as it turns out, in ReadyGO as well. We are deeply sorry for any inconvenience these issues caused you during this busy time. The good news is that the performance issue has been resolved. The bad news is that it took longer to resolve than it should have.


Summary 

On Monday, August 10, 2020, shortly after 12:00 pm MDT, Cidi Labs tools began experiencing very slow load times, which eventually led to application outages. The root cause was an overloaded database server: the AWS database server remained at 100% CPU utilization for more than 5 hours, causing severely delayed response times. Further review showed that the database server had been approaching 100% CPU utilization during peak hours for over a week, which had also slowed application load times. Please note that this issue never affected the content styling (CSS and JavaScript) that DesignPLUS provides.


Unfortunately, this relatively minor issue was exacerbated because the lead developer in charge of our infrastructure was out of the office and unexpectedly unreachable during the outage, so resolution took longer than usual. In the end, this was more a failure of organizational redundancy and of executing our continuity plan than a product failure, and that is unacceptable.


Mitigation Plan

To ensure that this type of service outage does not occur in the future, Cidi Labs will be taking the following steps:

  • Update business continuity plans to provide additional emergency contact methods for the Cidi Labs team.

  • Update disaster recovery documentation to include recovery plans for additional scenarios, especially regarding the database server.

  • Train the Cidi Labs development team and management to ensure redundancies exist in disaster recovery procedures.

  • Practice/simulate our enhanced plans.

  • Update technical documentation to ensure that our third-party support partner, Rackspace, has the application documentation it needs to step in to resolve issues in place of our own personnel if that’s ever needed again. 

  • Fix our status page (status.cidilabs.com) to accurately reflect the status of our applications.

  • Conduct a thorough analysis of the Cidi Labs applications for database optimization. This process has already begun.


Again, we apologize for any impact this caused. We commit to improving our service based on what we learned, so that we can maintain your trust in our products.


Thanks,

The Cidi Labs team