Incident Summary:
On October 11, 2017, the GeorgiaVIEW Online Learning QPROD environment began experiencing periodic disruptions of service due to high CPU usage and application errors. These disruptions degraded performance for USG institutions accessing their instances of D2L hosted on our server. Disruptions occurred on the following dates and times:
-October 11, between 10:30 a.m. and 12:47 p.m.
-October 18, between 10:32 a.m. and 12:41 p.m.
-October 24, between 10:18 a.m. and 3:52 p.m.
-October 24, between 7:54 p.m. and 11:01 p.m.
-October 25, between 3:02 p.m. and 5:01 p.m.
-October 26, between 10:51 a.m. and 12:46 p.m.
In response to each disruption, ITS and the vendor (D2L) successfully restored the system to normal operation.
Because we recognize that interruptions of GeorgiaVIEW service impact institutions across the state, we are sharing this post-outage analysis of what occurred and the measures being taken to address the factors that led to this incident.
Incident Cause:
Ongoing joint investigations by ITS and D2L Support have identified an application query that occasionally causes a bad query plan to be executed. This bad query plan caused significant spikes in CPU utilization, which decreased system performance, and application errors, which impacted system availability.
Incident Response Measures:
On October 26, 2017, D2L instituted a technical fix that ensures a valid query plan remains in place on QPROD. This fix will remain in place while D2L develops a code-based solution to address the underlying issue. That solution will be deployed into production at a future date. Additional information regarding the deployment schedule will be communicated to GeorgiaVIEW customers as it becomes available.
In addition to working with D2L, we are reviewing internal ITS processes to improve internal and external communications and reduce incident response times.