Notification Service Latency

Performance
June 29, 1:59pm EDT

Status: closed
Start: June 29, 1:06pm EDT
End: June 29, 1:59pm EDT
Duration: 53 minutes
Affected Components:
Notification services
Update

June 29, 1:06pm EDT

StatusCast’s engineers were alerted that some incident notifications were being delivered slowly or appeared not to be delivered at all. After an initial review, our engineers determined that the service that processes notifications was experiencing performance issues, resulting in a queued backlog of notifications.


StatusCast’s engineers have scaled out this service so that the backlog can clear as efficiently as possible. Notification processing has now returned to its normal state, and we will continue to monitor it closely. A root cause analysis will be posted when more information becomes available.

Resolved

June 29, 1:59pm EDT

At this time notification services are performing normally. 

Root Cause

June 29, 1:59pm EDT

Summary of impact: Between 1:06PM and 1:59PM EDT on 29 June 2020, some customers may have experienced delays in the delivery of incident notifications. All notification services were recovered by 1:59PM EDT.

Preliminary root cause: Engineers identified the underlying root cause as a server delegation change that affected DNS resolution and caused notifications to queue up in a backlog. This issue impacted a subset of StatusCast’s customers who were delegated to the server in question. Availability of status pages and the administrative portal remained at 100% throughout the incident.

Mitigation: To mitigate, engineers corrected the server delegation issue. To expedite the processing of the server’s backlog, engineers scaled out the service so that the queued notifications were distributed efficiently. Once the backlog was cleared, the service returned to its normal operating state.

Moving Forward: StatusCast is committed to providing its customers a highly reliable and available service. Any time an issue is reported that potentially affects availability of the status page or the integrity of notification delivery, we treat it with the utmost urgency. Moving forward, we have established new monitoring protocols for our notification system to ensure that backlogs caused by latency are properly reported to our engineering staff. We have also taken this as an opportunity to evaluate the current scale of this service and how we can improve its functionality.
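
For readers who want a concrete picture of what backlog monitoring of this kind can look like, the following is a minimal sketch in Python. It is illustrative only and does not describe StatusCast's actual implementation: the names `BacklogThresholds` and `check_backlog`, and the threshold values, are hypothetical. The check assumes a queue can report two measurements, the number of pending notifications and the age of the oldest pending notification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BacklogThresholds:
    # Hypothetical limits; real values would depend on normal traffic.
    max_queue_depth: int = 1000
    max_oldest_age_seconds: float = 300.0


def check_backlog(
    queue_depth: int,
    oldest_age_seconds: float,
    thresholds: BacklogThresholds = BacklogThresholds(),
) -> Optional[str]:
    """Return an alert message if the queue looks backed up, else None."""
    reasons = []
    if queue_depth > thresholds.max_queue_depth:
        reasons.append(
            f"queue depth {queue_depth} exceeds {thresholds.max_queue_depth}"
        )
    if oldest_age_seconds > thresholds.max_oldest_age_seconds:
        reasons.append(
            f"oldest notification has waited {oldest_age_seconds:.0f}s, "
            f"over the {thresholds.max_oldest_age_seconds:.0f}s limit"
        )
    if reasons:
        return "Notification backlog alert: " + "; ".join(reasons)
    return None
```

Checking the age of the oldest queued notification as well as the queue depth means that a small but stalled queue would still raise an alert.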