Azure
Impact statement: Beginning as early as 11 Aug 2023, you have been identified as a customer experiencing timeouts and high server load for smaller size caches (C0/C1/C2).
Current status: Investigation revealed the cause to be a change in the behavior of one of the Azure security monitoring service agents used by Azure Cache for Redis. The monitoring agent subscribes to the event log and has a scheduled backoff for resetting its subscription if no events are received. In some cases the scheduled backoff does not work as expected and can increase the frequency of subscription resets, which can significantly affect CPU usage for smaller size caches. We are currently rolling out a hotfix to the impacted regions; this rollout is 80% complete. We initially estimated this to complete by 13 Oct 2023; however, current progress indicates we expect to complete by 11 Oct 2023. To prevent impact until the fix is fully rolled out, we are applying a short-term mitigation to all caches that reduces the log file size. The next update will be provided by 19:00 UTC on 8 Oct 2023, or as events warrant, to allow time for the short-term mitigation to progress.
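To make the failure mode above concrete, below is a minimal Python sketch of a scheduled backoff for resetting an idle event-log subscription. It is purely illustrative and not the monitoring agent's actual implementation; the function names and intervals are assumptions. If the backoff fails to grow as intended, the subscription is reset far more often, which matches the elevated CPU usage described above.

```python
import time

# Hypothetical illustration of a scheduled backoff for resetting an event-log
# subscription; not the actual Azure monitoring agent implementation.
MIN_BACKOFF_SECONDS = 30
MAX_BACKOFF_SECONDS = 1800  # cap so idle resets never become too frequent


def poll_subscription(receive_events, reset_subscription):
    """Reset the subscription when idle, backing off exponentially.

    A bug that fails to grow `backoff` (or resets it too eagerly) would cause
    frequent subscription resets, the CPU-heavy behavior described above.
    """
    backoff = MIN_BACKOFF_SECONDS
    while True:
        events = receive_events()          # returns a list, possibly empty
        if events:
            backoff = MIN_BACKOFF_SECONDS  # activity seen: start over
            continue
        time.sleep(backoff)                # idle: wait before resetting
        reset_subscription()
        backoff = min(backoff * 2, MAX_BACKOFF_SECONDS)
```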
Impact statement: Beginning as early as 11 Aug 2023, you have been identified as a customer experiencing timeouts and high server load for smaller size caches (C0/C1/C2).
Current status: Investigation revealed the cause to be a change in the behavior of one of the Azure security monitoring service agents used by Azure Cache for Redis. The monitoring agent subscribes to the event log and has a scheduled backoff for resetting its subscription if no events are received. In some cases the scheduled backoff does not work as expected and can increase the frequency of subscription resets, which can significantly affect CPU usage for smaller size caches. We are currently rolling out a hotfix to the impacted regions; this rollout is 80% complete. We initially estimated this to complete by 11 Oct 2023; however, current progress indicates we expect to complete by 09 Oct 2023. To prevent impact until the fix is fully rolled out, we are applying a short-term mitigation to all caches that reduces the log file size. The next update will be provided by 19:00 UTC on 8 Oct 2023, or as events warrant, to allow time for the short-term mitigation to progress.
Summary of Impact: Between as early as 11 Aug 2023 and 18:00 UTC on 8 Oct 2023, you were identified as a customer who may have experienced timeouts and high server load for smaller size caches (C0/C1/C2).
Current Status: This issue is now mitigated. More information will be provided shortly.
What happened?
Between as early as 11 Aug 2023 and 18:00 UTC on 8 Oct 2023, you were identified as a customer who may have experienced timeouts and high server load for smaller size caches (C0/C1/C2).
What do we know so far?
We identified a change in the behavior of one of the Azure security monitoring service agents used by Azure Cache for Redis. The monitoring agent subscribes to the event log and has a scheduled backoff for resetting its subscription if no events are received. In some cases, the scheduled backoff does not work as expected and can increase the frequency of subscription resets, which can significantly affect CPU usage for smaller size caches.
How did we respond?
To address this issue, engineers performed manual actions on the underlying Virtual Machines of impacted caches. After further monitoring, internal telemetry confirmed that the issue was mitigated and full service functionality was restored.
What happens next?
We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Summary of Impact: Between 20:30 UTC on 18 Aug 2023 and 05:10 UTC on 19 Aug 2023, you were identified as a customer using workspace-based Application Insights resources who may have experienced 7-10% data gaps during the impact window, and potentially incorrect alert activations.
Preliminary Root Cause: We identified that the issue was caused by a code bug in the latest deployment, which caused some data to be dropped.
Mitigation: We rolled back the deployment to the last known good build to mitigate the issue.
Additional Information: Following additional recovery efforts, we re-ingested the data that was not correctly ingested due to this event. After further investigation, it was discovered that the initially re-ingested data had incorrect TimeGenerated values instead of the original TimeGenerated value. This may cause incorrect query results, which may in turn cause incorrect alerts or report generation. We have investigated the issue that caused this behavior so that future events utilizing data recovery processes will re-ingest data with the correct, original TimeGenerated value.
If you need any further assistance on this, please raise a support ticket.
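For customers who want to gauge whether records in their own workspace were re-ingested long after the fact, the following is a hedged Python sketch using the azure-monitor-query library to compare each record's TimeGenerated with its KQL ingestion_time(). The workspace ID, table name (AppRequests), and thresholds are illustrative placeholders, not values provided as part of this incident.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder; substitute your own Log Analytics workspace ID and table.
WORKSPACE_ID = "<your-log-analytics-workspace-id>"

# Records whose ingestion time is far later than TimeGenerated are candidates
# for data that was re-ingested after the original event occurred.
QUERY = """
AppRequests
| extend IngestedAt = ingestion_time()
| where IngestedAt - TimeGenerated > 1h
| summarize ReingestedRecords = count() by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=7))

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```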
Summary of Impact: Between 20:30 UTC on 18 Aug 2023 and 05:10 UTC on 19 Aug 2023, you were identified as a customer using workspace-based Application Insights resources who may have experienced 7-10% data gaps during the impact window, and potentially incorrect alert activations.
Preliminary Root Cause: We identified that the issue was caused by a code bug in the latest deployment, which caused some data to be dropped.
Mitigation: We rolled back the deployment to the last known good build to mitigate the issue.
Additional Information: Following additional recovery efforts, we re-ingested the data that was not correctly ingested due to this event. After further investigation, it was discovered that the initially re-ingested data had incorrect TimeGenerated values instead of the original TimeGenerated value. This may have caused incorrect query results, which may in turn have caused incorrect alerts or report generation. Our investigation extended past the previous mitigation, and we identified a secondary code bug that caused this behavior. We deployed a hotfix using our Safe Deployment Procedures that re-ingested the data with the correct, original TimeGenerated value. All regions are now recovered, and previously incorrect TimeGenerated values have been corrected.
If you need any further assistance on this, please raise a support ticket.
Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We are actively investigating this issue and will provide more information within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We anticipate this mitigation workstream to take up to 4 hours to complete. An update on the status of this mitigation effort will be provided within 2 hours.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We anticipate this mitigation workstream to take up to 4 hours to complete. An update on the status of this mitigation effort will be provided within 2 hours.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We completed more than half of this mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 2 hours. An update on the status of this mitigation effort will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We completed more than half of this mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 2 hours. An update on the status of this mitigation effort will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We are approximately 85% complete with this final mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 60 minutes. The next update will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We are approximately 85% complete with this final mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 60 minutes. The next update will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We have completed our latest workstream across all instances of the affected ingestion service. A very small subset of instances remains unhealthy, where additional action is ongoing to complete recovery of the ingestion service and mitigate remaining impact. Customers may be seeing signs of recovery. An update on the status of the mitigation effort will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We have completed our latest workstream across all instances of the affected ingestion service. A very small subset of instances remains unhealthy, where additional action is ongoing to complete recovery of the ingestion service and mitigate remaining impact. Customers may be seeing signs of recovery. An update on the status of the mitigation effort will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We are progressing with the recovery of the remaining unhealthy service instances, which is estimated to complete within 60 minutes. Customers may be seeing signs of recovery. An update on the status of the service instance recovery effort will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We are progressing with the recovery of the remaining unhealthy service instances, which is estimated to complete within 60 minutes. Customers may be seeing signs of recovery. An update on the status of the service instance recovery effort will be provided within 60 minutes.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. Our telemetry shows that ingestion errors have almost returned back to normal and most customers should be seeing signs of recovery at this time. We are continuing to address what remains to be a small number of errors occurring on some ingestion service instances. An update will be provided within 60 minutes, or as soon as mitigation has been confirmed.
Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.
Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. Our telemetry shows that ingestion errors have almost returned back to normal and most customers should be seeing signs of recovery at this time. We are continuing to address what remains to be a small number of errors occurring on some ingestion service instances. An update will be provided within 60 minutes, or as soon as mitigation has been confirmed.
Summary of Impact: Between 07:15 UTC on 23 Jul 2023 and 00:05 UTC on 24 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may have experienced intermittent data latency and incorrect alert activation.
This incident is now mitigated. More information will be provided shortly.
Summary of Impact: Between 07:15 UTC on 23 Jul 2023 and 00:05 UTC on 24 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may have experienced intermittent data gaps, data latency, and incorrect alert activation.
This incident is now mitigated. More information will be provided shortly.
Summary of Impact: Between 07:15 UTC on 23 Jul 2023 and 00:05 UTC on 24 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may have experienced intermittent data gaps, data latency, and incorrect alert activation.
Preliminary Root Cause: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform.
Mitigation: We rolled back services to the previous version to restore the health of the system. Full mitigation took longer due to various cache layers that needed to be cleared so that data could be ingested as expected.
Next Steps: We are continuing to investigate the underlying cause of this event to identify additional repairs to help prevent future occurrences of this class of issue. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Summary of Impact: Between 07:15 UTC on 23 Jul 2023 and 00:05 UTC on 24 Jul 2023, a subset of customers using workspace-based Application Insights and Azure Monitor Storage Logs may have experienced intermittent data latency and incorrect alert activation.
Preliminary Root Cause: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform.
Mitigation: We rolled back services to the previous version to restore the health of the system. Full mitigation took longer due to various cache layers that needed to be cleared so that data could be ingested as expected.
Next Steps: We are continuing to investigate the underlying cause of this event to identify additional repairs to help prevent future occurrences for this class of issue. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Impact Statement: Starting at 06:45 UTC on 26 Jun 2023, you have been identified as a customer using Azure Front Door who may have encountered intermittent HTTP 502 error response codes when accessing Azure Front Door CDN services.
Current Status: Based on our initial investigation, we have determined that a subset of AFD POPs became unhealthy and unable to handle the load of incoming requests, which in turn impacted Azure Front Door availability.
After a successful simulation, we are currently applying the potential mitigation workstream by removing impacted instances from rotation while monitoring traffic allocation to the remaining healthy clusters. The first set of impacted instances was successfully removed at 15:30 UTC on 26 Jun 2023. Since the removal we have seen a reduction in errors, and we will continue to monitor impact. We are currently working on removing the rest of the impacted instances and allocating the resources to a healthy alternative. Some customers may already begin to see signs of recovery. The next update will be provided in 60 minutes, or as events warrant.
Summary of Impact: Between 06:45 UTC and 17:15 UTC on 26 Jun 2023, you were identified as a customer using Azure Front Door who may have encountered intermittent HTTP 502 error response codes when accessing Azure Front Door CDN services.
This issue is now mitigated; more information on mitigation will be provided shortly.
Summary of Impact: Between 06:45 UTC and 17:15 UTC on 26 Jun 2023, you were identified as a customer using Azure Front Door who may have encountered intermittent HTTP 502 error response codes when accessing Azure Front Door CDN services.
Preliminary Root Cause: We found that a subset of AFD POPs were throwing errors and unable to process requests.
Mitigation: We moved the resources from the affected AFD POPs to healthy alternatives which returned the service to a healthy state.
Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Summary of Impact: Between 20:33 UTC and 21:00 UTC on 10 Jun 2023, customers in East US may have experienced impacted network communications due to a hardware failure of a router during planned maintenance. Retries would have been successful.
Preliminary Root Cause: We have determined that the router self-healed at 21:00 UTC.
Mitigation: We have isolated the device as a precaution and stopped further upgrades.
Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Between 03:30 UTC and 13:52 UTC on 16 Mar 2023, you were identified as a customer using Azure Redis Cache who may have experienced some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
This issue is now mitigated; more information will follow shortly.
We are investigating an alert for Azure Redis Cache. We will provide more information as it becomes available.
Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
Current Status: We are aware of the issue and are actively investigating. The next update will be provided in 60 minutes or as events warrant.
Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
Current Status: We have identified the cause of this to be an issue with a recent deployment. We are in the early stages of developing a hotfix to mitigate this issue. The next update will be provided in 1 hour or as events warrant.
Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
Current Status: We have identified the potential root cause to be an issue with a recent deployment. We are in the early stages of developing a hotfix to mitigate this, in the meantime, we are looking into pausing the deployment to avoid further disruption. The next update will be provided in 2 hours, or as events warrant.
Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
Current Status: We have identified the potential root cause to be a bug which was introduced during a recent deployment and is causing the unexpected failovers to occur. We are working to pause the deployment to avoid further disruption. There is no current workaround for this issue; however, once the failover process is completed for all nodes of a cache resource, cache health should return to normal. We are also working on a hotfix that will be deployed in the coming days to resolve the underlying issue.
The next update will be provided in 2 hours, or as events warrant.
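While there is no workaround for the underlying bug itself, applications can often ride through brief failovers with standard client-side retry settings. The following redis-py sketch is illustrative only; the host, credentials, and retry parameters are assumptions rather than guidance issued as part of this incident.

```python
from redis import Redis
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError, TimeoutError
from redis.retry import Retry

# Retry transient failures (e.g. during a node failover) with exponential backoff.
retry = Retry(ExponentialBackoff(cap=10, base=1), retries=3)

client = Redis(
    host="<your-cache-name>.redis.cache.windows.net",  # placeholder host
    port=6380,
    password="<access-key>",                           # placeholder credential
    ssl=True,
    socket_timeout=5,
    retry=retry,
    retry_on_error=[ConnectionError, TimeoutError],
)

client.ping()  # raises only after the configured retries are exhausted
```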
Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
Current Status: We have identified the cause to be a bug introduced during a recent deployment, which is causing the unexpected failovers to occur. We have paused the deployment as a short-term fix while we continue to develop the hotfix that addresses the underlying issue, which we expect to fully mitigate this. There is no current workaround for this; however, once the failover process is completed for all nodes of a cache resource, cache health should return to normal.
The next update will be provided in 2 hours, or as events warrant.
Summary of Impact: Between 03:30 UTC and 13:52 UTC on 16 Mar 2023, you were identified as a customer using Azure Redis Cache who may have experienced some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.
Preliminary Root Cause: We identified that a recent deployment introduced a bug, which led to the unexpected failovers, timeouts and connectivity issues mentioned above.
Mitigation: We mitigated this by stopping the deployment which was causing these issues, and can confirm that this has mitigated impact for customers. We are still in the process of rolling out the hotfix as a long term fix.
Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.
Summary of Impact: Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365 and PowerBI.
Preliminary Root Cause: We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet and Azure, connectivity between services in different regions, and ExpressRoute connections.
Mitigation: We identified a recent change to WAN as the underlying cause and have rolled back this change. Networking telemetry shows recovery from 09:00 UTC onwards across all regions and services with the final networking equipment recovering at 09:35 UTC. Most impacted Microsoft services automatically recovered once network connectivity was restored, and we worked to recover the remaining impacted services.
Next Steps: We will follow up in 3 days with a preliminary Post Incident Report (PIR), which will cover the initial root cause and repair items. We'll follow that up 14 days later with a final PIR where we will share a deep dive into the incident. You can stay informed about Azure service issues, maintenance events, or advisories by creating custom service health alerts (https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation) and you will be notified via your preferred communication channel(s).
Summary of Impact: Between 7:08 UTC and 12:30 UTC on 25 Jan 2023, you were identified as a customer in Canada East, East US, South Central US, West US, and Canada Central who may have experienced latency or timeouts when deploying networking services through the Azure portal.
Preliminary Root Cause: We determined that our services were affected by network latency caused by a networking router that was taken out of rotation for maintenance during a spike in traffic. This led to increased congestion on some links.
Mitigation: The router was removed from service, and traffic routes were optimized so services could resume normally.
Next Steps: A full root cause investigation into why this router caused widespread impact will be conducted. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.