Major Microsoft Azure Outage Affecting the GoFormz Services

Incident Report for GoFormz

Postmortem

https://azure.status.microsoft/en-us/status/history/

The following information was shared by Microsoft Azure following this incident.

This is our Preliminary Post Incident Review (PIR) to share what we know so far. After our internal retrospective is completed (generally within 14 days), we will publish a Final PIR with additional details.

What happened?

Between 15:45 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) may have experienced latencies, timeouts, and errors.

Affected Azure services include, but are not limited to: App Service, Azure Active Directory B2C, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Portal, Azure SQL Database, Azure Virtual Desktop, Container Registry, Media Services, Microsoft Copilot for Security, Microsoft Defender External Attack Surface Management, Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management UX), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), and Video Indexer.

Customer configuration changes to AFD remain temporarily blocked. We will notify customers once this block has been lifted. While error rates and latency are back to pre-incident levels, a small number of customers may still be seeing issues and we are still working to mitigate this long tail. Updates will be provided directly via Azure Service Health.

What went wrong and why?

An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery. The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services.

As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even for regions that were partially healthy. We immediately blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a ‘last known good’ configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
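The gradual rebalancing described above can be illustrated with a minimal sketch. It is not Microsoft's implementation: the `Node` type, the `ramp_step` and `traffic_shares` functions, and the 0.25 step size are all hypothetical, chosen only to show how a recovered node can be reintroduced in small increments instead of receiving a full share of traffic at once.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of an edge node; not an Azure Front Door type.
    name: str
    healthy: bool
    weight: float = 0.0  # relative routing weight, from 0.0 to 1.0


def ramp_step(nodes, step=0.25, cap=1.0):
    """Raise each healthy node's weight by one step; unhealthy nodes get zero."""
    for node in nodes:
        if node.healthy:
            node.weight = min(cap, node.weight + step)
        else:
            node.weight = 0.0
    return nodes


def traffic_shares(nodes):
    """Normalize weights into the fraction of total traffic each node receives."""
    total = sum(node.weight for node in nodes)
    if total == 0:
        return {node.name: 0.0 for node in nodes}
    return {node.name: node.weight / total for node in nodes}


# One node already at full weight, one freshly recovered, one still unhealthy.
fleet = [Node("a", True, 1.0), Node("b", True, 0.0), Node("c", False, 0.5)]
ramp_step(fleet)
shares = traffic_shares(fleet)
```

After one step in this sketch, the recovered node carries a small share while the established node still carries most of the traffic, and the unhealthy node carries none. Repeating `ramp_step` moves the recovered node toward an equal share.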

The trigger was traced to a faulty tenant configuration deployment process. Our protection mechanisms, which are intended to validate and block erroneous deployments, failed due to a software defect that allowed the deployment to bypass safety validations. Safeguards have since been reviewed, and additional validation and rollback controls have been immediately implemented to prevent similar issues in the future.
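As an illustration of the kind of safeguard described above, the sketch below shows a deployment gate that fails closed: if validation reports problems, or if the validator itself crashes, the last known good configuration keeps being served. The function names, the config shape (`routes`, `path`, `origin`), and the checks are assumptions made for this example and do not reflect Azure Front Door's actual validation logic.

```python
def validate(config):
    """Return a list of problems; an empty list means the config passed.

    The checks here are hypothetical examples, not real AFD rules.
    """
    problems = []
    if not config.get("routes"):
        problems.append("no routes defined")
    for route in config.get("routes", []):
        if not route.get("origin"):
            problems.append(f"route {route.get('path', '?')} has no origin")
    return problems


def deploy(candidate, last_known_good, validator=validate):
    """Return (config_to_serve, problems).

    The candidate is served only if validation completes and finds nothing.
    """
    try:
        problems = validator(candidate)
    except Exception as exc:
        # Fail closed: a defect in the validator must not let the change through.
        problems = [f"validator error: {exc}"]
    if problems:
        return last_known_good, problems
    return candidate, []


good = {"routes": [{"path": "/", "origin": "origin-1"}]}
bad = {"routes": [{"path": "/x"}]}

served, problems = deploy(bad, good)  # rejected, so `good` keeps being served
```

The design point is the `except` branch: treating a validator failure as a rejection means a software defect in the safety check blocks the deployment instead of silently allowing it.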

How did we respond?

15:45 UTC on 29 October 2025 – Customer impact began.

16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.

16:15 UTC on 29 October 2025 – As part of the investigation, we started to examine configuration changes within AFD.

16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.

16:20 UTC on 29 October 2025 – Targeted communications sent to impacted customers via Azure Service Health.

17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.

17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.

17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.

18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.

18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.

23:15 UTC on 29 October 2025 – PowerApps mitigated its dependency, and customers confirmed mitigation.

00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.
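For readers who want the intervals implied by the timeline above, the following short sketch computes them from the published timestamps. The variable and function names are illustrative; the timestamps are taken directly from the entries above.

```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


impact_start = utc("2025-10-29 15:45")         # customer impact began
investigation_start = utc("2025-10-29 16:04")  # investigation commenced
rollback_start = utc("2025-10-29 17:40")       # 'last known good' deployment initiated
mitigated = utc("2025-10-30 00:05")            # AFD impact confirmed mitigated

time_to_investigate = minutes_between(impact_start, investigation_start)  # 19
time_to_rollback = minutes_between(impact_start, rollback_start)          # 115
total_impact = minutes_between(impact_start, mitigated)                   # 500
```

By these figures, the investigation began 19 minutes after impact started, the rollback began 1 hour 55 minutes in, and total customer impact lasted 8 hours 20 minutes.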

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. Once it is complete, generally within 14 days, we will publish a final Post Incident Review (PIR) to all impacted customers.

Posted Oct 30, 2025 - 08:27 PDT

Resolved

This incident has been resolved. A post-mortem provided by Microsoft Azure will be posted shortly.

Posted Oct 30, 2025 - 08:25 PDT

Monitoring

Microsoft Azure has implemented a fix, and all GoFormz services are now operating as expected. We are continuing to monitor and will share the Azure post-mortem when available.

Posted Oct 29, 2025 - 14:00 PDT

Identified

After investigating the impact of the Microsoft Azure outage, we have moved Automated Workflows, Mobile Apps, Email Delivery, and API services to Operational. These services should be functioning as expected.

Web App, Login, and Public Forms services are still experiencing a major outage.

Microsoft Azure is continuing to investigate and implement a fix to address this outage. For further details, please see the latest update from Microsoft Azure: https://azure.status.microsoft/en-us/status

We will continue to monitor the situation and provide updates as more information becomes available.

Posted Oct 29, 2025 - 10:25 PDT

Investigating

We are currently experiencing a major outage impacting all GoFormz web services. This disruption is due to a widespread Microsoft Azure outage, which is affecting our infrastructure and related services.

The GoFormz team is actively investigating the issue and monitoring Azure’s recovery efforts. We will provide updates here as soon as more information becomes available.

Thank you for your patience and understanding while we work to restore full functionality.

Posted Oct 29, 2025 - 09:08 PDT

This incident affected: Web App, Automated Workflows, Email Delivery, API (v2) and Mobile Apps (iOS App, Android App, Windows App).