We recently experienced a network disruption in the BOM2 region that caused a subset of Virtual Machines to become unreachable over the public internet. The issue was identified and mitigated by our engineering team, and all systems have returned to normal operation.
Impact
The impact was strictly isolated to Virtual Machines provisioned within a specific public IP subnet range in the BOM2 region. Customers with instances in this range experienced a loss of external network connectivity. Instances outside this subnet were unaffected.
Root Cause
The disruption was triggered by network switch maintenance performed by our upstream provider. During this maintenance, the upstream switches cleared their ARP caches and routing tables. As a result, the network temporarily lost the IP-to-MAC mappings required to route inbound public internet traffic to the affected Virtual Machines.
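To make the failure mode concrete, here is a minimal sketch, using Python with Scapy, of how the loss of an ARP mapping for an affected address could be observed from within the broadcast domain. The IP address and interface name are hypothetical placeholders, not values from the incident.

```python
# Minimal sketch: probe whether a VM's public IP still resolves to a MAC
# address on the local segment. IP, interface, and timeout are
# illustrative placeholders.
from scapy.all import ARP, Ether, srp

TARGET_IP = "203.0.113.25"   # hypothetical affected VM public IP
IFACE = "eth0"               # hypothetical probe interface

# Broadcast a who-has ARP request and wait briefly for a reply.
answered, _ = srp(
    Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=TARGET_IP),
    iface=IFACE,
    timeout=2,
    verbose=False,
)

if answered:
    print(f"{TARGET_IP} is-at {answered[0][1][ARP].hwsrc}")
else:
    print(f"No ARP reply for {TARGET_IP}: mapping may have been dropped")
```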
Resolution
Upon identifying the routing drop, our infrastructure team immediately intervened to work around the upstream bottleneck. We executed a network-wide forced ARP broadcast (gratuitous ARP) for all Virtual Machines within the impacted subnet. This forced the upstream provider's switches to relearn the mappings and immediately restored public connectivity for all affected instances, without requiring any reboots or action from our customers.
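The sketch below illustrates the kind of gratuitous ARP announcement described above, again using Scapy. The interface name and the IP/MAC pairs are hypothetical; the actual remediation covered every Virtual Machine in the impacted subnet.

```python
# Minimal sketch of a forced (gratuitous) ARP broadcast. The IP/MAC pairs
# and interface below are hypothetical placeholders.
from scapy.all import ARP, Ether, sendp

IFACE = "eth0"  # hypothetical egress interface

# Hypothetical (public IP, MAC) pairs for VMs in the affected subnet.
AFFECTED_VMS = [
    ("203.0.113.10", "52:54:00:aa:bb:01"),
    ("203.0.113.11", "52:54:00:aa:bb:02"),
]

for ip, mac in AFFECTED_VMS:
    # A gratuitous ARP reply announces "ip is-at mac" to the broadcast
    # domain, prompting upstream switches to refresh their caches.
    garp = Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=2,          # op=2: ARP reply
        psrc=ip,       # sender IP = the VM's own IP
        pdst=ip,       # target IP = the VM's own IP (gratuitous)
        hwsrc=mac,
        hwdst="ff:ff:ff:ff:ff:ff",
    )
    sendp(garp, iface=IFACE, verbose=False)
```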
Next Steps & Prevention
To minimize the risk of a recurrence and reduce time-to-resolution, we are implementing the following measures:
Enhanced Network Monitoring: We are upgrading our automated alerting to detect localized ARP cache drops and subnet-specific routing anomalies more quickly. A sketch of this kind of subnet-level probe follows this list.
Upstream Coordination: We are engaging with our upstream provider to establish stricter communication protocols and advance notice of any future switch maintenance.
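As referenced in the monitoring item above, the following is a minimal sketch of subnet-level ARP reachability alerting. The monitored addresses, failure threshold, and polling interval are hypothetical placeholders; a production version would page on-call rather than print.

```python
# Minimal sketch of subnet-level alerting: periodically ARP-probe
# representative addresses in a monitored subnet and flag an anomaly
# when most probes go unanswered. All values are hypothetical.
import time
from scapy.all import ARP, Ether, srp

MONITORED_IPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]
FAILURE_THRESHOLD = 0.5   # alert if over half the probes fail
INTERVAL_SECONDS = 30

def probe(ip: str) -> bool:
    """Return True if the IP answers an ARP who-has within the timeout."""
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip),
        timeout=2,
        verbose=False,
    )
    return bool(answered)

while True:
    failures = sum(1 for ip in MONITORED_IPS if not probe(ip))
    if failures / len(MONITORED_IPS) > FAILURE_THRESHOLD:
        # Printing keeps the sketch self-contained; production alerting
        # would notify the on-call rotation instead.
        print(f"ALERT: {failures}/{len(MONITORED_IPS)} ARP probes failed")
    time.sleep(INTERVAL_SECONDS)
```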
We apologize for the disruption this caused to your services and appreciate your patience while our team worked to restore connectivity.