This post is the last in a series discussing the Neon outages on 2025-05-16 and 2025-05-19 in our AWS us-east-1 region. In this post, we cover the IP allocation failures that persisted through the majority of the disruption. For further details, read our top-level Post-Mortem here.

Summary

Neon separates Storage and Compute to provide Serverless Postgres. Our Compute Instances run in lightweight Virtual Machines in Kubernetes, each Compute running in its own Pod.

On 2025-05-16, the Neon Control Plane’s periodic job responsible for terminating idle Computes started failing, eventually resulting in our VPC subnets running out of IP addresses in two of three availability zones. Configuration changes to AWS CNI to free up IP addresses, while beneficial in the immediate term, later prevented returning to a healthy state. A post-incident follow-up on 2025-05-19 to revert this temporary state resulted in similar issues.

During this investigation, we have learned a lot about the behaviour of the AWS CNI plugin, how it interacts with our highly-dynamic environment, and have filed an improvement PR.

This article covers how the incident happened and details what we learned about AWS CNI through our post-mortem and root-cause investigation.

Glossary of terms

  • AWS CNI: refers to the AWS VPC CNI plugin. A more in-depth description of AWS CNI is provided below.
  • ipamd: part of AWS CNI; the L-IPAM (local IP address management) daemon that runs on each node
  • AWS ENI: AWS Elastic Network Interface; ENIs are allocated to EC2 instances and are associated with a subnet
  • AWS VPC: logically isolated virtual network provided by AWS
  • AWS VPC subnet (or subnet): represents a range of IP addresses in a VPC
  • Allocated IPs: AWS subnet IPs allocated to ENIs
  • Assigned IPs: IP addresses assigned to Kubernetes Pods (most Pods in our clusters are Neon Computes)
  • Total IPs: total IP addresses available for allocation in a subnet (or subnets)

2025-05-16: Running out of IP addresses

Neon operates Kubernetes clusters in 11 cloud regions. Our us-east-1 cluster in AWS typically operates at a daily peak of 6,000 running databases (which we call Computes), with roughly 500 new Pods starting every minute and a similar rate of idle databases being terminated.

When the incident started, the job responsible for shutting down idle databases failed (we have described this in more detail in a separate post). As terminations were not processed, but creations continued, the number of running Computes quickly rose past our cluster’s typical operating conditions, reaching ~8k active Computes in the space of a few minutes.

At ~8k active computes, our AWS VPC subnets ran out of IPv4 addresses. This was unexpected, as we test our clusters for up to 10k Computes, and our subnets were sized to a total of 12k IP addresses!


A summary of the conditions that led to IP allocation unavailability:

  • With its default settings, AWS CNI reserves at least 1-2 extra ENIs worth of IP addresses on each node
  • Our nodes can utilize up to 49 IPv4 addresses per ENI
  • Our AWS us-east-1 region only had 12k total IP addresses instead of the 24k we have in other regions.
  • During the incident, we had ~4k extra IPs allocated on nodes that didn’t have enough CPU or memory available for new compute Pods to be scheduled.
  • As a result, we became unable to start new computes while only 8k of 12k IPs were assigned to compute Pods — at the time, this was confusing and unexpected.

Aside: Why only 12k IP addresses?

us-east-1 was one of our first regions, and we hadn’t originally planned to run the cluster at this scale. Our load testing had indicated that our clusters could scale vertically up to 10k Computes, and that beyond that point we would need to scale out horizontally.

Even though each of our three subnets was configured with a /20 CIDR block (half the size of our other clusters), we assumed we would always have sufficient available IPs due to the identified upper bound of 10k active Computes.

The rate of growth of our service in recent months has been faster than anticipated, so we’ve been working in parallel on deeper architectural changes to support horizontal scaling. We will post articles describing the new architecture after we launch it.

Background: What is AWS CNI? How does it work?

Explaining the behaviour we saw requires some understanding of how AWS CNI works.


The Kubernetes Container Networking Interface (CNI) is the standard interface used for configuring Pod networking in Kubernetes. CNI plugins are called by the container runtime to set up (“add”) and tear down (“del”) networking for each Pod. “AWS CNI” is how we refer to the AWS VPC CNI plugin. At the time of this incident we were using AWS CNI v1.18.6.

Each Pod needs an IP address for networking within the cluster, and AWS CNI’s job is mostly assigning IP addresses to Pods, pulling from the appropriate VPC subnet. Internally, the CNI plugin itself makes RPC calls to ipamd — the host daemon on each node, responsible for allocating IPs from the subnet onto the ENIs attached to the EC2 instance and handing those out to Pods.

To isolate Pod starts from AWS API calls (and vice versa), ipamd keeps a pool of IP addresses – more than is strictly necessary for the number of Pods on the node. The pool is resized every few seconds by a separate reconcile loop, outside the context of any individual CNI request.
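To make that split concrete, here is a minimal sketch in Go (our own simplified types, not AWS CNI’s): the hot path for a Pod start only touches the node-local pool, and only the periodic reconcile loop would ever talk to the EC2 API.

package main

import (
	"errors"
	"fmt"
	"time"
)

// A deliberately simplified model of the node-local pool that ipamd maintains.
type pool struct {
	free []string // IPs allocated to the node's ENIs but not yet assigned to a Pod
}

// assign is the hot path behind a CNI "add": no EC2 API call, just hand out a
// locally available address (or fail if the pool is empty).
func (p *pool) assign() (string, error) {
	if len(p.free) == 0 {
		return "", errors.New("no available IP addresses on this node")
	}
	ip := p.free[0]
	p.free = p.free[1:]
	return ip, nil
}

// reconcile stands in for ipamd's periodic loop: compare the pool against its
// warm target and call the EC2 API to allocate or release addresses as needed.
func (p *pool) reconcile() {
	// ... warm-target comparison and EC2 calls omitted in this sketch ...
}

func main() {
	p := &pool{free: []string{"10.0.1.17", "10.0.1.18"}}

	// In the real daemon this loop runs for the life of the node.
	go func() {
		for range time.Tick(5 * time.Second) {
			p.reconcile()
		}
	}()

	ip, err := p.assign() // what a CNI "add" ultimately boils down to
	fmt.Println(ip, err)
}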

AWS CNI has several configuration options to influence how it manages its pool of IP addresses. We include details about our choice of options below.

A quick recap

Our AWS us-east-1 cluster typically operates with 5-6k active Computes. We run our Compute Pods on m6id.metal AWS instances, with 49 IP addresses per ENI (plus one IP address assigned to the network interface itself). In theory, these instances can support up to 737 pods each (or more, with prefix delegation) — in practice, we tend to run 100-400 pods per node.
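As a point of reference, the 737 figure matches the standard EKS max-Pods arithmetic, assuming m6id.metal’s limits of 15 ENIs and 50 IPv4 addresses per ENI (one of which belongs to the interface itself, plus two Pods on host networking):

package main

import "fmt"

func main() {
	// Back-of-the-envelope check for the theoretical Pod limit on one node.
	const (
		enis      = 15 // ENIs attachable to an m6id.metal instance
		ipsPerENI = 50 // IPv4 addresses per ENI, including the ENI's own address
	)
	// 49 Pod-assignable addresses per ENI, plus 2 for host-networking Pods.
	fmt.Println(enis*(ipsPerENI-1) + 2) // 737
}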

It’s worth mentioning that not all databases are equal — the number of running compute Pods on any Kubernetes node is dynamic and depends on the size of the scheduled workloads. For example, a 128 CPU node can run 128 pods with 1 CPU each, 4 pods with 32 CPUs each, or any combination in between.

During Friday’s incident, our Control Plane became unable to terminate idle databases. This resulted in the number of active Compute Pods quickly rising from ~5k to ~8.1k. As new Pods exhausted all schedulable CPU and memory across the cluster, our cluster-autoscaler added more nodes.

At this point, we had old nodes without CPU or memory capacity, but with many additional allocated IPs that could never be assigned to Pods due to these scheduling constraints. This issue was not clear to us at the time. 


As more nodes were added, we started observing IP allocation errors when new Pods were scheduled but were unable to start.

Why did we run out of IP addresses?

Prior to the incident on Friday, we were using the default AWS CNI configuration (WARM_ENI_TARGET=1 and WARM_IP_TARGET unset, more on these later).

At that point, each of the cluster’s three subnets had 3.7-3.9k allocated IP addresses (held in ipamd’s IP pools), with only 1.6-2.3k IP addresses assigned to Pods (~50% utilization). Each subnet was configured with a /20 CIDR block, which meant we had up to (4096 – 5) × 3 = 12,273 total IP addresses (5 IPs in each VPC subnet are reserved by AWS).

(Figure: 11:09 UTC on 2025-05-16)

During the incident, with the sudden increase in running Pods, the cluster had assigned ~8.8k total IP addresses (71%). However, across our three subnets, 99% of all IPs were allocated, totaling ~12,200 out of 12,273.

Because only 8.8k IP addresses were assigned, we expected the already-allocated addresses to be assignable to Pods. The result was different and unexpected: IPs allocated to older nodes were, in practice, unusable, because those nodes were already at CPU/memory capacity and AWS CNI was not releasing the addresses.

This gap between assigned and allocated IPs made it look as though the subnets still had sufficient unassigned IPs available for new Pods.

(Figure: 15:00 UTC on 2025-05-16)

In practice, as new nodes were added, they became unable to obtain sufficient IPs to match available CPU/memory capacity.

Why were there so many IPs allocated to nodes with no spare resources?

Overallocation of IPs comes from AWS CNI’s behavior under the default settings, where WARM_ENI_TARGET=1 and WARM_IP_TARGET/MINIMUM_IP_TARGET are unset:

  • Whenever ipamd sees that the number of “available” IP addresses (allocated minus assigned) is less than WARM_ENI_TARGET × (IPs per ENI), it will attempt to allocate more.
  • Allocating more IPs – if none of the existing ENIs has room – means allocating an entire ENI’s worth of IP addresses [1, 2, 3]; a simplified sketch of this logic follows below.
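In rough paraphrase (a simplified function of our own, not the actual AWS CNI source), the decision looks like this:

package main

import "fmt"

// ipsToRequest models how many addresses ipamd asks EC2 for when it decides
// the node needs more.
func ipsToRequest(warmTargetDefined bool, shortfall, freeSlotsOnENI int) int {
	if warmTargetDefined {
		// With WARM_IP_TARGET/MINIMUM_IP_TARGET set, the request is capped at
		// the actual shortfall.
		return min(shortfall, freeSlotsOnENI)
	}
	// With the defaults, the request is a whole ENI's worth of addresses
	// (up to 49 assignable IPs on m6id.metal).
	return freeSlotsOnENI
}

func main() {
	fmt.Println(ipsToRequest(false, 3, 49)) // default config: requests 49, not 3
	fmt.Println(ipsToRequest(true, 3, 49))  // warm target defined: requests 3
}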

This target for available IPs means that ipamd keeps 50-100 IP addresses allocated beyond what Pods actually need (50 IPs per ENI for the m6id.metal instance type). And because our cluster has a steady rate of incoming Pods, the random distribution of Pods onto ENIs keeps every ENI in use, so once IP addresses are added to a node they are never freed back to the subnet.

We were surprised to find this, so we have opened a PR to AWS CNI to improve ipamd’s behavior under these circumstances.

As an example of just how severe this can be, consider a node that very occasionally peaks at 400 active Pods, but normally runs enough large workloads that it only has the CPU/memory capacity to support 200 active Pods. We might see a sequence of events such as the following:

  1. A burst of Pod starts causes the node to have 400 active pods, with no available IPs left
  2. ipamd sees that the number of “available” IPs is low, and allocates more (with a new ENI) from the VPC subnet, aiming to always maintain at least 50 available (extra) IPs
  3. As older Pods on the node are removed, their IP addresses are kept in cooldown for 30s before they can be reused – so IP addresses on the new ENIs must be used for new Pod starts during this period
  4. After the load subsides, the random distribution of Pods onto ENIs results in a high probability of having at least one Pod per ENI, causing all 50 IPs per ENI to remain allocated to the node (remember: the entire ENI must be unused to remove any of the IPs.)
  5. As a result, this node is left with 200 running Pods but 450 allocated IPs!
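Putting illustrative numbers on that sequence for an m6id.metal node (our own rounding, not measured values):

package main

import "fmt"

func main() {
	const (
		ipsPerENI  = 50 // one address for the ENI itself + 49 assignable to Pods
		podsPerENI = 49
		peakPods   = 400
		steadyPods = 200
	)

	// Steps 1-2: the burst forces ipamd to attach enough ENIs for 400 Pods.
	enisAttached := (peakPods + podsPerENI - 1) / podsPerENI // ceil(400/49) = 9
	ipsAllocated := enisAttached * ipsPerENI                 // ~450 addresses

	// Step 4: with 200 Pods spread randomly across 9 ENIs, it is very unlikely
	// that any single ENI ends up completely empty, and only a completely
	// unused ENI can be detached and have its addresses returned to the subnet.
	fmt.Printf("Pods running: %d, ENIs attached: %d, IPs still allocated: %d\n",
		steadyPods, enisAttached, ipsAllocated)
}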

This happened enough in practice that almost all the IP addresses in our VPC subnets were either assigned to Pods (~8.8k) or allocated to nodes with no remaining CPU/Memory capacity (~3.4k, including ENI device IPs). Once we ran out of IPs in the subnet, we were unable to start new compute Pods, which in turn meant that idle databases couldn’t be woken up.

That still leaves ~100 IP addresses in our subnets unaccounted for – we’re not certain why they weren’t allocated (and we are following up with AWS Support to understand why this was the case). However, an extra 100 IPs likely wouldn’t have helped much, since our cluster needed an additional ~3.4k IPs.

2025-05-16: WARM_IP_TARGET=1, Releasing IPs, IPs still not assignable

After our VPC subnets ran out of IP addresses, we looked to simultaneously unblock further compute Pod creation and fix our control plane’s periodic job so that old computes were terminated.

To unblock compute Pod creation, we set WARM_IP_TARGET=1. This had the immediate intended and expected effect – freeing allocated IP addresses from nodes that couldn’t use them, and allowing more Pods to start.

Once our control plane started successfully terminating idle Computes, we observed a significant drop in the rate of successful Pod starts. As we later found out, WARM_IP_TARGET=1 unexpectedly prevents new Pod starts for 30s after each Pod deletion.

Background: What does WARM_IP_TARGET do?

Above, we described how WARM_ENI_TARGET=<T> works: ipamd ensures that there are at least T ENIs worth of extra IP addresses allocated to the node, only freeing them when an entire ENI is unused.

In contrast, when WARM_IP_TARGET=<N> is set, ipamd attempts to maintain exactly N extra IP addresses on the node: more IP addresses are allocated when fewer than N IPs are available, and extra IP addresses are freed if there are more than N available.

If both WARM_IP_TARGET and WARM_ENI_TARGET are set, WARM_IP_TARGET takes precedence.
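Expressed as a minimal sketch (our own simplification, not an AWS CNI API), the two modes size the pool roughly as follows:

package main

import "fmt"

// poolDelta returns how many IP addresses ipamd would try to allocate
// (positive) or free (negative) for a node, in this simplified model.
func poolDelta(available, warmIPTarget, warmENITarget, ipsPerENI int) int {
	if warmIPTarget > 0 {
		// WARM_IP_TARGET mode: hold exactly warmIPTarget spare addresses.
		return warmIPTarget - available
	}
	// WARM_ENI_TARGET mode: hold at least warmENITarget ENIs' worth of spares.
	// (Surplus is only ever released a whole unused ENI at a time, which this
	// sketch does not model.)
	if available < warmENITarget*ipsPerENI {
		return ipsPerENI // top up a whole ENI at a time
	}
	return 0
}

func main() {
	// Default config (WARM_ENI_TARGET=1): 10 spare IPs is still "short".
	fmt.Println(poolDelta(10, 0, 1, 49)) // 49: attach another ENI's worth

	// WARM_IP_TARGET=1: 10 spare IPs means 9 should be released.
	fmt.Println(poolDelta(10, 1, 0, 49)) // -9
}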

Why set WARM_IP_TARGET=1?

During the incident, we observed that our VPC subnets were out of IP addresses with only 72% of those IPs actually assigned to Pods. We inferred that those unused IPs must have been allocated to nodes with no room to start new Pods and looked for a quick way to free them up.

Setting WARM_IP_TARGET appeared to be the most straightforward option.

At the time, we were operating AWS CNI with the default configuration options. During the incident, we misread the documentation, mistakenly believing that the default value for WARM_IP_TARGET was 5 when unset, leading us to decide to “reduce” it to one.

However, when this parameter is unset, AWS CNI actually bases its logic on WARM_ENI_TARGET, which has substantially different behaviour from using WARM_IP_TARGET. Unbeknownst to us, this had much larger implications than just reducing the value, many of which we didn’t understand until much later.

What happened with WARM_IP_TARGET=1?


The immediate effects were as we expected: Thousands of IP addresses were returned to the VPC subnets from nodes that couldn’t use them, and subsequently allocated by nodes with CPU / memory capacity to start Pods. New Pods started on these nodes, and when we eventually hit ~10k concurrent Pods, our rate of starts slowed again due to limits we’d previously identified in load testing (e.g. kube_proxy sync latency).

Soon after, we stabilized our Control Plane’s failing job and ~7k idle computes were terminated. However, the rate of successful Pod starts remained far below the pre-incident baseline.


Investigation at the time showed that most of our new Pods were failing to start due to IP assignment issues — even though VPC subnet metrics confirmed that there were thousands of unallocated IP addresses that should have been available.

At the time, we couldn’t figure out why IP allocation was failing. The AWS CNI documentation mentioned that WARM_IP_TARGET can trigger rate-limiting on EC2 API requests; however, this was not a problem we expected with only 1-2 Pod starts per node per minute. Shouldn’t ipamd be able to request more than that?

We spent much of the time following this incident digging through AWS CNI’s source code to understand its behaviour, cross-referencing with metrics and logs we’d captured from the time of the incident.

Why did WARM_IP_TARGET=1 prevent Pods from starting?

Broadly, AWS CNI has two methods of operation, depending on whether WARM_IP_TARGET and/or MINIMUM_IP_TARGET are specified (internally referred to as the “warm target” being “defined”).

We described the default above – if there isn’t a warm target, ipamd relies on WARM_ENI_TARGET’s value to determine how many IPs to allocate. But with WARM_IP_TARGET set, ipamd has the following behavior:

  • ipamd tries to keep exactly N “available” (allocated but unassigned) IP addresses on the node, allocating additional addresses from the subnet only when the available count drops below N
  • IP addresses freed by a Pod deletion enter a 30-second cooldown before they can be assigned to a new Pod
  • IP addresses in cooldown still count towards the “available” total that ipamd compares against WARM_IP_TARGET

This combination of factors means that setting WARM_IP_TARGET=1 can prevent all Pod starts on a node for 30 seconds after each Pod removal, because ipamd will ensure that there’s exactly one “available” IP address (even if that IP address can’t be assigned due to the 30-second cooldown period).

That’s why we only saw this problem after our control plane started terminating idle computes. When no Pods are removed, ipamd can sustain a high rate of Pod starts, in spite of the small warm IP target. Deleting Pods, however, can simultaneously prevent assigning existing allocated IPs while also preventing ipamd from allocating more (because WARM_IP_TARGET=1 is guaranteed to be satisfied while any IP address is in cooldown).
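A toy model of this interaction (with types and names of our own invention) shows why a single Pod deletion can block a node for up to 30 seconds under WARM_IP_TARGET=1:

package main

import (
	"fmt"
	"time"
)

// spareIP is an address allocated to the node but not assigned to a Pod. If it
// was recently freed by a Pod deletion, it carries a cooldown deadline.
type spareIP struct {
	cooldownUntil time.Time
}

func canStartPod(spares []spareIP, warmIPTarget int, now time.Time) bool {
	assignable := 0
	for _, ip := range spares {
		if now.After(ip.cooldownUntil) {
			assignable++
		}
	}
	// More IPs are only allocated from the subnet when the spare count drops
	// below the warm target, and cooled-down addresses are included in that
	// count, so the target looks satisfied even when nothing is assignable.
	wouldAllocateMore := len(spares) < warmIPTarget
	return assignable > 0 || wouldAllocateMore
}

func main() {
	now := time.Now()
	// WARM_IP_TARGET=1, and the node's single spare IP was freed by a Pod
	// deletion five seconds ago (25 seconds of cooldown remaining).
	spares := []spareIP{{cooldownUntil: now.Add(25 * time.Second)}}

	fmt.Println(canStartPod(spares, 1, now)) // false: no Pod can start for up to 30s
}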

To make matters worse, our control plane retries compute creation if it doesn’t succeed within the timeout window (currently 2 minutes). Combined with our preexisting long Pod startup times, these retries exacerbated the problem: many successfully started Pods were deleted again before the rest of their setup could complete, each deletion resetting the 30-second countdown before more Pods could start.


Back to the incident: What did we do at the time?

At the time, we didn’t understand why our rate of successful Pod starts was so low.

We thought that it was theoretically possible that there were still unused IP addresses somewhere, or maybe WARM_IP_TARGET=1 was misbehaving. So in a last-ditch effort to free up any other allocated IP addresses, we reduced WARM_IP_TARGET even further, to zero, also setting MINIMUM_IP_TARGET=0 and WARM_ENI_TARGET=0. This helped!


Unbeknownst to us at the time, setting WARM_IP_TARGET to zero is equivalent to disabling it, resulting in ipamd using the WARM_ENI_TARGET logic.

There were two key side effects of this configuration:

  1. Similarly to before the incident, ipamd effectively stopped returning IP addresses to the subnet
  2. New IP allocations from the subnet attempted to reserve as many IPs as could fit on the ENI (instead of one IP at a time)

Together, these resulted in enough IP addresses being allocated to the nodes, allowing the cluster to stabilize. The average number of allocated IP addresses on each node increased from ~75 per node to ~250, and our rate of successful Pod starts returned to normal.


We continued to monitor the cluster over the weekend and observed a stable state until the following Monday.

2025-05-19: AWS CNI config change goes wrong, reverting doesn’t help

The following Monday, as an incident follow-up, we decided to revert the final change to our AWS CNI configuration. We believed that the state of our us-east-1 cluster after Friday’s incident was not stable, and thought that switching back to WARM_IP_TARGET=1 would help.

At the time, we were concerned that AWS CNI’s behavior with WARM_IP_TARGET=0 was unspecified, and believed that WARM_IP_TARGET=1 would be more stable.

Rolling out WARM_IP_TARGET=1 triggered the same behavior, where deleting Pods interfered with our ability to start new ones. When we observed the same conditions, we reverted back to WARM_IP_TARGET=0. However, IP assignment errors continued.

We compensated for the high error rates by increasing the size of our pre-created compute pools. The IP assignment errors continued for hours afterwards, until an 86-second window in which our control plane didn’t stop any Pods, allowing more IPs to be allocated and resolving the errors.

Why set WARM_IP_TARGET=1 again?

As is often the case, we knew much less then than we do now.

We were becoming less certain about the behavior of WARM_IP_TARGET=0. In one place, the documentation said that zero was equivalent to “not setting the variable”, but that the default was “None”, leaving us uncertain about the actual behavior with that configuration. If zero were the default, that would have been the same configuration that originally caused us to run out of IP addresses.

We also suspected that Friday’s IP assignment issues may have been resolved by coincidence and not by setting WARM_IP_TARGET=0. For example, we saw temporary improvements every time we restarted the aws-node DaemonSet (which reinitializes ipamd) — the symptoms could have been resolved by the final restart.

Remembering that setting WARM_IP_TARGET=1 had initially helped on Friday, we believed that it was likely to be more stable than the unknown situation we found ourselves in.

In hindsight, this was a mistake. Further, reverting back to WARM_IP_TARGET=0 was not sufficient to recover from the resulting degraded state.

What happened with WARM_IP_TARGET=1 this time?


IP assignment immediately started failing, and with it, our rate of successful Pod starts dropped to the same level as it was with WARM_IP_TARGET=1 on Friday.

This was unexpected. At the time, we thought Friday’s IP assignment errors were due to the cluster being left in a bad state after our VPC subnets ran out of IP addresses. Here, the errors started from a stable state — clearly inconsistent with our understanding.

Aiming to avoid further outages, we wanted to be sure of any additional configuration changes. We took some time to examine ipamd logs and eventually determined that there were likely specific issues with WARM_IP_TARGET=1.

We reverted back to WARM_IP_TARGET=0, but continued to see IP assignment errors.


Why didn’t reverting fix the issue?

It was very unexpected that issues persisted after reverting WARM_IP_TARGET back to 0. This was the healthy state through the weekend, so why didn’t it work now!?

The rate of errors had decreased enough for more Pods to get through, but the overall success rate was far below expectations:


In our investigation over the following weeks, we attained a deeper understanding of the AWS CNI codebase.

When WARM_IP_TARGET=0 and MINIMUM_IP_TARGET=0, AWS CNI uses the behavior for WARM_ENI_TARGET — even though we had WARM_ENI_TARGET=0 as well.

Under these conditions, ipamd will only allocate more IP addresses to the node if there are no available IP addresses. Combine this with our earlier finding that IP addresses in cooldown are counted towards the number of available addresses.

This means that these settings only allow allocating more IP addresses if:

  1. All of the IP addresses on the node are assigned to Pods; and
  2. No Pods on the node were removed in the last 30 seconds

Setting WARM_IP_TARGET=1 released many of the IP addresses on our nodes. Setting it back to zero while we continued to have high Pod churn meant that ipamd never saw the necessary conditions to reallocate those IP addresses. This only happened because we had also set WARM_ENI_TARGET=0.
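Condensed into a sketch (our own function, reconstructed from the behavior we observed rather than copied from ipamd):

package main

import "fmt"

// willAllocateMore models the top-up condition under the all-zero
// configuration (WARM_IP_TARGET=0, MINIMUM_IP_TARGET=0, WARM_ENI_TARGET=0).
func willAllocateMore(unassignedIPs, ipsInCooldown int) bool {
	// Cooled-down addresses count as available, so a top-up only happens when
	// the node has no unassigned IPs at all AND no Pod on the node was deleted
	// in the last 30 seconds.
	available := unassignedIPs + ipsInCooldown
	return available == 0
}

func main() {
	fmt.Println(willAllocateMore(0, 0)) // true: every IP assigned, no recent deletions
	fmt.Println(willAllocateMore(0, 1)) // false: a Pod was deleted <30s ago
	fmt.Println(willAllocateMore(5, 0)) // false: the node already has spare IPs
}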

What did we do to work around errors in the meantime?

Internally, our control plane maintains a pool of “warm” Compute pods, so that Pod starts are not on the hot path for waking an idle database.

We were only seeing ~10% of Pods failing to start, so we were able to compensate for failures by increasing the size of the pool.

This mitigated the user impact as errors continued behind the scenes.


What eventually caused the errors to stop?


Many hours after we’d mitigated user impact, we saw the IP assignment errors suddenly drop to near-zero.

We weren’t sure why at the time, but in the course of our deeper investigation, we found that this recovery was ironically due to the same issue that initially triggered Friday’s incident: Our control plane stopped shutting down idle computes for 86 seconds, due to an expensive query to the backing Postgres database. (We wrote more about this query and our changes to improve its execution plan in this related blog post.)


This brief gap with no Pod deletions meant there were no IP addresses in cooldown, so as further computes started and used up the remaining available IPs, ipamd’s conditions for allocating more IP addresses were finally satisfied. And indeed, we saw simultaneous allocations across the cluster, which in turn reduced the Pod start failure rate.

Final thoughts

This incident resulted in significant downtime for our customers and we were determined to understand the conditions that led to it, so we can prevent it – and incidents like it – from happening again.

Throughout this investigation our team learned a lot about AWS CNI internals, and we’ve even submitted a pull request to help improve the behavior for others.

In keeping with our philosophy of learning from incidents, we decided to make the investigation public. We hope that other teams can benefit from what we’ve learned, helping us all move towards a more reliable future.