The Day Our DNS Hit an Undocumented Limit in AWS

Key Takeaways

SparkPost encountered an undocumented network throughput limit on a specific AWS EC2 instance type that powered its primary DNS cluster.
Traditional instance sizing (CPU, RAM, disk) didn’t reveal this bottleneck because the issue was tied to aggregate DNS network traffic, not resource starvation.
DNS usage for high-volume outbound email is unusually heavy: SparkPost generates millions of DNS lookups for domain routing, authentication (SPF/DKIM), and AWS API interactions.
The DNS failure didn’t stem from malformed DNS responses — rather, instance-level network capacity thresholds were silently exceeded, causing widespread lookup failures.
Because AWS doesn’t explicitly document these soft network limits, diagnosing the issue required deep collaboration between SparkPost’s SRE team and AWS engineers.
The team resolved the problem by migrating DNS services to larger instance types with greater network bandwidth and redesigning parts of the DNS architecture for better isolation and failover.
No customer data or messages were lost, but the event highlighted how cloud-native architectures can hit unexpected limits at scale — and how quickly they can be fixed with AWS elasticity.

How We Tracked Down Unusual DNS Failures in AWS

We’ve built SparkPost around the idea that a cloud service like ours needs to be cloud-native itself. That’s not just posturing. It’s our cloud architecture that underpins the scalability, elasticity, and reliability that are core aspects of the SparkPost service. Those qualities are major reasons we’ve built our infrastructure atop Amazon Web Services (AWS)—and it’s why we can offer our customers service level and burst rate guarantees unmatched by anyone else in the business.

But we don’t pretend that we’re never challenged by unexpected bugs or limits of available technology. We ran into something like this last Friday, and that incident led to intermittent slowness in our service and delivery delays for some of our customers.

First, let me say that the issue was resolved that same day. Moreover, no email or related data was lost. However, if delivery of your emails was slowed because of this issue, please accept my apology (in fact, an apology from our entire team). This incident reinforced the importance of having comprehensive backup strategies in place, whether you're using PostgreSQL database backups or other data protection methods, to ensure business continuity during infrastructure challenges. We know you count on us, and it’s frustrating when we’re not performing at the level you expect.

Some companies are tempted to brush issues like a service degradation under the rug and hope no one notices. You may have experienced that with services you’ve used in the past. I know I have. But that’s not how we like to do business.

I wanted to write about this incident for another reason as well: we learned something really interesting and valuable about our AWS cloud architecture. Teams building other cloud services might be interested in learning about it.

TL;DR

We ran into undocumented practical limits of the EC2 instances we were using for our primary DNS cluster. Sizing cloud instances based on traditional specs (processor, memory, etc.) usually works just as you’d expect, but sometimes that traditional hardware model doesn’t apply. That’s especially true in atypical use cases where aggregate limits can come into play—and there are times you run headlong into those scenarios without warning.

We hit such a limit on Friday when our DNS query volume created a network usage pattern for which our instance type wasn’t prepared. However, because that limit wasn’t obvious from the docs or standard metrics available, we didn’t know we’d hit it. What we observed was a very high rate of DNS failures, which in turn led to intermittent delays at different points in our architecture.

Digging Deeper into DNS

Why is our DNS usage special? Well, it has a lot to do with the way email works, compared to the content model for which AWS was originally designed. Web-based content delivery makes heavy use of what might be considered classic inbound “pull” scenarios: a client requests data, be it HTML, video streams, or anything else, from the cloud. But the use cases for messaging service providers like SparkPost are exceptions to the usual AWS scenario. In our case, we do a lot of outbound pushing of traffic: specifically, email (and other message types like SMS or mobile push notifications). And that push-style traffic relies heavily on DNS.

If you’re familiar with DNS, you may know that it’s generally fairly lightweight data. To request a given HTML page, you first have to ask where that page can be found on the Internet, but that request is a fraction of the size of the content you retrieve.

Email, however, makes exceptionally heavy use of DNS to look up delivery domains. For example, SparkPost sends many billions of emails to over 1 million unique domains every month. For every email we deliver, we have to make a minimum of two DNS lookups, and the use of DNS TXT records for anti-phishing technologies like SPF and DKIM means DNS is also required to receive mail. Add to that our more traditional use of AWS API services for our apps, and it’s hard to exaggerate how important DNS is to our infrastructure.
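To get a feel for the scale, here is a back-of-the-envelope sketch of the average query rate implied by that two-lookup minimum. The 10 billion emails per month figure is an assumption for illustration only (this post says just "many billions"), and the function name is hypothetical. Real traffic is bursty and resolver caching absorbs repeat lookups, so treat the result as an order-of-magnitude estimate, not a measurement.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month


def average_dns_qps(emails_per_month: int, lookups_per_email: int = 2) -> float:
    """Average DNS queries per second implied by a monthly send volume.

    lookups_per_email defaults to 2, the minimum stated in the post.
    """
    return emails_per_month * lookups_per_email / SECONDS_PER_MONTH


# Assumed volume for illustration: 10 billion emails in a 30-day month.
qps = average_dns_qps(10_000_000_000)
print(f"{qps:,.0f} DNS queries/second on average")
```

Under those assumptions the average works out to several thousand queries per second before any allowance for peaks, retries, or SPF and DKIM lookups on inbound mail.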

All of this means we ran into an unusual condition in which our growing volume of outbound messages created a DNS traffic volume that hit an aggregate network throughput limit on instance types that otherwise seemed to have sufficient resources to service that load. And as denial-of-service attacks on the Dyn DNS infrastructure last year demonstrated, when DNS breaks, everything breaks. (That’s something anyone who builds systems that rely on DNS already knows painfully well.)

The sudden DNS issues triggered a response by our operations and reliability engineering teams to identify the problem. They teamed with our partners at Amazon to escalate on the AWS operations side. Working together, we identified the cause and a solution. We deployed a cluster of larger-capacity nameservers, chosen with a greater focus on network capacity, that could fulfill our DNS needs without running into the redlines for throughput. Fortunately, because all this was within AWS, we could spin up the new instances and even resize existing instances very quickly. DNS resumed normal behavior, lookup failures ceased, and we (and outbound message delivery) were back on track.

To guard against this specific issue in the future, we’re also making DNS architecture changes to better insulate our core components from the impact of similar, unexpected thresholds. In addition, we’re working with the Amazon team to determine appropriate monitoring models that will give us adequate warning to head off a similar incident before it affects any of our customers.
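As one illustration of what an early-warning signal could look like, the sketch below tracks the DNS lookup failure rate over a sliding time window and flags when it crosses a threshold. This is not the monitoring we settled on with Amazon. The class name, the default thresholds, and the probe helper are all hypothetical, and the probe uses the standard library's `socket.getaddrinfo`, which exercises the system resolver with address lookups rather than the MX and TXT queries a mail server would issue.

```python
import socket
import time
from collections import deque
from typing import Callable, Deque, Tuple


class LookupFailureMonitor:
    """Sliding-window failure-rate tracker (hypothetical helper)."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.05,
                 min_samples: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold      # alert above this failure fraction
        self.min_samples = min_samples  # avoid alerting on tiny samples
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        # Drop results that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def record(self, ok: bool) -> None:
        now = self._clock()
        self._events.append((now, ok))
        self._evict(now)

    def failure_rate(self) -> float:
        self._evict(self._clock())
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self) -> bool:
        rate = self.failure_rate()  # also evicts stale results
        return len(self._events) >= self.min_samples and rate > self.threshold


def probe(hostname: str, monitor: LookupFailureMonitor,
          resolver: Callable = socket.getaddrinfo) -> None:
    """Resolve one hostname and record whether the lookup succeeded."""
    try:
        resolver(hostname, None)
        monitor.record(True)
    except OSError:  # socket.gaierror and timeouts are OSError subclasses
        monitor.record(False)
```

In use, a scheduler would call `probe` against a sample of delivery domains and page someone when `should_alert()` returns true. The injectable `clock` and `resolver` parameters exist only so the logic can be tested without real time or network access.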

Topic	Typical AWS Workload	SparkPost’s Outbound Email Workload
Traffic Pattern	Mostly inbound “pull” requests (web pages, APIs, media)	Heavy outbound “push” traffic (billions of emails)
DNS Dependency	Light: 1–2 lookups per content request	Very heavy: multiple DNS lookups per email + SPF/DKIM TXT checks
Query Volume	Predictable and proportional to user activity	Explodes with outbound campaigns targeting millions of domains
Bottleneck Type	CPU, memory, or storage limits	Aggregate network throughput limits on instance types
Failure Mode	Degraded latency or API timeout	DNS lookup failures causing delivery delays
Visibility	Limits typically documented and surfaced in metrics	Throughput ceiling was undocumented and invisible in CloudWatch
Mitigation Approach	Scale instance resources or add caching	Migrate to higher-bandwidth instance families + DNS architecture redesign

AWS and the Cloud’s Silver Lining

I don’t want to sugarcoat the impact of this incident on our customers. But our ability to identify the underlying issue as an unexpected interaction of our use case with the AWS infrastructure—and then find a resolution to it in very short order—has a lot to do with how we built SparkPost, and our great relationship with the Amazon team.

SparkPost’s superb operations corps, our Site Reliability Engineering (SRE) team, and our principal technical architects work with Amazon every day. The strengths of AWS’ infrastructure have given us a real leg up in optimizing SparkPost’s architecture for the cloud. Working so closely with AWS over the past two years has also taught us a lot about spinning up AWS infrastructure and getting it running quickly, and we have the benefit of deep support from the AWS team.

If we had to work around a similar limitation in a traditional data center model, something like this could take days or even weeks to fully resolve. That agility and responsiveness are just two of the reasons we’ve staked our business on the cloud and AWS. Together, the kind of cloud expertise our companies share is hard to come by. Amazon has been a great business partner to us, and we’re really proud of what we’ve done with the AWS stack.

SparkPost is the first email delivery service that was built for the cloud from the start. This cloud-native approach is fundamental to how we design our email APIs for cloud infrastructure, ensuring scalability and reliability for developers. We send more email from a true cloud platform than anyone, and sometimes that means entering uncharted territory. It’s a fundamental truth of computer science that you don’t know what challenges occur at scale until you hit them. We found one on AWS, but our rapid response is a great example of the flexibility the cloud makes possible. It’s also our commitment to our customers.

Whether you’re building your own infrastructure on AWS, or a SparkPost customer who takes advantage of ours, I hope this explanation of what happened last Friday, and how we resolved it, has been useful.

Q&A

What happened?

SparkPost’s DNS cluster hit an unexpected AWS network throughput ceiling, causing DNS lookups to intermittently fail — which delayed email delivery.

Why did DNS break at all?

Outbound email depends heavily on DNS. Every send requires multiple lookups (MX records for routing, plus TXT records for SPF and DKIM), so high send volume means massive DNS traffic.

This traffic pattern exceeded an undocumented limit on the EC2 instance type hosting the nameservers.

How is DNS for email different from web applications?

Web apps mostly pull content (clients request data).
Email delivery services push traffic, triggering far more DNS lookups — often billions per month.
Email depends on DNS for routing, security validation, and failover.

How did the failure manifest?

DNS requests began dropping or timing out
Delivery queues backed up
Latency increased across parts of the system
Nothing was lost — just delayed.

Why was this hard to diagnose?

The limit wasn’t documented
AWS monitoring didn’t explicitly show the bottleneck
All traditional metrics (CPU, RAM, disk) looked normal
The issue only surfaced under a specific, high-volume DNS traffic pattern.
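
When CPU, RAM, and disk all look healthy, one place to look is the packet and byte rates on the network interface itself. The sketch below assumes a Linux host and parses two snapshots of `/proc/net/dev` to compute per-second rates. The sample counter values are made up for illustration, the helper names are hypothetical, and nothing here reflects the actual limit we hit, which was not published.

```python
from typing import Dict, NamedTuple


class IfaceCounters(NamedTuple):
    rx_bytes: int
    rx_packets: int
    tx_bytes: int
    tx_packets: int


def parse_proc_net_dev(text: str) -> Dict[str, IfaceCounters]:
    """Parse the contents of /proc/net/dev into per-interface counters."""
    counters: Dict[str, IfaceCounters] = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        f = rest.split()
        # Receive columns come first (bytes, packets, ...), then the
        # transmit columns start at index 8 (bytes, packets, ...).
        counters[name.strip()] = IfaceCounters(
            int(f[0]), int(f[1]), int(f[8]), int(f[9]))
    return counters


def rates(before: IfaceCounters, after: IfaceCounters,
          interval_seconds: float) -> Dict[str, float]:
    """Per-second rate for each counter between two snapshots."""
    return {
        field: (getattr(after, field) - getattr(before, field)) / interval_seconds
        for field in IfaceCounters._fields
    }


# Made-up snapshots taken 10 seconds apart, for illustration only.
HEADER = (
    "Inter-|   Receive                          |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
SAMPLE_T0 = HEADER + "  eth0: 1000000 5000 0 0 0 0 0 0 2000000 8000 0 0 0 0 0 0\n"
SAMPLE_T10 = HEADER + "  eth0: 4000000 25000 0 0 0 0 0 0 9000000 58000 0 0 0 0 0 0\n"

eth0_rates = rates(parse_proc_net_dev(SAMPLE_T0)["eth0"],
                   parse_proc_net_dev(SAMPLE_T10)["eth0"], 10.0)
print(eth0_rates)
```

DNS traffic is made up of many small packets, so the packets-per-second figures can climb steeply while bytes per second still look modest. Charting both alongside lookup failures is one way such a ceiling might become visible.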

How did SparkPost fix it?

Upgraded to EC2 instance types with higher network throughput ceilings
Re-architected DNS clusters to be more resilient to aggregate traffic spikes
Worked with AWS to identify better signal/alerting patterns to catch this sooner

Was customer data or mail lost?

No — only delivery slowed. Once DNS stabilized, all mail resumed normal delivery.

What’s the broader lesson?

Even in the cloud, you can hit unseen scaling constraints — but cloud-native designs give you the flexibility to recover quickly.

Elasticity, partnership with AWS, and strong SRE practices make rapid recovery possible.