

Building a Bulk Asynchronous Bird Recipient Validation Tool

Key Takeaways

  • The author built a bulk recipient validation tool to validate millions of email addresses efficiently using Bird’s Recipient Validation API.
  • Node.js proved faster and more scalable than Python due to its non-blocking I/O and lack of GIL limitations.
  • The tool reads CSV files asynchronously, calls the validation API for each email, and writes results to a new CSV in real time.
  • The approach avoids memory bottlenecks and improves throughput to about 100,000 validations in under a minute.
  • Future improvements could include better retry handling, a user-friendly UI, or migrating to serverless environments for scalability.

Q&A Highlights

  • What is the purpose of the Bulk Asynchronous Recipient Validation Tool? It validates large volumes of email addresses by integrating directly with Bird’s Recipient Validation API, outputting verified results quickly without manual uploads.

  • Why was Python initially used and later replaced by Node.js? Python’s Global Interpreter Lock (GIL) limited concurrency, while Node.js allowed true asynchronous execution, resulting in far faster parallel API calls.

  • How does the tool handle large files without running out of memory? Instead of loading all data at once, the script processes each CSV line individually—sending the validation request and immediately writing results to a new CSV file.

  • What problem does the tool solve for developers? It enables email list validation at scale, overcoming the 20MB limit of SparkPost’s UI-based validator and eliminating the need to upload multiple files manually.

  • How fast is the final version of the program? Around 100,000 validations complete in 55 seconds, compared to over a minute using the UI version.

  • What issues were encountered on Windows systems? Node.js HTTP client connection pooling caused "ENOBUFS" errors after many concurrent requests, which were fixed by configuring axios connection reuse.

  • What future enhancements are suggested? Adding error handling and retries, creating a front-end interface, or implementing the tool as a serverless Azure Function for better scalability and resilience.

If you are looking for a simple, fast program that takes in a CSV, calls the Recipient Validation API, and outputs a CSV, this program is for you.

When building email applications, developers often need to integrate multiple services and APIs. Understanding email API fundamentals in cloud infrastructure provides the foundation for building robust tools like the bulk validation system we'll create in this guide.

One of the questions we occasionally receive is: How can I bulk validate email lists with Recipient Validation? There are two options: upload a file through the SparkPost UI for validation, or make an individual call to the API for each email (since the API validates one email per request).

The first option works great but has a limit of 20MB (about 500,000 addresses). What if someone has an email list containing millions of addresses? It could mean splitting that up into thousands of CSV file uploads.

Since uploading thousands of CSV files is impractical, I took that use case and began to wonder how fast I could get the API to run. In this blog post, I will explain what I tried and how I eventually arrived at a program that could complete around 100,000 validations in 55 seconds (whereas the UI took around 1 minute 10 seconds for the same 100,000 validations).

| Approach | Validations Tested | Time to Complete | Approx. Throughput |
| --- | --- | --- | --- |
| Bulk async Node.js tool | 100,000 | 55 seconds | ~1,818 validations/sec |
| SparkPost UI upload | 100,000 | 1 min 10 sec | ~1,428 validations/sec |

And while validating about 654 million addresses would still take roughly 100 hours at this rate, the script can run unattended in the background, saving significant time.

The final version of this program can be found here.

My first mistake: using Python

Python is one of my favorite programming languages. It excels in many areas and is incredibly straightforward. However, one area it does not excel in is concurrent processing. While Python does have the ability to run asynchronous functions, it is constrained by what is known as the Python Global Interpreter Lock, or GIL.

"The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter.

This means that only one thread can be in a state of execution at any point in time. The impact of the GIL isn’t visible to developers who execute single-threaded programs, but it can be a performance bottleneck in CPU-bound and multi-threaded code."

Since the Global Interpreter Lock (GIL) allows only one thread to execute at a time, even on multi-core systems, it has gained a reputation as an "infamous" feature of Python (see Real Python’s article on the GIL).

At first, I wasn’t aware of the GIL, so I started programming in Python. In the end, even though my program was asynchronous, it kept getting locked up, and no matter how many threads I added, I still only got about 12-15 iterations per second.

The signature of the main asynchronous function in Python can be seen below:

async def validateRecipients(f, fh):

My second mistake: trying to read the file into memory

My initial idea was as follows:

  1. Ingest a CSV list of emails.
  2. Load the emails into an array and check that they are in the correct format.
  3. Asynchronously call the recipient validation API.
  4. Wait for the results and load them into a variable.
  5. Output that variable to a CSV file.

This worked very well for smaller files. The issue arose when I tried to run 100,000 emails through: the program stalled at around 12,000 validations. With the help of one of our front-end developers, I saw that the problem was loading all the results into a variable (and therefore quickly running out of memory). If you would like to see the first iteration of this program, I have linked it here: Version 1 (NOT RECOMMENDED).

My revised approach was as follows:

  1. Ingest a CSV list of emails.
  2. Count the number of emails in the file for reporting purposes.
  3. As each line is read asynchronously, call the recipient validation API and write the result to a CSV file.

Thus, for each line read in, I call the API and write out the result asynchronously so as not to keep any of this data in long-term memory. I also removed the email syntax checking after speaking with the recipient validation team, as they informed me that recipient validation already has built-in checks for whether an email is valid.

Flowchart illustrating the email validation workflow: ingesting a CSV list of emails, asynchronously calling the recipient validation API for each line, and writing results to an output CSV file.

Breaking down the final code

After reading in and validating the terminal arguments, I run the following code. First, I read in the CSV file of emails and count each line. This function serves two purposes: 1) it allows me to accurately report on file progress (as we will see later), and 2) it allows me to stop a timer when the number of completed validations equals the number of emails in the file. I added the timer so I could run benchmarks and ensure I was getting good results.

let count = 0; // Line count
require("fs")
  .createReadStream(myArgs[1])
  .on("data", function (chunk) {
    // A newline character (byte value 10) marks the end of each CSV row
    for (let i = 0; i < chunk.length; ++i)
      if (chunk[i] === 10) count++;
  });

Next Steps


Some additions to this program would be the following:

  • Build a front end or easier UI for use
  • Better error and retry handling, because if the API throws an error for some reason, the program currently doesn’t retry the call
  • Consider implementing as a serverless Azure Function for automatic scaling and reduced infrastructure management
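For the retry suggestion above, a small wrapper like the following could be one approach. This is a sketch: the attempt count and backoff delays are illustrative assumptions, not behavior of the current tool.

```javascript
// Retry a validation call with exponential backoff so a transient API
// error does not drop the email from the results. The caller passes the
// actual request as an async function.
async function withRetry(fn, attempts = 3, baseDelayMs = 200) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // Back off: baseDelayMs, 2x, 4x, ... before the next attempt
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts failed; surface the last error
}

module.exports = { withRetry };
```

Wrapping each API call this way would also make it easy to log which addresses permanently failed so they can be re-run later.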

I’d also be curious to see if faster results could be achieved with another language such as Golang or Erlang/Elixir. Beyond language choice, infrastructure limitations can also impact performance - we've learned this firsthand when we hit undocumented DNS limits in AWS that affected our high-volume email processing systems.

For developers interested in combining API processing with visual workflow tools, check out how to integrate Flow Builder with Google Cloud Functions for no-code automation workflows.

Please feel free to provide me any feedback or suggestions for expanding this project.
