Building an Email Archiving System: Storing the Email Body
4 Mar 2019

Key Takeaways
Purpose: This post outlines the first phase of building an email archiving system using SparkPost, Amazon S3, and MySQL. It explains how to duplicate, capture, and store emails for long-term access and compliance.
Core idea: The system automatically stores the raw email body (rfc822 format) in S3 and logs metadata (subject, sender, timestamp, etc.) into MySQL for fast searching and retrieval.
Essentials covered:
Creating duplicates for archival: Use SparkPost’s Archive feature to send identical copies of outbound emails to a designated archive address, ensuring the body and tracking links remain identical.
Data binding via UID: Embed a unique identifier (UID) into both the email body and X-MSYS-API metadata to link the original and archived messages.
Inbound processing: Configure an inbound domain and webhook in SparkPost to receive the archived email JSON payloads through an application collector.
Storing emails in S3: Upload the parsed rfc822 body to an S3 bucket, using lifecycle rules (e.g., transition to Glacier after one year) to reduce storage costs.
Logging metadata in MySQL: Save key fields such as RCPT_TO, FROM, SUBJECT, and S3 filename for search indexing and future retrieval.
Performance considerations: Code efficiency and minimal logging ensure the collector can handle hundreds of requests per minute with minimal latency.
Big picture: This foundation supports future enhancements—like log event storage, failure alerts, and UI visualization—laying the groundwork for a scalable, auditable email archive solution.
Q&A Highlights
What’s the goal of this project?
To create an automated email archiving system that stores message bodies in Amazon S3 while maintaining searchable metadata in a MySQL database.
Why use SparkPost’s Archive feature?
It allows you to generate exact duplicates of outgoing emails, preserving their structure and tracking data for compliance and review.
How is each archived email linked to its original message?
A unique UID is embedded into both the email body and metadata, enabling accurate cross-referencing between the original and archived copies.
Why use S3 for storage?
S3 offers scalable storage and lifecycle management options (like Glacier), making it cost-effective for long-term email retention.
What does the MySQL database store?
It stores searchable metadata fields—like subject line, sender, timestamps, and the S3 filename—allowing for efficient query and retrieval.
What are the next development steps?
Adding log event tracking, automated error reporting, a simplified collector, and a UI for viewing or resending archived emails.
In this blog, I will describe the process I went through to store the body of the email in S3 (Amazon's Simple Storage Service) and ancillary data in a MySQL table for easy cross-referencing. Ultimately, this is the starting point for a code base that will grow into an application for easily searching archived emails and displaying them alongside their event (log) data. The code for this project can be found in the following GitHub repository: https://github.com/jeff-goldstein/PHPArchivePlatform.
While I leverage S3 and MySQL in this project, they are by no means the only technologies that can be used to build an archiving platform, but given their ubiquity, I thought they were a good choice. In a full-scale, high-volume system I would use a higher-performance database than MySQL, but for this sample project, MySQL is perfect.
I have detailed below the steps I took in this first phase of the project:
Creating the duplicate email for archiving
Use SparkPost’s Archiving and Inbound Relay features to send a copy of the original email back to SparkPost, which processes it into a JSON structure and posts it to a webhook collector (application)
Dismantle the JSON structure to obtain the components necessary
Send the body of the email to S3 for storage
Log an entry into MySQL for each email for cross-referencing
Creating a Duplicate of the Email
Obtaining the Archive version
In order to obtain a copy of an email for archive, you need to take the following steps:
Create a subdomain that you will send all archive (duplicate) email(s) to
Set the appropriate DNS records (MX) so that all email sent to that subdomain is routed to SparkPost
Create an inbound domain in SparkPost
Create an inbound webhook in SparkPost
Create an application (collector) to receive the SparkPost webhook data stream
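Steps 3 and 4 above are one-time API calls to SparkPost. As a rough sketch of the request bodies involved (the subdomain and collector URL are placeholders, and Python is used here for brevity even though the project itself is PHP), you would POST these payloads to SparkPost's `/api/v1/inbound-domains` and `/api/v1/relay-webhooks` endpoints:

```python
import json

# Placeholders -- substitute your own archive subdomain and collector URL.
ARCHIVE_DOMAIN = "archive.example.com"
COLLECTOR_URL = "https://myapp.example.com/collector"

# Body for POST /api/v1/inbound-domains
inbound_domain_payload = {"domain": ARCHIVE_DOMAIN}

# Body for POST /api/v1/relay-webhooks -- ties inbound mail on the
# archive subdomain to the collector application.
relay_webhook_payload = {
    "name": "Archive collector",
    "target": COLLECTOR_URL,
    "match": {"protocol": "SMTP", "domain": ARCHIVE_DOMAIN},
}

print(json.dumps(inbound_domain_payload))
print(json.dumps(relay_webhook_payload))
```

Both calls are authenticated with your SparkPost API key; see the linked documentation for the exact endpoint details.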
The following two links can be used to help guide you through this process:
SparkPost technical doc: Enabling Inbound Email Relaying & Relay Webhooks
Also, the blog I wrote last year, Archiving Emails: A How-To Guide for Tracking Sent Mail will walk you through the creation of the inbound relay within SparkPost
* Note: as of October 2018, the Archive feature only works when sending emails to SparkPost over an SMTP connection; the RESTful API does not support it. That probably isn’t an issue, because most emails that need this level of audit control tend to be personalized emails that are fully built by a backend application before delivery.
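To make the SMTP-only constraint concrete, here is a minimal sketch of building such a message, with the UID embedded in both the body and the X-MSYS-API header (addresses are placeholders; Python is used for illustration, while the actual project is PHP). The `archive` key adds the duplicate recipient and `metadata` carries the UID through SparkPost's event stream:

```python
import json
import uuid
from email.message import EmailMessage

# The UID ties the delivered copy and the archived copy together.
uid = uuid.uuid4().hex

msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Your statement"
# Embed the UID in the body (in HTML mail this could be a hidden comment)...
msg.set_content(f"Hello!\n[archive-uid:{uid}]\n")
# ...and in the X-MSYS-API header for the metadata side of the binding.
msg["X-MSYS-API"] = json.dumps({
    "archive": [{"email": "archive@archive.example.com"}],
    "metadata": {"uid": uid},
})

# smtplib.SMTP("smtp.sparkpostmail.com", 587) would then send `msg`,
# authenticating with the SparkPost API key as the SMTP password.
```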
Obtaining the duplicate email in a JSON structure
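Once the inbound relay is configured, SparkPost POSTs a JSON batch to the collector for each archived message. A sketch of unpacking that payload (the field names follow SparkPost's relay-webhook format; the sample batch below is abbreviated and hypothetical, and Python stands in for the project's PHP):

```python
import json

# A trimmed example of the JSON batch SparkPost's relay webhook POSTs
# to the collector; the rfc822 body is shortened for readability.
batch = json.loads("""
[{"msys": {"relay_message": {
    "rcpt_to": "archive@archive.example.com",
    "friendly_from": "sender@example.com",
    "content": {
        "subject": "Your statement",
        "email_rfc822": "From: sender@example.com\\r\\nSubject: Your statement\\r\\n\\r\\nHello! [archive-uid:abc123]\\r\\n",
        "email_rfc822_is_base64": false
    }
}}}]
""")

for event in batch:
    message = event["msys"]["relay_message"]
    rfc822_body = message["content"]["email_rfc822"]  # stored in S3 as-is
    subject = message["content"]["subject"]           # logged to MySQL
    rcpt_to = message["rcpt_to"]                      # logged to MySQL
```

Note the `email_rfc822_is_base64` flag: when it is true, the body must be base64-decoded before storage.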
Storing the duplicate email in S3
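A compact sketch of the S3 side, assuming a date-based key layout and the UID as the filename (bucket name and layout are illustrative choices, not the project's exact scheme; Python/boto3 stands in for the PHP implementation):

```python
from datetime import datetime, timezone

def s3_key_for(uid: str, when: datetime) -> str:
    """A date-based prefix keeps the bucket browsable and pairs naturally
    with lifecycle rules (e.g. transition objects to Glacier after a year)."""
    return f"{when:%Y/%m/%d}/{uid}.eml"

def upload_body(bucket: str, key: str, rfc822_body: str) -> None:
    # boto3 is assumed available in the real collector; imported lazily here.
    import boto3
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=rfc822_body.encode("utf-8"))

key = s3_key_for("abc123", datetime(2019, 3, 4, tzinfo=timezone.utc))
# upload_body("my-archive-bucket", key, rfc822_body)  # bucket name is a placeholder
```

The lifecycle transition itself is configured once on the bucket (console or API), not per upload.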
Storing the Metadata in MySQL
We grabbed all of the necessary data in a previous step, so storing it is easy. In this first phase, I chose to build a table with the following fields:
MySQL Metadata Fields
| Field | Purpose |
| --- | --- |
| Date/time (auto) | Timestamp when the entry was logged |
| RCPT_TO address | Target email address for the archived message |
| DATE header timestamp | Original email’s sent time |
| SUBJECT header | Subject line for indexing and searching |
| FROM header | Sender identifier for lookup |
| S3 directory | Directory path inside the S3 bucket |
| S3 filename | Unique .eml file stored in S3 |
The function named MySQLLog in the upload.php application file goes through the necessary steps to open the link to MySQL, insert the new row, check the result, and close the link. For good measure, I also log this data to a text file. Should I do a lot more logging for errors? Yes. But I want to keep this code light so it runs extremely fast; at times it will be called hundreds of times per minute and needs to be as efficient as possible. In future updates, I will add ancillary code that processes failures and emails them to an admin for monitoring.
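The insert performed by MySQLLog can be sketched against the schema above. This is an illustrative Python version (the real function is PHP), with sqlite3 standing in for MySQL so the sketch is self-contained; the table and column names are assumptions, not the project's exact schema:

```python
import sqlite3

# sqlite3 stands in for MySQL here; swap in a MySQL driver in production.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE email_archive (
        logged_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        rcpt_to      TEXT,
        date_header  TEXT,
        subject      TEXT,
        from_header  TEXT,
        s3_directory TEXT,
        s3_filename  TEXT
    )
""")

def log_email(row: dict) -> None:
    """Insert one metadata row; logged_at is filled in by the database."""
    db.execute(
        "INSERT INTO email_archive"
        " (rcpt_to, date_header, subject, from_header, s3_directory, s3_filename)"
        " VALUES (:rcpt_to, :date_header, :subject, :from_header, :s3_directory, :s3_filename)",
        row,
    )
    db.commit()

log_email({
    "rcpt_to": "archive@archive.example.com",
    "date_header": "Mon, 4 Mar 2019 10:00:00 +0000",
    "subject": "Your statement",
    "from_header": "sender@example.com",
    "s3_directory": "2019/03/04",
    "s3_filename": "abc123.eml",
})
```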
Wrapping it up
So, in a few fairly easy steps, we walked through the first phase of building a robust email archiving system that stores the duplicate email in S3 and cross-referencing data in a MySQL table. This gives us a foundation for the rest of the project, which will be tackled in several future posts.
In future revisions of this project I would expect to:
Store all log events of the original email
Send storage errors to an admin when a failure to upload or log happens
Minimize the collector complexity
Add a UI for viewing all data
Support the ability to resend the email
In the meantime, I hope this project has been interesting and helpful to you; happy sending.



