Building an Email Archiving System: Part 1

Key Takeaways

Email archiving is increasingly essential for regulatory, compliance, and auditing environments.
SparkPost does not store email bodies, but its Archive feature allows senders to receive duplicate messages that mirror tracking links and content.
Email bodies can be stored in Amazon S3, while message event metadata can be stored in MySQL for querying and cross-referencing.
SparkPost message events provide rich activity logs (bounces, deliveries, clicks, opens, unsubscribes, complaints, and more).
Archival copies are only generated when emailing via SMTP.
Message events for original, archive, CC, and BCC emails share a common transmission_id.
Inbound Email Relay can ingest archived messages but does not include the transmission_id, creating a data-linking challenge.
Embedding a hidden unique identifier (UID) in the message body closes that gap and ties inbound content to outbound logs.
Combining archive emails + message events enables building a searchable, auditable archive system.
The long-term project includes code releases for storing archive emails in S3 and logging event data in MySQL.
The final application will allow easy searching, viewing, and reconciling email content with all related event history.
Ideal for compliance-heavy industries that need complete visibility into every message sent.

About a year ago I wrote a blog post on how to retrieve copies of emails for archival and viewing, but I did not broach the actual storing of the email or related data. More recently, I wrote a post on storing all of the event data for an email (i.e., when the email was sent, opens, clicks, bounces, unsubscribes, etc.) for the purpose of auditing, but chose not to create any supporting code.

With the increase of email usage in regulatory environments, I have decided it is time to start a new project that pulls all of this together with code samples on how to store the email body and all of its associated data. Over the next year, I will continue to build on this project with the aim to create a working storage and viewing application for archived emails and all log information produced by SparkPost. SparkPost does not have a system that archives the email body but it does make building an archival platform fairly easy.

In this blog series, I will describe the process I went through in order to store the email body on S3 (Amazon’s Simple Storage Service) and all relevant log data in MySQL for easy cross-referencing. For production archiving systems that require robust database backup strategies, consider implementing a comprehensive PostgreSQL backup and restore process to ensure your archival data is properly protected. Ultimately, this is the starting point for building an application that will allow for easy searching of archived emails, then displaying those emails along with the event (log) data. The code for this project can be found in the following GitHub repository: PHPArchivePlatform on GitHub

This first entry of the blog series describes the challenge and lays out an architecture for the solution. The rest of the posts will detail portions of the solution along with code samples.

The first step in my process was to figure out how I was going to obtain a copy of the email sent to the original recipient. In order to obtain a copy of the email body, you need to either:

1. Capture the email body before sending the email
2. Get the email server to store a copy
3. Have the email server create a copy for you to store

**Email Body Capture Options**

| Method | Who creates the copy | Reflects tracking changes | Automation friendly | Used in this solution |
| --- | --- | --- | --- | --- |
| Capture before send | Application | ❌ No | ✅ Yes | ❌ |
| Email server stores copy | Mail server | ✅ Yes | ❌ Limited | ❌ |
| SparkPost Archive feature | SparkPost | ✅ Yes | ✅ Yes | ✅ |

If the email server is adding items like link tracking or open tracking, you can’t use #1 because it won’t reflect the open/click tracking changes.

That means that either the server has to store the email or somehow offer a copy of that email to you for storage. Since SparkPost does not have a storage mechanism for email bodies but does have a way to create a copy of the email, we will have SparkPost send us a duplicate of the email for us to store in S3.

This is done by using SparkPost’s Archive feature. SparkPost’s Archive feature gives the sender the ability to tell SparkPost to send a duplicate of the email to one or more email addresses and use the same tracking and open links as the original. SparkPost documentation defines their Archive feature in the following manner:

Recipients in the archive list will receive an exact replica of the message that was sent to the RCPT TO address. In particular, any encoded links intended for the RCPT TO recipient will be identical in the archive messages

The only difference from the RCPT TO email is that some of the headers will differ, since the target address for the archive email is different, but the body of the email will be an exact replica!

If you want a deeper explanation, see the SparkPost documentation on creating duplicate (or archive) copies of an email.

As a side note, SparkPost actually allows you to send emails to cc, bcc, and archive email addresses. For this solution, we are focused on the archive addresses.

* Notice * Archived emails can ONLY be created when injecting emails into SparkPost via SMTP!

Now that we know how to obtain a copy of the original email, we need to look at the log data that is produced and some of the subtle nuances within that data. SparkPost tracks everything that happens on its servers and offers that information up to you in the form of message-events. Those events are stored on SparkPost for 10 days and can be pulled from the server via a RESTful API called message-events, or you can have SparkPost push those events to any number of collecting applications that you wish. The push mechanism is done through webhooks and is done in real time.

Currently, there are 14 different events that may happen to an email. Here is a list of the current events:

Bounce
Click
Delay
Delivery
Generation Failure
Generation Rejection
Initial Open
Injection
Link Unsubscribe
List Unsubscribe
Open
Out of Band
Policy Rejection
Spam Complaint

* Follow this link for an up to date reference guide for a description of each event along with the data that is shared for each event.

Each event has numerous fields that match the event type. Some fields like the transmission_id are found in every event, but other fields may be more event-specific; for example, only open and click events have geotag information.

**Identifiers Used in the Archiving System**

| Identifier | Where it originates | Shared across | Purpose | Limitation |
| --- | --- | --- | --- | --- |
| transmission_id | SparkPost outbound | Original, archive, cc, bcc | Correlates all message events | Not available in inbound relay |
| message_id | SparkPost outbound | Original + archive | Identifies individual messages | Different for cc/bcc |
| Hidden UID | Injected by sender | Outbound + inbound | Links archived email body to events | Must be custom-implemented |

One message event entry is especially important to this project: the transmission_id. All of the message event entries for the original email, the archived email, and any cc and bcc addresses will share the same transmission_id.

There is also a common entry called the message_id that will have the same id for each entry of the original email and the archived email. Any cc or bcc addresses will have their own id for the message_id entry.
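To make the relationship between the two identifiers concrete, here is a minimal Python sketch that groups a batch of events first by transmission_id and then by message_id. The sample batch is made up and heavily trimmed, and the code assumes each item wraps a single event object under a `msys` key; check the event reference for the exact payload you receive.

```python
from collections import defaultdict

# Hypothetical, trimmed-down batch of events. Real events carry many more fields.
SAMPLE_BATCH = [
    {"msys": {"message_event": {"type": "injection", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "customer@example.com"}}},
    {"msys": {"message_event": {"type": "injection", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "archive@example.com"}}},
    {"msys": {"message_event": {"type": "injection", "transmission_id": "T1",
                                "message_id": "M2", "rcpt_to": "cc@example.com"}}},
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "customer@example.com"}}},
    {"msys": {"track_event": {"type": "open", "transmission_id": "T1",
                              "message_id": "M1", "rcpt_to": "customer@example.com"}}},
]


def group_by_transmission(batch):
    """Return {transmission_id: {message_id: [event types in arrival order]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for item in batch:
        # Assumption: one event object per item, nested under "msys".
        for event in item.get("msys", {}).values():
            transmission_id = event.get("transmission_id")
            if transmission_id is None:
                continue  # nothing to correlate on
            grouped[transmission_id][event.get("message_id")].append(event["type"])
    return {tid: dict(messages) for tid, messages in grouped.items()}


if __name__ == "__main__":
    for tid, messages in group_by_transmission(SAMPLE_BATCH).items():
        print(tid, messages)
```

In this sample, the original and archive copies land under the same message_id, while the cc copy gets its own, and all of them sit under one transmission_id.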

So far this sounds great and, frankly, fairly easy, but now comes the challenging part. Remember, in order to get the archive email, we have SparkPost send a duplicate of the original email to another email address, which corresponds to some inbox that you have access to. To automate this solution and store the email body, I’m going to use another SparkPost feature called Inbound Email Relaying. It takes all emails sent to a specific domain and processes them: it rips each email apart and creates a JSON structure, which is then delivered to an application via a webhook. See Appendix A for a sample JSON.

If you look carefully, you will notice that the JSON structure from the inbound relay is missing a very important field: the transmission_id. All of the outbound emails carry the same transmission_id, which binds together the data from the original email, archive, cc, and bcc addresses, but SparkPost has no way to know that the email captured by the inbound process is connected to any of the outbound emails. The inbound process simply knows that an email was sent to a specific domain and that it should parse the email. That’s it. It will treat any email sent to that domain the same way, be it a reply from a customer or the archive email sent from SparkPost.

So the trick is: how do you glue the outbound data to the inbound process that just grabbed the archived version of the email? I decided to hide a unique ID (UID) in the body of the email. How this is done is up to you, but I simply created an input field marked as hidden.
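As a rough illustration of the idea, the Python sketch below generates a UID, wraps it in a hidden input field, and later recovers it from an HTML body such as the one the inbound process hands back. The field name `archive_uid` and the helper names are hypothetical choices for this example, not something SparkPost or this project prescribes.

```python
import uuid
from html.parser import HTMLParser

FIELD_NAME = "archive_uid"  # hypothetical field name, chosen for this sketch


def new_uid():
    """Generate a 32-character hex identifier for one outbound email."""
    return uuid.uuid4().hex


def hidden_field(uid, name=FIELD_NAME):
    """Return the hidden input element to place somewhere in the HTML body."""
    return f'<input type="hidden" name="{name}" value="{uid}">'


class _UidFinder(HTMLParser):
    """Remembers the value of the first <input> whose name matches."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.uid = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == self.name and self.uid is None:
            self.uid = attributes.get("value")


def extract_uid(html, name=FIELD_NAME):
    """Return the embedded UID, or None if the email does not carry one."""
    finder = _UidFinder(name)
    finder.feed(html)
    return finder.uid


if __name__ == "__main__":
    uid = new_uid()
    body = f"<html><body><p>Hello!</p>{hidden_field(uid)}</body></html>"
    print(extract_uid(body) == uid)
```

Returning None for bodies without the field matters here, because the inbound domain will also receive ordinary mail, such as customer replies, that never carried a UID.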

I also added that UID to the metadata block of the X-MSYS-API header, which is passed to SparkPost during injection. This hidden UID ends up being the glue for the whole process; it is a main component of the project and will be discussed in depth in the following blog posts.
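A minimal Python sketch of what such a message might look like before it is injected over SMTP follows. It assumes the X-MSYS-API header takes a JSON object with a `metadata` block and an `archive` list of `{"email": ...}` entries; confirm the exact field layout against the SparkPost SMTP API documentation. The addresses, subject, and the `archive_uid` key are placeholders invented for this example.

```python
import json
from email.message import EmailMessage


def build_message(uid, sender, recipient, archive_address):
    """Build an email that carries the UID in both the body and the header."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your statement is ready"  # placeholder subject

    # Assumed header layout: a metadata block plus a list of archive recipients.
    msys_api = {
        "metadata": {"archive_uid": uid},
        "archive": [{"email": archive_address}],
    }
    msg["X-MSYS-API"] = json.dumps(msys_api)

    # The same UID is hidden in the HTML body so the inbound copy carries it too.
    html_body = (
        "<html><body><p>Hello!</p>"
        f'<input type="hidden" name="archive_uid" value="{uid}">'
        "</body></html>"
    )
    msg.set_content("Hello!")  # plain-text fallback
    msg.add_alternative(html_body, subtype="html")
    return msg


if __name__ == "__main__":
    message = build_message(
        "abc123", "sender@example.com", "customer@example.com", "archive@inbound.example.com"
    )
    print(message["X-MSYS-API"])
```

Sending the message would then go through an SMTP client such as `smtplib`, using your own SparkPost SMTP host and credentials, which are left out here.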

Now that we have the UID that will glue this project together and understand why it’s necessary, I can start to build the vision of the overall project and corresponding blog posts.

Capture and store the archive email, along with a database entry for searching/indexing
Capture all message event data
Create an application to view the email and all corresponding data
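To give a feel for how these three pieces could fit together, here is a small Python sketch of the cross-referencing step. It uses an in-memory SQLite database purely as a stand-in for MySQL so the example runs anywhere. The table names, column names, and S3 key format are all hypothetical, and the actual S3 upload is omitted.

```python
import sqlite3


def create_schema(conn):
    """Create one table for archived bodies and one for message events."""
    conn.executescript(
        """
        CREATE TABLE archive (
            uid     TEXT PRIMARY KEY,   -- hidden UID recovered from the inbound copy
            s3_key  TEXT NOT NULL,      -- where the email body was stored
            subject TEXT
        );
        CREATE TABLE events (
            id              INTEGER PRIMARY KEY,
            uid             TEXT NOT NULL,  -- UID passed in the outbound metadata
            transmission_id TEXT NOT NULL,
            type            TEXT NOT NULL,
            rcpt_to         TEXT,
            event_time      INTEGER
        );
        """
    )


def history_for(conn, uid):
    """Return (s3_key, event type, recipient) rows for one archived email."""
    return conn.execute(
        """
        SELECT a.s3_key, e.type, e.rcpt_to
        FROM archive AS a
        JOIN events AS e ON e.uid = a.uid
        WHERE a.uid = ?
        ORDER BY e.event_time, e.id
        """,
        (uid,),
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO archive VALUES ('abc123', 'archive/abc123.eml', 'Hello')")
    conn.executemany(
        "INSERT INTO events (uid, transmission_id, type, rcpt_to, event_time) VALUES (?, ?, ?, ?, ?)",
        [
            ("abc123", "T1", "injection", "customer@example.com", 100),
            ("abc123", "T1", "delivery", "customer@example.com", 105),
        ],
    )
    for row in history_for(conn, "abc123"):
        print(row)
```

The point of the sketch is the join: the UID is the only column the archived body and the event rows have in common, which is why it has to travel with both.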

Here is a simple diagram of the project:

The first drop of code will cover the archive process and storing the email onto S3, while the second code drop will cover storing all of the log data from message-events into MySQL. You can expect the first two code drops and blog entries sometime in early 2019. If you have any questions or suggestions, please feel free to pass them along.

Happy Sending.
– Jeff

Appendix A:

Q&A

Why build your own email archiving system?

Regulated industries often require long-term storage of both the email body and all associated event logs. SparkPost does not store message bodies, so building a custom system ensures compliance, auditing, and visibility.

How do you obtain an exact copy of the original sent email?

SparkPost’s Archive feature sends a duplicate of every outbound email to designated archive addresses, preserving all encoded links and tracking behaviors.

Why can’t you capture the email body before sending?

Pre-send capture doesn’t include SparkPost’s modifications (open tracking, click tracking, link encoding). Using Archive copies ensures your saved version exactly matches what recipients receive.

Does SparkPost archive emails automatically?

No. SparkPost does not store message bodies. Archive copies must be requested by specifying archive addresses during SMTP injection.

What is stored where in this archiving system?

Email body → Amazon S3
Message event logs → MySQL
This separation supports fast search, structured queries, and inexpensive object storage.

How long does SparkPost retain event data?

SparkPost stores message events for 10 days. After that, the data must be ingested via webhook or queried and stored elsewhere.

What message events are available?

SparkPost currently exposes 14 events, including deliveries, bounces, clicks, opens, rejections, policy issues, spam complaints, unsubscribes, and more.

What identifiers tie all events together?

All outbound messages (original, archive, CC, BCC) share the same transmission_id. The original and archive email also share the same message_id.

Why is inbound processing a challenge?

SparkPost’s Inbound Email Relay converts inbound email into JSON, but this JSON does not include transmission_id. Without additional data, the inbound copy cannot be linked to its outbound log history.

How do you connect inbound archive emails to outbound message events?

Embed a hidden unique identifier (UID) in the email body and pass the same UID in the metadata. This UID becomes the shared reference across inbound and outbound records.

How does Inbound Email Relay help automate archiving?

It receives archive emails sent to your archival domain, parses them into structured JSON, and posts them to your application via webhook—allowing automated extraction and storage.

What is the long-term vision of the project?

A complete application that:

Stores archive emails in S3
Stores all event logs in MySQL
Lets users search for emails
Displays the original email and every associated event in one unified interface
