The mbox Format

How email clients store mail on your hard disk


When you look at your email client, you will usually find your email messages stored in folders. There's one folder for Fred, one for the duck-painters mailing list, another for work and so on. This is how mail storage looks in the email program. But how does the program store the messages on the hard disk?

Email Storage

It would be possible to keep all email messages in one file, together with information about which folder each message should appear in. This kind of mail storage could be, and sometimes is, implemented as a database.

Another approach would create one file for every email message and arrange these files in a file system hierarchy that mirrors the folders in the user interface. While this approach may have security benefits, it requires a lot of hard disk activity, which is slow.

Somewhere in between those two approaches lies a third way of storing email messages. We create one file for every folder and put all of that folder's messages into it. Many email clients store mail this way, and the format is called the mbox format.
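To get a feel for the one-file-per-folder layout, here is a minimal sketch that opens such a file with the mailbox module from Python's standard library and lists the subject of every message in it. The function name list_subjects is only an illustration and is not part of the format; the path you pass in is assumed to point at an existing mbox file.

```python
import mailbox

def list_subjects(path):
    """Return the Subject header of every message in the mbox file at path."""
    # create=False: raise an error instead of silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        # Iterating over the mailbox yields one message object per stored email.
        return [message["subject"] for message in box]
    finally:
        box.close()
```

One call per folder file is enough here, because each file holds all messages of exactly one folder.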

The mbox Format

If we use the mbox format to store emails, we put all of them in one file. This creates a more or less long text file containing one email message after the other. (Internet email traditionally exists as 7-bit ASCII text only; everything else, attachments for example, is encoded.) How do we know where one message ends and the next one starts?

Fortunately, every message in an mbox file has a From-line at its very beginning. This line starts with "From " (the word From followed by a white space character) and is also called the "From_" line. If this sequence ("From ") appears at the beginning of a line that is preceded by an empty line, or at the very top of the file, we have found the beginning of a message.

So what we look for when parsing an mbox file is, essentially, an empty line followed by "From ".

As a regular expression, we can write this as "\n\nFrom .*\n". The very first message is the only exception: it has no empty line before it and merely starts with "From " at the beginning of the file ("^From .*\n").
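The two patterns above can be combined into one and used to cut a file into messages. The following is a minimal sketch, not a complete mbox parser: it assumes the whole file fits in memory as a string with "\n" line endings, and the names SEPARATOR and split_mbox are made up for this example.

```python
import re

# A From-line either sits at the very top of the file (\A) or directly
# follows an empty line (the two newlines matched by the lookbehind).
SEPARATOR = re.compile(r"(?:\A|(?<=\n\n))From [^\n]*\n")

def split_mbox(text):
    """Return a list of (from_line, message_text) pairs found in text."""
    matches = list(SEPARATOR.finditer(text))
    messages = []
    for index, match in enumerate(matches):
        # A message runs until the next From-line or the end of the file.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(text)
        from_line = match.group(0).rstrip("\n")
        messages.append((from_line, text[match.end():end]))
    return messages
```

For a file that contains two messages, split_mbox returns two pairs, each holding the From-line and the headers and body that follow it.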

"From " in the Body

What if exactly the sequence above appears in the body of an email message? What if the following is part of an email?

...I send you the most recent report.

From this report, you need not...

Here, we have an empty line followed by "From " at the beginning of the line. If this appears in an mbox file, we unmistakably have the beginning of a new message. At least that's what the parser thinks. Both the email client and we would be quite confused by an email message that contains neither sender nor recipient, but begins with "From this report".

To avoid such disastrous conditions, we need to make sure "From " never appears at the beginning of a line following an empty line in the body of an email.

Whenever we add a new message to an mbox file, we look for such sequences in the body and simply replace "From" with ">From". The parser can then no longer mistake these lines for the start of a message. The example above now looks like this and no longer triggers the parser:

...I send you the most recent report.

>From this report, you need not...

This is why you may sometimes find ">From" in an email where you'd expect a mere "From".
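The quoting step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code of any particular email client: the function name quote_from_lines is made up, the body is assumed to use "\n" line endings, and the first line of the body is treated as following an empty line (the one that separates the headers from the body).

```python
def quote_from_lines(body):
    """Prefix ">" to every body line that a parser could take for a From-line."""
    quoted = []
    # Assumption: the body starts right after the empty line that ends the headers.
    previous_empty = True
    for line in body.split("\n"):
        if previous_empty and line.startswith("From "):
            line = ">" + line
        quoted.append(line)
        previous_empty = (line == "")
    return "\n".join(quoted)
```

Applied to the example, the line "From this report, you need not..." becomes ">From this report, you need not...", while a "From " that does not follow an empty line is left alone.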

