I recently set up my very own mailserver, to host the toadjaune.eu domain.

I followed the excellent tutorial at workaround.org, and don’t have much to add regarding the mail server itself.

Target audience

  • Someone who followed the workaround.org tutorial, has access to an icinga2 instance, and wants to use the latter to monitor the former
  • Someone interested in monitoring a mailserver in general (skim through the config examples, you’ll be fine)
  • Me, in a few months, when this contraption I built breaks

Why is monitoring so important ?

At some point, the guide says :

Setting up proper monitoring will easily fill a guide similar to the size of the one you are currently reading.

While I monitor most of my self-hosted services regardless of their importance, email is especially important, even “just” for personal email :

  • If you make a mistake and start spamming people, you can quickly get blacklisted. Re-building your reputation can then take a very long time (we’re talking months)
  • Nobody wants to lose email. Losing email would be terrible. Right ? A kitten dies every time you miss a legitimate email. Probably.

On the other hand, monitoring a mailserver is way harder than monitoring most services.

  • Many different components interact in subtle ways, so there are a lot of things to monitor
  • Monitoring your server alone is not enough, you need to make sure other mailservers still agree to talk to it

Okay, well, let’s get started, then !

A few notes on my specific email setup

While my setup mostly follows the workaround.org tutorial, there are a few noteworthy differences :

  • I’m not exposing the plaintext IMAP port (143), only the IMAPS port (993)
  • I’m not exposing the plaintext submission port (587), but the submission over TLS port (465) instead
  • No webmail, no web admin

Why ? I neither like nor trust the STARTTLS mechanism. The details are way off-topic, and will require a separate blogpost.

I’m using Icinga to monitor my infrastructure.

Basically, it allows me to execute checks made of arbitrary code (usually a shell script or a monitoring plugin), either on the monitored host or somewhere else, to check something “from the outside”. This guide focuses on Icinga, but the general approach should not be hard to transpose to any other monitoring system.

Active probing

This category of monitoring consists in sending something to the service, and observing its reaction.

Basic active probing

That’s how we’re gonna get started, with the simplest check I can possibly think of : are the TCP ports open ?

Doesn’t check much, but if this one fails… well, things are really broken.

Many values in the following config examples have been redacted. Edit them as needed !

apply Service "mail-tcp-smtp" {
  check_command = "tcp"
  vars.tcp_port = 25
  assign where host.name == "xxxx"
}

apply Service "mail-tcp-submission" {
  check_command = "tcp"
  vars.tcp_port = 465
  assign where host.name == "xxxx"
}

apply Service "mail-tcp-imaps" {
  check_command = "tcp"
  vars.tcp_port = 993
  assign where host.name == "xxxx"
}

Well, that works.

But we can do way better without much effort.

We can actually check there’s an SMTP/IMAP server answering !

apply Service "mail-tcp-smtp" {
  check_command = "tcp"
  vars.tcp_port = 25
  vars.tcp_expect = "220" # 220 is the SMTP code beginning the first line sent by the server
  assign where host.name == "xxxx"
}

apply Service "mail-tcp-submission" {
  check_command = "tcp"
  vars.tcp_port = 465
  vars.tcp_expect = "220" # 220 is the SMTP code beginning the first line sent by the server
  vars.tcp_ssl = true
  assign where host.name == "xxxx"
}

apply Service "mail-tcp-imaps" {
  check_command = "tcp"
  vars.tcp_port = 993
  vars.tcp_expect = "OK" # The IMAP greeting starts with "* OK"
  vars.tcp_ssl = true
  assign where host.name == "xxxx"
}

Notice how we’re establishing a TLS session for ports 465 and 993.

This makes this test implicitly richer : we’re checking the certificate validity at the same time.
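
To make it clearer what such a probe does, here is a minimal Python sketch of the same idea : open a TLS connection, read the greeting, and look for the expected string. It is only an illustration, not how the tcp plugin is implemented. The function names are mine, and Python’s default TLS context verifies the certificate chain and hostname, which may be stricter than what the plugin checks.

```python
import socket
import ssl


def banner_matches(banner: bytes, expect: str) -> bool:
    """Return True if the first line sent by the server contains `expect`."""
    first_line = banner.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    return expect in first_line


def probe_tls_banner(host: str, port: int, expect: str, timeout: float = 10.0) -> bool:
    """Open a TLS connection, read the greeting, and compare it to `expect`.

    ssl.create_default_context() verifies the certificate chain and the
    hostname, so an expired or mismatched certificate raises ssl.SSLError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return banner_matches(tls.recv(1024), expect)
```

Something like probe_tls_banner("mail.example.com", 465, "220") would then mirror the submission check, and probe_tls_banner("mail.example.com", 993, "OK") the IMAPS one.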

Let’s throw in a few more lines of configuration to be more restrictive on expected latency and remaining cert duration, and we’ll be done with those basic tests :

apply Service "mail-tcp-smtp" {
  check_command = "tcp"
  vars.tcp_port = 25
  vars.tcp_expect = "220" # 220 is the SMTP code beginning the first line sent by the server
  vars.tcp_wtime = "0.1" # Warn above 100ms
  assign where host.name == "xxxx"
}

apply Service "mail-tcp-submission" {
  check_command = "tcp"
  vars.tcp_port = 465
  vars.tcp_expect = "220" # 220 is the SMTP code beginning the first line sent by the server
  vars.tcp_ssl = true
  vars.tcp_wtime = "0.1" # Warn above 100ms
  vars.tcp_certificate = "15,1" # Warn if cert has less than 15 days left, crit if less than 1 day
  assign where host.name == "xxxx"
}

apply Service "mail-tcp-imaps" {
  check_command = "tcp"
  vars.tcp_port = 993
  vars.tcp_expect = "OK" # The IMAP greeting starts with "* OK"
  vars.tcp_ssl = true
  vars.tcp_wtime = "0.1" # Warn above 100ms
  vars.tcp_certificate = "15,1" # Warn if cert has less than 15 days left, crit if less than 1 day
  assign where host.name == "xxxx"
}
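
For reference, the two thresholds boil down to very simple comparisons. The sketch below shows my reading of them in Python : the function names and the 0/1/2 constants (the usual plugin exit codes) are my own, and the date format is the one Python’s ssl module reports for a certificate’s notAfter field.

```python
import ssl

OK, WARNING, CRITICAL = 0, 1, 2


def latency_state(elapsed_s: float, warn_s: float = 0.1) -> int:
    """Warn when the connection took longer than warn_s seconds."""
    return WARNING if elapsed_s > warn_s else OK


def certificate_state(not_after: str, now: float,
                      warn_days: int = 15, crit_days: int = 1) -> int:
    """Classify a certificate from its notAfter date and the current time."""
    days_left = (ssl.cert_time_to_seconds(not_after) - now) / 86400
    if days_left < crit_days:
        return CRITICAL
    if days_left < warn_days:
        return WARNING
    return OK
```

With a certificate expiring in 10 days, certificate_state returns WARNING ; with one expiring in an hour, CRITICAL.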

Testing SMTP and IMAP functionality

While checking that services answer connection requests correctly goes a long way, it’s far from enough. A whole bunch of failure modes would still go unnoticed :

  • Authentication db issues
  • Rejecting all email
  • etc…

We’ll need actual SMTP and IMAP clients for this. Fortunately, there are community-submitted icinga modules for both SMTP and IMAP. They are quite basic, but will be enough.

With those, we can easily :

  • Check that we can actually send an email via SMTP
  • Check that we can see the content of a mailbox

Install them manually, create a /etc/msmtprc according to the example or the docs, then use the following icinga config :

object CheckCommand "check_smtp" {
  command = [ "/opt/check_smtp/check_smtp.sh" ]
  arguments += {
    "-A" = "$smtp_account$" # msmtp account (configured in /etc/msmtprc)
    "-R" = "$smtp_recipient$"
    "-S" = "Icinga Test" # Message subject
    "-M" = "This is an Icinga automated test" # Message content
    "-w" = "500"  # Warn above 500ms
    "-c" = "1000" # Critical above 1s
  }
}

object CheckCommand "check_mailbox" {
  command = [ "/opt/check_mailbox/check_mailbox.sh" ]
  arguments += {
    "-H" = "$mailbox_host$"
    "-C" = "$mailbox_credentials$"
    "-M" = "$mailbox_mailbox$"
    "-w" = "500"  # Warn above 500ms
    "-c" = "1000" # Critical above 1s
  }
}

apply Service "send-email-local" {
  check_command = "check_smtp"
  vars.smtp_account = "icinga-test-local"
  vars.smtp_recipient = "icinga-test-local@example.com"
  assign where host.name == "xxxx"
}

# Check the mailbox can be read
apply Service "read-email-local" {
  check_command = "check_mailbox"
  vars.mailbox_host = "imaps://mail.example.com"
  vars.mailbox_credentials = "icinga-test-local@example.com:MySuperPassword"
  vars.mailbox_mailbox = "INBOX"
  assign where host.name == "xxxx"
}
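
If you’d rather not depend on those scripts, the same two checks are easy to sketch with Python’s standard library. This is not what check_smtp.sh and check_mailbox.sh do internally (I’m only reproducing their visible behaviour), and every function name below is made up for the example. The ports are the ones from my setup.

```python
import imaplib
import smtplib
import time
from email.message import EmailMessage

OK, WARNING, CRITICAL = 0, 1, 2


def state_from_latency(elapsed_ms: float, warn_ms: float = 500, crit_ms: float = 1000) -> int:
    """Same thresholds as the -w / -c arguments above."""
    if elapsed_ms > crit_ms:
        return CRITICAL
    if elapsed_ms > warn_ms:
        return WARNING
    return OK


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Icinga Test"
    msg.set_content("This is an Icinga automated test")
    return msg


def send_test_email(host: str, user: str, password: str, recipient: str) -> int:
    """Submit one test message over implicit TLS (port 465) and time it."""
    start = time.monotonic()
    with smtplib.SMTP_SSL(host, 465, timeout=10) as smtp:
        smtp.login(user, password)
        smtp.send_message(build_test_message(user, recipient))
    return state_from_latency((time.monotonic() - start) * 1000)


def count_messages(host: str, user: str, password: str, mailbox: str = "INBOX") -> int:
    """Log in over IMAPS (port 993) and return the number of messages in a mailbox."""
    with imaplib.IMAP4_SSL(host, 993) as imap:
        imap.login(user, password)
        status, data = imap.select(mailbox, readonly=True)
        if status != "OK":
            raise RuntimeError(f"cannot select {mailbox}: {data!r}")
        return int(data[0])
```

Any exception (refused login, rejected message, TLS error) should be treated as a critical state by whatever wraps these functions.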

Testing the entire delivery chain locally

The new SMTP and IMAP checks we just added are nice, but still insufficient. Let’s tackle a few more failure modes :

  • Email being accepted but not delivered correctly
  • IMAP not showing new messages
  • Etc…

So, we’d like to send an email, and make sure we actually receive it.

We already know how to send an email through a check, and check for email presence in a mailbox.

So what we’re missing is a way to make sure the email we find is the newly arrived one. We’ll do that simply by checking that the mailbox is not empty, while automatically deleting messages older than a certain age.

To do that, we use the autoexpunge feature of dovecot, combined with a sieve filter. It’s neither elegant nor really robust, but it has served me well so far.

# /etc/dovecot/conf.d/15-mailboxes.conf
namespace inbox {
  […]
  # Custom mailbox added for server monitoring account
  mailbox autoexpunge_6m {
    autoexpunge = 6m
  }
}

# /var/vmail/example.com/icinga-test-local/.dovecot.sieve
# NB : in case of sieve errors, dovecot puts logs in
# /var/vmail/example.com/icinga-test-local/.dovecot.sieve.log
require ["fileinto", "mailbox"];
if header :is "subject" "Icinga Test" {
  fileinto :create "autoexpunge_6m";
}

This being done, we need to slightly adapt the previous configuration to check for the number of messages, and change the target mailbox :

object CheckCommand "check_smtp" {
  command = [ "/opt/check_smtp/check_smtp.sh" ]
  arguments += {
    "-A" = "$smtp_account$" # msmtp account (configured in /etc/msmtprc)
    "-R" = "$smtp_recipient$"
    "-S" = "Icinga Test" # Message subject
    "-M" = "This is an Icinga automated test" # Message content
    "-w" = "500"  # Warn above 500ms
    "-c" = "1000" # Critical above 1s
  }
}

object CheckCommand "check_mailbox" {
  command = [ "/opt/check_mailbox/check_mailbox.sh" ]
  arguments += {
    "-H" = "$mailbox_host$"
    "-C" = "$mailbox_credentials$"
    "-M" = "$mailbox_mailbox$"
    "-w" = "500"  # Warn above 500ms
    "-c" = "1000" # Critical above 1s
    "-n" = "1"    # Minimum expected number of emails
    "-N" = "2"    # Maximum expected number of emails
  }
}

apply Service "send-email-local" {
  check_command = "check_smtp"
  vars.smtp_account = "icinga-test-local"
  vars.smtp_recipient = "icinga-test-local@example.com"
  assign where host.name == "xxxx"
}

# Check the mailbox we send emails to contains email
apply Service "read-email-local" {
  check_command = "check_mailbox"
  vars.mailbox_host = "imaps://mail.example.com"
  vars.mailbox_credentials = "icinga-test-local@example.com:MySuperPassword"
  vars.mailbox_mailbox = "autoexpunge_6m"
  assign where host.name == "xxxx"
}

Overall, the sequence of events looks like this :

  • Every 5 minutes, the send-email-local check sends an email from icinga-test-local@example.com to itself, via SMTP
  • Dovecot receives it, executes the sieve script, and files the mail into the autoexpunge_6m mailbox
  • Every 5 minutes, the read-email-local check connects via IMAP, and alerts unless it finds 1 or 2 messages.
  • Every action manipulating the mailbox (either receiving or reading) deletes any mail older than 6 minutes.

You’ll notice the 1-minute difference between check frequency and retention. That’s because we can’t synchronize the two checks, and there would be a race condition between delivery and deletion if everything were set to 5 minutes. Allowing a bit of overlap gets rid of this issue.
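
If you want to convince yourself, the timing argument can be checked with a tiny simulation. This is a simplified model, not a description of dovecot’s internals : I assume one test email per 5-minute interval, a delivery delay of a few seconds (alternating between 1s and 5s, values picked arbitrarily), and that a message is gone as soon as it is older than the retention.

```python
CHECK_INTERVAL = 300  # seconds between two test emails (5 minutes)


def arrival_times(n, jitter=(1, 5)):
    """Arrival time of each test email : one per interval, plus a small delivery delay."""
    return [k * CHECK_INTERVAL + jitter[k % len(jitter)] for k in range(n)]


def message_counts(retention, n=20):
    """Number of messages a read check would see, for every possible second."""
    arrivals = arrival_times(n)
    counts = []
    for now in range(arrivals[0], arrivals[-1]):
        visible = sum(1 for a in arrivals if a <= now and now - a <= retention)
        counts.append(visible)
    return counts


print(sorted(set(message_counts(360))))  # 6 minutes of retention
print(sorted(set(message_counts(300))))  # 5 minutes of retention
```

With 6 minutes of retention the check always sees 1 or 2 messages. With 5 minutes, there are moments where the previous message has already expired and the next one has not arrived yet, so the mailbox is briefly empty and the check would raise a false alert.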

Testing the entire chain through an external mail provider

With all the tests we have now, we’re pretty confident that the basic email workflow works correctly.

Locally.

Most of us will want to also exchange email with people on other mailservers, right ?

That’s where email gets tricky, there’s so much stuff you can get wrong ! SPF, DKIM, DMARC, IP blacklists, domain blacklists…

So instead of trying to check all of this ourselves, why not let someone else do it ? Google, for example.

After all, if Google accepts my emails, and can send me some, that’s a good indication that everything is fine.

So, how are we going to do that ?

Well, we have almost everything we need, we’ll just add a gmail address somewhere in the workflow of the previous step :

  • A periodic check connects to your server as icinga-test-gmail@example.com, and asks to send an email to example.icinga.test@gmail.com
  • SMTP magic happens
  • The test email is delivered to example.icinga.test@gmail.com
  • gmail forwards it back to icinga-test-gmail@example.com
  • SMTP magic happens
  • The test email is delivered to icinga-test-gmail@example.com
  • Sieve files it in the autoexpunge_25h mailbox
  • A periodic check verifies that the test mail is present
  • Test emails older than 25h are deleted whenever someone reads/writes to the autoexpunge_25h mailbox

Funnily enough, all we need to do for that is :

  • Create a test gmail account
  • Configure it to forward emails back to your test address
  • Change the test email recipient
  • Reduce the frequency of mail sending and cleanups to avoid being considered spam
  • Change the mail title and content to avoid being considered spam

That’s it !

Here is the final configuration for this setup :

# /etc/dovecot/conf.d/15-mailboxes.conf
namespace inbox {
  […]
  # Custom mailbox added for server monitoring account
  mailbox autoexpunge_25h {
    autoexpunge = 25h
  }
}

# /var/vmail/example.com/icinga-test-gmail/.dovecot.sieve
# NB : in case of sieve errors, dovecot puts logs in
# /var/vmail/example.com/icinga-test-gmail/.dovecot.sieve.log
require ["fileinto", "mailbox"];
if header :is "subject" "Icinga monitoring report" {
  fileinto :create "autoexpunge_25h";
}

# Icinga config
object CheckCommand "check_smtp" {
  command = [ "/opt/check_smtp/check_smtp.sh" ]
  arguments += {
    "-A" = "$smtp_account$" # msmtp account (configured in /etc/msmtprc)
    "-R" = "$smtp_recipient$"
    "-S" = "Icinga monitoring report" # Message subject
    # Message content
    # We pretend to send some kind of report to avoid being flagged as spam
    "-M" = {{{
List of hosts currently down :

None

See https://icinga.example.com for extra details.
           }}}
    "-w" = "500"  # Warn above 500ms
    "-c" = "1000" # Critical above 1s
  }
}

object CheckCommand "check_mailbox" {
  command = [ "/opt/check_mailbox/check_mailbox.sh" ]
  arguments += {
    "-H" = "$mailbox_host$"
    "-C" = "$mailbox_credentials$"
    "-M" = "$mailbox_mailbox$"
    "-w" = "500"  # Warn above 500ms
    "-c" = "1000" # Critical above 1s
    "-n" = "1"    # Minimum expected number of emails
    "-N" = "2"    # Maximum expected number of emails
  }
}

apply Service "send-email-gmail" {
  check_command = "check_smtp"
  vars.smtp_account = "icinga-test-gmail"
  vars.smtp_recipient = "example.icinga.test@gmail.com"
  assign where host.name == "xxxx"
}

# Check the mailbox we send emails to contains email
apply Service "read-email-gmail" {
  check_command = "check_mailbox"
  vars.mailbox_host = "imaps://mail.example.com"
  vars.mailbox_credentials = "icinga-test-gmail@example.com:MySuperPassword"
  vars.mailbox_mailbox = "autoexpunge_25h"
  assign where host.name == "xxxx"
}

You’ll notice I chose to pretend to send some kind of monitoring report to avoid spam detection. That has worked fine so far, and I hope it stays that way.

I especially hope I won’t have to disable spam filtering on the example.icinga.test@gmail.com email address, as doing so could hide issues with domain/IP reputation.

(edit : well that didn’t work. I eventually had to explicitly allow those messages on the gmail side.)

That’s all the active monitoring I’ll be doing for now. In the end I kept all the steps described here in parallel, except the second, which was really redundant with the local mail loop.

Metric-based monitoring

As opposed to active probing, metric-based monitoring is purely passive. It will monitor various aspects of the system without interacting directly with it nor changing its behaviour.

Classic system monitoring

Here we’re talking about the classic system metrics : CPU/RAM usage, load average, disk IO/space, etc…

While in my opinion those checks are not really good at telling whether a service behaves well, they basically come for free, and can catch some failure modes (best example being running out of disk space).

Just set them up the same way you would on any other host, and let’s move on to something more interesting.

“Business” metrics

This is the most advanced type of monitoring I can think of. Arguably one of the hardest to set up, but also one of the most effective forms of monitoring, if used correctly.

For this, you need to choose metrics that describe how well the system does what it’s supposed to (hence the “business” name).

For a mailserver, this could be, for example :

  • Number of emails sent per unit of time
  • Number of emails received per unit of time
  • Number of user connections
  • Number of user operations (reading/moving/deleting mail)

As you might expect, those get more and more significant as your userbase and traffic grow.
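
To give an idea of what collecting such a metric could look like, here is a rough Python sketch counting delivery statuses in Postfix log lines. The sample lines are made up (only their overall shape follows the usual Postfix syslog format), and the regular expression is an assumption to adapt to your own logs.

```python
import re
from collections import Counter

# Matches delivery lines such as "postfix/smtp[1234]: QUEUEID: to=<...>, ... status=sent"
STATUS_RE = re.compile(
    r"postfix/(?P<service>\w+)\[\d+\]: \w+: to=<[^>]*>.* status=(?P<status>\w+)"
)

# Made-up sample lines, for illustration only
SAMPLE_LOG = [
    "Jan 10 12:00:00 mail postfix/smtpd[1230]: connect from unknown[203.0.113.5]",
    "Jan 10 12:00:01 mail postfix/smtp[1234]: 4F1A2B3C4D: to=<alice@example.org>, "
    "relay=mx.example.org[192.0.2.10]:25, delay=0.8, dsn=2.0.0, status=sent (250 2.0.0 OK)",
    "Jan 10 12:03:17 mail postfix/smtp[1240]: 5A6B7C8D9E: to=<bob@example.net>, "
    "relay=none, delay=30, dsn=4.4.1, status=deferred (connection timed out)",
    "Jan 10 12:05:00 mail postfix/lmtp[1250]: 6B7C8D9E0F: to=<me@example.com>, "
    "relay=mail.example.com[private/dovecot-lmtp], delay=0.2, dsn=2.0.0, status=sent (Saved)",
]


def delivery_counts(log_lines):
    """Count delivery attempts by status (sent, deferred, bounced...)."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts


print(delivery_counts(SAMPLE_LOG))
```

Feeding such counters to a time-series database is where this starts being useful, but that only pays off with real traffic.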

For a personal mailserver, with a single user, most metrics I can think of are worthless, so I won’t be setting up proper monitoring in this way.

The only useful thing I can think of is monitoring whether something is stuck in the mail queue.

Monitoring the mail queue

The classic mailq command does a fine job of displaying the contents of the mail queue in a human-readable format, but it’s not so great for parsing.

Fortunately, postfix provides the postqueue command, which can output JSON (postqueue -j).

Let’s write a little script :

#!/usr/bin/env python3
# Requires at least python3.7 (capture_output and text were added to subprocess.run in 3.7)

import subprocess
import json
import time

# Parameters
max_time_in_queue = 3600 # Raise an alert if an email has been stuck in the queue for more than 1h

# Check the current mail queue, output contains a one-line json for each mail in the queue
queue = subprocess.run(["/usr/sbin/postqueue", "-j"], capture_output=True, text=True)

# Hard fail the script if the subcommand somehow failed
queue.check_returncode()

return_code = 0

# NB : the [:-1] syntax removes the empty string caused by the trailing \n (when non-empty string) or by splitting an empty string.
for message_json in queue.stdout.split("\n")[:-1]:

    message = json.loads(message_json)

    # Time the message has spent in the queue
    enqueued_for = time.time() - message["arrival_time"]

    if enqueued_for > max_time_in_queue:
        print(message_json)
        return_code = 2 # Critical

exit(return_code)

Nothing fancy here :

  • We get the list of mails currently in the queue
  • We check if any of them has been in there for more than an hour
  • If so, we print the corresponding json (to have details in the alert), and exit with code 2 (which indicates the service is in critical state)

Notice how we only alert for email actually stuck in there (longer than an hour) instead of just for mail presence. This is to avoid false positives in three situations :

  • Email being processed just at the time of the check. It’s not stuck, it will be gone in a few seconds, bad luck, you got an alert on that.
  • Transient failures on the recipient side. The email system is designed to be incredibly resilient : a receiving server might be down for a while, and you’re supposed to keep retrying for some time. RFC5321 mentions 4 to 5 days, even if some servers stop retrying sooner.
  • Greylisting : Some servers pretend to be temporarily down or overloaded, and ask you to retry, as an attempt to fight spam (spammers usually won’t retry).

Let’s add a few lines of icinga config to execute this script over ssh :

object CheckCommand "check_mailq" {
  import "by_ssh_custom"
  vars.by_ssh_command = [ "check_mailq" ]
  # Custom check script that we need to install manually on the mailserver
}

apply Service "mail-queue" {
  check_command = "check_mailq"
  assign where host.name == "xxxx"
}

What else ?

Monitoring can take many forms. While the ones mentioned above are among the most common, there’s often more.

Checking blacklists

The email ecosystem relies a lot on reputation and blacklists (of domains, IPs, etc…) to determine if an email is legitimate or not.

Even though the loop through gmail gives us pretty good confidence in our reputation, there’s no harm in trying to check it directly.

There are many online services for that, but I couldn’t find any with a free plan.

I however found a nice script, apparently able to check for the presence in most blacklists of a domain or an IP.

Small trick to test such a script : check your home IP address with it. Most ISPs blacklist the IP ranges they assign to customers, because who would host a mailserver in their living room, besides you and me ? Nobody. What is very common, however, is people getting hacked, and their computers used as part of botnets. Sending spam is one of the things such botnets can be used for. So, from a classic ISP, sending mail is not a good idea.

TL;DR : blcheck seems to work just fine. The only thing missing is IPv6 support.

Let’s wrap it so that it can be used by icinga :

#!/usr/bin/env python3
# Requires at least python3.5
# Put this script in /usr/local/bin/check_blacklist

import subprocess
import sys

return_code = 0

sys.argv.pop(0) # Remove the first argument, which is the script name

for check_target in sys.argv:

    # We don't capture output, it goes directly to stdout/stderr
    check_result = subprocess.run(["/opt/blcheck/blcheck", "-p", check_target])

    if check_result.returncode > 0:
        return_code = 2 # Critical

exit(return_code)

object CheckCommand "check_blacklist" {
  command = [ "check_blacklist" ]
  arguments = {
    "bare_domain" = { # Domain we want to check
      value = "example.com"
      skip_key = true
    }
    # I'm not sure if blcheck needs the raw domain or the mailserver, so I check both
    "mailserver" = {
      value = "mail.example.com"
      skip_key = true
    }
    "server_ipv4" = {
      value = "$address$"
      skip_key = true
    }
    # TODO : blcheck currently doesn't support IPv6
    # https://github.com/IntellexApps/blcheck/issues/25
  }
  timeout = 10m # The test can take a significant amount of time
}

apply Service "mail-blacklist" {
  check_command = "check_blacklist"
  check_interval = 24h
  assign where host.name == "xxxx"
}

And that’s it !
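
For the curious : DNS-based blacklists are queried with a plain DNS lookup. You reverse the octets of the IPv4 address, append the blacklist zone, and resolve the resulting name. If it resolves, the address is listed (the convention is described in RFC 5782). Here is a minimal Python sketch of that mechanism. It is not blcheck’s code, the function names are mine, and dnsbl.example.org is a placeholder zone.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to resolve : reversed IPv4 octets followed by the zone."""
    octets = ipaddress.IPv4Address(ip).exploded.split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone has an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # No such name : not listed (or the lookup itself failed)
    return True


print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
```

By convention, blacklists list 127.0.0.2 as a test entry, which is handy to check that a given zone answers at all.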

DMARC reports

I’m getting daily DMARC reports, mostly from Google (because of the above setup). Those reports could be really useful, for example to detect misconfigurations of your mailserver (broken DKIM…), illegitimate servers trying to send email, mailing-list servers failing to relay your email…

They currently just pile up in a separate inbox, but I’d like to make something out of them, eventually.

Oh well, I guess it’ll have to be in a future post !
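
In the meantime, here is a rough idea of what a first step could look like : aggregate reports are XML files, and the interesting part is the list of rows giving a source IP, a message count, and the DKIM/SPF evaluation. The sample below is hand-written and heavily trimmed (real reports carry more fields, and some use an XML namespace that this sketch does not handle), and the function name is mine.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed-down sample for illustration only
SAMPLE_REPORT = """
<feedback>
  <record>
    <row>
      <source_ip>192.0.2.1</source_ip>
      <count>12</count>
      <policy_evaluated>
        <disposition>none</disposition>
        <dkim>pass</dkim>
        <spf>pass</spf>
      </policy_evaluated>
    </row>
  </record>
  <record>
    <row>
      <source_ip>203.0.113.9</source_ip>
      <count>3</count>
      <policy_evaluated>
        <disposition>reject</disposition>
        <dkim>fail</dkim>
        <spf>fail</spf>
      </policy_evaluated>
    </row>
  </record>
</feedback>
"""


def failing_sources(report_xml):
    """Return (source_ip, count) for rows where neither DKIM nor SPF passed."""
    failures = []
    for row in ET.fromstring(report_xml).iter("row"):
        dkim = row.findtext("policy_evaluated/dkim")
        spf = row.findtext("policy_evaluated/spf")
        # DMARC is satisfied as soon as one of the two mechanisms passes
        if dkim != "pass" and spf != "pass":
            failures.append((row.findtext("source_ip"), int(row.findtext("count"))))
    return failures


print(failing_sources(SAMPLE_REPORT))
```

A failing row coming from your own server’s IP would point to a misconfiguration, while one coming from an unknown IP is more likely someone else trying to use your domain.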

Conclusion

Well, we’ve come a long way !

Funnily enough, I was expecting monitoring a mailserver to be way harder.

In the end, it does take some time to set up, but you can reach decent monitoring without excessive effort.

With all of this, I finally feel comfortable enough to actually start using my mailserver as my main mail provider.

I hope this post was useful to you :)