Monitoring mTLS Protected APIs Without a Client Certificate

We run a bunch of APIs secured by mTLS at my current place. Third parties connect to our API webservers by presenting a client certificate. We check that the certificate is valid and, if so, permit access to the relevant resources.

As our part of the mTLS handshake, we present our server certificate when clients connect. We've got a vested interest in knowing how long is left on that certificate's validity, so that we can plan to renew it in plenty of time.

We're big users of DataDog, and for most TLS monitoring, we've found the tls check that ships with the DataDog Agent to be superb. It connects to an endpoint, calculates tls.days_left, and we can use that to set up a monitor that alerts us well before expiry. It even supports mTLS, with tls_cert and tls_private_key options that let our agent present a client cert. Great, right?

Couple of problems here though:

  • Issuing yourself a client cert to perform "monitoring" is one more set of credentials that might have access to your app - effectively a "back door". Sure, you could write some logic into your app or webserver config to treat certain certs differently. But it's a bit clunky and easy to mess up.
  • Monitoring expiry of a server cert with a client cert? Now you have two things to monitor. How long is your monitoring cert valid for? Don't say forever! You need to also make sure you're monitoring and renewing that client cert appropriately alongside your server cert. Failure to do so means your monitoring solution is kaput. Twice the work.

Using the DataDog tls check without tls_cert and tls_private_key to monitor an mTLS API isn't an option, either. Try it and you'll see the following:

$ datadog-agent check tls
...
{
  "check": "tls.cert_validation",
  ...
  "status": 2,
  "message": "[SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:1129)",
  ...
}

And it makes sense on the face of it. DataDog's not keen on reporting the success of the TLS connection until a full handshake's been made. But for mTLS APIs, we need that client cert to complete the handshake. So the tls check is no good here.

Where do we go from here? Well, the openssl CLI tool seems to be able to fetch a server certificate just fine when you point it at the same API:

$ openssl s_client -showcerts -servername foo.com \
  -connect foo.com:443 2>/dev/null </dev/null | \
    openssl x509 -noout -dates
  notBefore=Apr 26 10:56:57 2024 GMT
  notAfter=Apr 26 10:56:57 2025 GMT
Fetching a server cert from an mTLS enabled server without a client cert

Why can openssl do this where DataDog can't?

I think it's due to the ordering of the handshake operation. Namely these steps:

  1. Client connects to server
  2. Server presents its TLS certificate (ding, ding, ding!)
  3. Client verifies the server's certificate
  4. Client presents its TLS certificate (here's where we fall down...)
  5. ...etc...
  6. Handshake established!

All the info we need is in step 2. We don't need to get to step 6. The openssl tool clearly knows that and doesn't press the issue. So how do we get Datadog to do the same?

Well, we can write a DataDog custom check to submit our own metric(s). This framework makes it super-easy to write some Python to run, calculate some metric and submit the value up to DataDog for easy use. What code to write, though?

Luckily, the DataDog Agent's embedded Python comes with pyOpenSSL available to use. This is a Python library that wraps OpenSSL, and it operates differently enough from the standard Python ssl library (which DataDog's tls check uses) to be useful to us when writing a custom check.

Namely, we can attempt to make a connection to an mTLS enabled server, catch the sslv3 alert bad certificate exception that will undoubtedly result, then use the information we've gathered already without worrying about the overall status of the handshake - we weren't going to meaningfully send any data to the server, anyway. We've already been sent the server cert and only care about its validity - so let's use it.

We have everything we need to decide whether we trust the presented cert already, too - our local CA truststore. We don't need a full handshake for trust purposes.

A straight-up Python script to do this might look something like:

#!/usr/bin/python3
import datetime
import socket

import OpenSSL

host = "foo.com"
port = 443

# open a plain TCP connection, then layer a client-side TLS connection on top
sock = socket.create_connection((host, port))
context = OpenSSL.SSL.Context(OpenSSL.SSL.TLS_METHOD)
connection = OpenSSL.SSL.Connection(context, sock)
connection.set_connect_state()

# think of the below as the `-servername` arg to `openssl`
# if you need to adjust it, feel free to!
connection.set_tlsext_host_name(host.encode('utf-8'))

# catch the expected 'sslv3 alert bad certificate' error on mTLS APIs
# and continue anyway, knowing we've already been sent the server cert
try:
    connection.do_handshake()
except OpenSSL.SSL.Error:
    pass
cert = connection.get_peer_certificate()
sock.close()

# calculate expiry and output to screen
# notAfter is expressed in UTC, so compare against the current time in UTC
now = datetime.datetime.now(datetime.timezone.utc)
exp = datetime.datetime.strptime(
    cert.get_notAfter().decode('ascii'),
    '%Y%m%d%H%M%SZ'
).replace(tzinfo=datetime.timezone.utc)

left_secs = int((exp - now).total_seconds())
left_days = left_secs / 60 / 60 / 24

print(cert.get_subject().commonName)
print(left_secs)
print(left_days)
No handshake? No problem! Let's proceed regardless
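The script above fetches the cert but doesn't decide whether to trust it. Here's a rough sketch of how the truststore check mentioned earlier could be bolted on. It assumes OpenSSL's default verify paths point at a usable CA bundle on your host. VerifyRecorder and fetch_cert_and_trust are names I've made up for illustration. It only looks at the certificate chain, not whether the hostname matches.

```python
import socket


class VerifyRecorder:
    """Collects chain verification errors instead of aborting the handshake."""

    def __init__(self):
        self.errors = []  # (depth, OpenSSL error number) for each failure

    def record(self, conn, cert, errno, depth, ok):
        if not ok:
            self.errors.append((depth, errno))
        # Returning True tells OpenSSL to carry on, so we still get the cert.
        return True

    @property
    def trusted(self):
        return not self.errors


def fetch_cert_and_trust(host, port=443):
    # Imported here so VerifyRecorder can be exercised without pyOpenSSL.
    from OpenSSL import SSL

    recorder = VerifyRecorder()
    context = SSL.Context(SSL.TLS_METHOD)
    context.set_default_verify_paths()  # assumption: system CA bundle is usable
    context.set_verify(SSL.VERIFY_PEER, recorder.record)

    sock = socket.create_connection((host, port))
    try:
        connection = SSL.Connection(context, sock)
        connection.set_connect_state()
        connection.set_tlsext_host_name(host.encode('utf-8'))
        try:
            connection.do_handshake()
        except SSL.Error:
            pass  # expected on mTLS APIs: we presented no client cert
        cert = connection.get_peer_certificate()
        # No cert at all means nothing was verified, so don't call it trusted.
        return cert, cert is not None and recorder.trusted
    finally:
        sock.close()
```

Calling fetch_cert_and_trust("foo.com") would hand back the cert plus a True/False trust flag, which you could submit as a second metric or a service check alongside the expiry.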

Translate this into a DataDog check using their docs, ship it up as a metric with self.gauge(), and you've got your validity metrics for an mTLS API's server cert without having to use a client cert. Use these metrics in your cert monitors.
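For a rough idea of what that translation might look like, here's a sketch rather than a finished check. The metric names, the tag names, the host/port instance keys and the class and helper names are all my own inventions. The fallback AgentCheck class is only there so the file can be dry-run outside the Agent.

```python
import datetime
import socket

try:
    from datadog_checks.base import AgentCheck
except ImportError:
    # Stand-in for dry runs outside the Agent; this is NOT DataDog's class.
    class AgentCheck:
        def __init__(self, *args, **kwargs):
            self.submitted = []

        def gauge(self, name, value, tags=None):
            self.submitted.append((name, value, tags))


def seconds_left(not_after, now=None):
    """Seconds until expiry, given pyOpenSSL's get_notAfter() bytes (UTC)."""
    utc = datetime.timezone.utc
    exp = datetime.datetime.strptime(
        not_after.decode('ascii'), '%Y%m%d%H%M%SZ'
    ).replace(tzinfo=utc)
    now = now or datetime.datetime.now(utc)
    return int((exp - now).total_seconds())


def fetch_server_cert(host, port):
    """Grab the server cert even if the mTLS handshake gets rejected."""
    from OpenSSL import SSL  # pyOpenSSL, as used in the script above

    sock = socket.create_connection((host, port))
    try:
        connection = SSL.Connection(SSL.Context(SSL.TLS_METHOD), sock)
        connection.set_connect_state()
        connection.set_tlsext_host_name(host.encode('utf-8'))
        try:
            connection.do_handshake()
        except SSL.Error:
            pass  # expected when no client cert is presented
        return connection.get_peer_certificate()
    finally:
        sock.close()


class MtlsCertExpiryCheck(AgentCheck):
    def check(self, instance):
        host = instance['host']  # hypothetical instance keys
        port = int(instance.get('port', 443))
        cert = fetch_server_cert(host, port)
        if cert is None:
            return  # no cert received, so nothing to report
        secs = seconds_left(cert.get_notAfter())
        tags = ['host_checked:{}'.format(host), 'port:{}'.format(port)]
        # Metric names are made up for this sketch; pick your own.
        self.gauge('custom.mtls.seconds_left', secs, tags=tags)
        self.gauge('custom.mtls.days_left', secs / 86400, tags=tags)
```

The check file goes in the Agent's checks.d directory, with a config file of the same name in conf.d listing one instance per API to watch. Keeping seconds_left as a plain function means the date maths can be tested without touching the network.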

It'll work with non-mTLS APIs too (where the handshake completes), but in those cases I'd probably recommend just using the tls check - it's likely more bulletproof.

Hope this helps make someone else's life a little easier.