Automating TLS Certificates in Kubernetes With cert-manager

I have been paged for an expired certificate more times than I’d like to admit. Every one of those incidents was avoidable, and every one had the same root cause: a human was in the renewal loop. The fix is to take the human out of it. cert-manager does that, and once it’s running correctly you stop thinking about certificate expiry entirely — which is exactly the goal.

This is the setup I use, the failure modes I’ve hit, and how to verify it actually works before you trust it with production traffic.

What cert-manager actually does

cert-manager is a Kubernetes controller that watches for Certificate resources and makes the real world match them. You declare “I want a valid cert for api.example.com stored in this Secret,” and the controller handles issuance, stores the result, and renews it before expiry. By default renewal starts two-thirds of the way through the certificate’s lifetime, so a 90-day Let’s Encrypt cert is renewed around day 60. No cron jobs, no manual openssl incantations.
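
To make that declaration concrete, here is a minimal sketch of a Certificate. It assumes a ClusterIssuer named letsencrypt-prod already exists (the name this post uses for its production issuer), and the default namespace is only a placeholder. renewBefore is optional; the 720h shown here just spells out what the default works out to for a 90-day certificate.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: default          # assumption: wherever the consuming workload lives
spec:
  secretName: api-tls         # Secret that cert-manager creates and keeps renewed
  dnsNames:
  - api.example.com
  issuerRef:
    name: letsencrypt-prod    # assumption: a ClusterIssuer with this name exists
    kind: ClusterIssuer
    group: cert-manager.io
  renewBefore: 720h           # optional: start renewal 30 days before expiry
```

Once this is applied, kubectl get certificate api-tls should eventually report the resource as Ready, and the api-tls Secret should hold tls.crt and tls.key.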

It talks to issuers: Let’s Encrypt (or any other ACME CA), HashiCorp Vault, a private CA, or, through external issuer plugins, your cloud provider’s PKI. The Certificate resource looks the same regardless of backend, which is why it’s worth standardizing on.

Installing it

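Install via the official manifest or Helm. The one thing people forget with Helm is the CRDs: the chart skips them unless you opt in with crds.enabled=true (installCRDs=true on older chart versions).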

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true

Verify the three pods (controller, webhook, cainjector) are healthy before you go further:

kubectl get pods -n cert-manager

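If the webhook pod isn’t ready, every cert-manager resource you create, Issuers included, will be rejected by the API server with a confusing “failed calling webhook” error that says nothing about certificates. Always confirm webhook health first.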

Setting up an issuer

An Issuer is namespaced; a ClusterIssuer works cluster-wide. For Let’s Encrypt with HTTP-01 validation:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx

A hard-won lesson: start with the staging server (https://acme-staging-v02.api.letsencrypt.org/directory). Let’s Encrypt’s production rate limits are strict: 5 failed validations per account, per hostname, per hour. If your DNS or ingress is misconfigured, you can burn through that quota in minutes and block issuance for the hostname until the window resets. Staging certificates are not trusted by browsers, so treat them as a dry run: get a staging cert issuing cleanly first, then flip to prod.
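
A staging issuer is the same manifest with a different server URL. The sketch below mirrors the production issuer above; the name letsencrypt-staging and the account-key Secret name are my own placeholders, and I would normally apply this in a non-production cluster.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging                 # hypothetical name
spec:
  acme:
    # Let's Encrypt staging endpoint: generous limits, untrusted certificates
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key # hypothetical; keep it separate from prod
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx
```

While testing, reference letsencrypt-staging wherever you would reference the production issuer, and switch back only after a staging certificate reaches Ready.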

Issuing your first certificate

You rarely create Certificate resources by hand. The common pattern is ingress-shim: annotate your Ingress and cert-manager creates the Certificate for you.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80

cert-manager sees the annotation and the tls block, creates a Certificate named after the secretName (api-tls here), runs the ACME challenge, and writes the signed certificate and private key to the api-tls Secret.

DNS-01 for wildcards and private services

HTTP-01 needs a publicly reachable endpoint. If you want a wildcard cert (*.example.com) or your service isn’t internet-facing, use DNS-01 instead. You give cert-manager credentials to your DNS provider and it writes a TXT record to prove ownership:

solvers:
- dns01:
    route53:
      region: us-east-1
      role: arn:aws:iam::111122223333:role/cert-manager

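DNS-01 takes more setup, but it’s the only path to wildcards, and it works for internal-only hostnames where Let’s Encrypt could never reach an HTTP endpoint. The one requirement is that the DNS zone itself is publicly resolvable, because that is where Let’s Encrypt looks up the TXT record.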
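
For context, this is roughly how the solver fragment above fits into a complete issuer, together with a wildcard Certificate that uses it. Treat it as a sketch: the resource names and the example.com zone are placeholders, and it assumes cert-manager already has AWS credentials that are allowed to assume the role shown.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-dns                # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-dns-account-key
    solvers:
    - selector:
        dnsZones:
        - example.com                       # use this solver only for names in this zone
      dns01:
        route53:
          region: us-east-1
          role: arn:aws:iam::111122223333:role/cert-manager
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com                # hypothetical name
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
  - "*.example.com"                         # quoted: a leading * is YAML alias syntax
  - example.com                             # the wildcard does not cover the apex
  issuerRef:
    name: letsencrypt-prod-dns
    kind: ClusterIssuer
```

The dnsZones selector is optional, but it keeps this solver from being tried for hostnames in zones the role cannot modify.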

Debugging when it doesn’t issue

When a cert is stuck, walk the chain of resources cert-manager creates. Each one is a clue:

kubectl describe certificate api-tls
kubectl describe certificaterequest
kubectl describe order
kubectl describe challenge

The Challenge object is where the truth lives. Ninety percent of my issuance failures show up there as a clear message — a 404 on the .well-known/acme-challenge path (ingress routing wrong) or an NXDOMAIN (DNS not propagated). Read the Challenge events before you guess at anything.

The single fastest diagnostic:

kubectl get challenges -A

If there’s a pending challenge older than a couple of minutes, that’s your problem. If there are none and the cert is still not ready, the issue is upstream in the Certificate or Issuer.

Monitoring expiry as a backstop

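Automation sometimes fails silently: a deleted Secret, a revoked ACME account, a DNS provider credential that rotated. Don’t rely solely on cert-manager noticing. Scrape its metrics and alert on certmanager_certificate_expiration_timestamp_seconds so you hear about a stalled renewal well before expiry. With 90-day certs renewed around day 60, a healthy certificate never has much less than 30 days left, so a 14-day threshold only fires once renewal has been failing for a couple of weeks. The whole point is to never be surprised; a backstop alert is what makes that promise real. The PromQL expression I alert on: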

(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 14
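
Wrapped in a Prometheus rules file, that expression might look like the sketch below. The group name, alert names, severities, and for durations are my own choices, and the second rule assumes your cert-manager version exposes certmanager_certificate_ready_status with a condition label, so check your /metrics output before relying on it.

```yaml
groups:
- name: cert-manager                        # hypothetical group name
  rules:
  - alert: CertificateExpiringSoon
    # Fires when any certificate has fewer than 14 days of validity left
    expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 14
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
  - alert: CertificateNotReady
    # Assumption: the ready-status metric and its condition label exist in your version
    expr: certmanager_certificate_ready_status{condition="False"} == 1
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} has not been Ready for 30 minutes"
```

The expiry rule is the real backstop; the readiness rule just surfaces a stuck issuance earlier.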

A few rules I follow

One ClusterIssuer per environment, named clearly. Mixing staging and prod issuers in one cluster is how you ship a browser-trust-failing staging cert to production.
Don’t manually edit the generated Secrets. cert-manager owns them and will overwrite your changes at the next renewal. If you need the cert in another namespace, create a second Certificate there or replicate the Secret with a dedicated tool.
Pin the chart version and read the upgrade notes. cert-manager has shipped breaking CRD and API changes between releases, and a botched CRD upgrade can wedge every certificate in the cluster.

Before you roll any of this into production, get an extra pair of eyes on the manifests and RBAC. If you don’t have a reviewer handy, our AI code review is good at catching an over-broad ClusterIssuer or a Secret reference that points at the wrong namespace.

cert-manager isn’t glamorous, but it converts a recurring 3am page into a thing you genuinely forget exists.

Automated certificate configuration still deserves human review. Verify issuer scope and DNS permissions against your own environment before trusting it with production traffic.