How to install Traefik on Docker Swarm with certificates


Traefik has become, almost by osmosis, the default reverse proxy in Docker Swarm environments. Its label-driven declarative approach, nearly transparent Let’s Encrypt certificates, the dashboard that for the first time makes understanding routes in a cluster trivial, and very reasonable performance have made it the comfortable winner over classic alternatives like Nginx or HAProxy. This guide is aimed at the moment you move from “works on my laptop” to “this has to receive real traffic”, a less trivial jump than it seems.

What needs to be in place before starting

Four things should be resolved before running a single command. An initialized Swarm cluster, even with a single node. A domain whose A record points to the manager’s public IP. An API token at your DNS provider (Cloudflare, OVH, Route 53, the popular ones) with permission to create TXT records in the zone. And an email address for Let’s Encrypt, used for expiry notices and incident notifications.

Ports 80 and 443 must be open toward the manager. This sounds obvious but is one of the most common causes of initial errors: a misconfigured firewall, a cloud provider blocking port 80 by default, a security group applied but not to the right node.

The network over which services talk

Traefik needs an overlay network shared with the services it will proxy. Creation is a one-time step:

```shell
docker network create --driver=overlay --attachable traefik_public
```

The --attachable flag lets containers that aren’t part of a Swarm service connect if needed. Not essential but saves some development headaches. From here, any service declaring traefik_public as one of its networks can be discovered by Traefik.

Certificates: HTTP challenge or DNS challenge

There are two ways to prove to Let’s Encrypt that you control the domain. The HTTP-01 challenge requires your server to respond to a specific URL on port 80, and works well if you’ll only serve certificates for the domain already pointing to that server. The DNS-01 challenge requires creating a TXT record in the domain’s DNS zone, and has an important advantage: it allows wildcard certificates (*.example.com).

For production I almost always prefer the DNS challenge. The reason is that it frees Traefik from receiving traffic on port 80 during validation (easier in load-balancer scenarios), and it allows wildcards, comfortable when you’ll deploy dozens of subdomains on the same zone without wanting to issue an individual cert for each.

DNS challenge config requires declaring, in Traefik’s static config, the provider and credentials. For Cloudflare, for example, you pass two environment variables (the email and global API key, or an API token with limited permissions) to the Traefik service. The rest is provider-managed: within seconds Traefik requests the cert, the DNS provider creates the TXT, Let’s Encrypt validates it, and the cert is issued and stored.
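As an illustration, a resolver declared for the DNS challenge might look like this in the static config. This is a sketch: the resolver name letsencrypt, the email, and the storage path are placeholders to adjust; with Cloudflare, credentials travel to the Traefik service as environment variables (a scoped CF_DNS_API_TOKEN, or the older CF_API_EMAIL plus CF_API_KEY pair).

```yaml
# Fragment of traefik.yml — names and paths are illustrative.
certificatesResolvers:
  letsencrypt:
    acme:
      email: you@example.com            # Let's Encrypt account address
      storage: /letsencrypt/acme.json   # where issued certs are kept
      dnsChallenge:
        provider: cloudflare            # any supported DNS provider works here
        resolvers:                      # DNS servers used to confirm the TXT propagated
          - "1.1.1.1:53"
          - "8.8.8.8:53"
```

The equivalent for OVH or Route 53 only changes the provider value and the credential variables that provider expects.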

The certificate store

Let’s Encrypt issues certificates every 90 days, and Traefik renews them automatically. They’re kept in a JSON file on disk. For a single manager, a Docker volume suffices. For multiple managers, that file must be on shared storage (NFS, GlusterFS, JuiceFS, whatever you use); otherwise, when the Traefik service moves between nodes, it’ll request new certs from scratch, and Let’s Encrypt has quotas you’ll eventually hit.
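A sketch of the mount, assuming a hypothetical shared path under /mnt/shared; adjust to whatever your storage exposes. Note also that Traefik refuses to read an acme.json whose permissions are wider than 600, so create it with chmod 600 before the first deploy.

```yaml
# Fragment of the Traefik service definition — the host path is an assumption;
# on multi-manager clusters it must live on storage every manager can reach.
services:
  traefik:
    volumes:
      - /mnt/shared/traefik/letsencrypt:/letsencrypt
```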

This is one of the most often-overlooked details. It works perfectly with one node, and the day you scale to a second manager you start seeing mysterious quota errors. If you anticipate growing, solve it from the start.

Static and dynamic configuration

Traefik splits its config into two planes. The static defines things that don’t change hot: discovery providers (Docker Swarm in our case), entrypoints (the ports listened on), certificate resolvers, logging. The dynamic defines what can change without restart: routes, middlewares, services. Static usually lives in traefik.yml; dynamic lives in each service’s labels or in additional files.

In the Traefik service’s compose, the static config lives in a file mounted as a read-only volume. There you declare the ACME resolver with its DNS challenge, the two entrypoints (web for 80, websecure for 443), the Docker Swarm provider, and the dashboard if you want it enabled. There’s nothing conceptually complex in that file, just a convention you learn by reading it carefully once.
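A condensed sketch of such a traefik.yml, using the names from this guide. Entrypoint and resolver names are conventions, not requirements, and the provider key shown is the Traefik v3 one; v2 uses providers.docker with swarmMode: true.

```yaml
# Illustrative static configuration (traefik.yml).
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

providers:
  swarm:                      # Traefik v3 Swarm provider
    exposedByDefault: false   # only services with traefik.enable=true are routed
    network: traefik_public

certificatesResolvers:
  letsencrypt:
    acme:
      email: you@example.com
      storage: /letsencrypt/acme.json
      dnsChallenge:
        provider: cloudflare

api:
  dashboard: true             # protect it before exposing

log:
  level: INFO
```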

The dashboard: useful but must be protected

The dashboard is one of Traefik’s best parts. It shows what routes are active, what middlewares are applied, the state of each service. The temptation is to leave it publicly exposed “to see it”. Don’t.

Two approaches work well. One, make it accessible only through internal networking (e.g., WireGuard VPN) without exposing it through Traefik itself. Two, expose it as another service but protected by a strong authentication middleware (OAuth via Authentik, Keycloak, or at least basic auth with non-trivial credentials). The dashboard contains information that helps an attacker map your infrastructure.
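For the second approach, a hedged sketch of the dashboard router with basic auth, as deploy labels on the Traefik service itself. The hostname and the hash are placeholders: generate the hash with htpasswd -nB admin, and remember that $ must be escaped as $$ inside compose files.

```yaml
deploy:
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.dashboard.rule=Host(`traefik.example.com`)"
    - "traefik.http.routers.dashboard.entrypoints=websecure"
    - "traefik.http.routers.dashboard.tls.certresolver=letsencrypt"
    - "traefik.http.routers.dashboard.service=api@internal"
    - "traefik.http.routers.dashboard.middlewares=dashboard-auth"
    - "traefik.http.middlewares.dashboard-auth.basicauth.users=admin:$$2y$$05$$..."
    # Dummy port: api@internal ignores it, but the Swarm provider expects one.
    - "traefik.http.services.dashboard.loadbalancer.server.port=8080"
```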

The first service behind Traefik

Once Traefik is deployed, exposing a service is declarative. In the app service’s compose, labels with the traefik. prefix declare the routing rule, entrypoint, and certificate to use. The minimum pattern for a web service is: a traefik.enable=true label, a traefik.http.routers.<name>.rule=Host(`app.example.com`) label, an entrypoint label, and one saying which internal port to proxy.
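As a sketch, assuming the network and resolver names used earlier and a hypothetical app.example.com domain:

```yaml
# Minimal compose sketch for a service behind Traefik on Swarm.
# In Swarm mode the labels must live under deploy.labels.
services:
  app:
    image: nginx:alpine
    networks:
      - traefik_public
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.app.rule=Host(`app.example.com`)"
        - "traefik.http.routers.app.entrypoints=websecure"
        - "traefik.http.routers.app.tls.certresolver=letsencrypt"
        - "traefik.http.services.app.loadbalancer.server.port=80"

networks:
  traefik_public:
    external: true
```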

Traefik automatically discovers labels as soon as the service deploys on Swarm, requests the cert if needed, and starts routing. That magic is what explains Traefik’s popularity: configuration lives with the service it configures, not in a separate proxy’s central file.

Middlewares: where policy lives

Middlewares are chains of functions applied to requests. The most useful in production are:

HTTP-to-HTTPS redirection, turning any port-80 request into a 301 to the HTTPS equivalent. Security headers, adding Strict-Transport-Security, X-Frame-Options, and similar to each response. Compression, activating gzip and brotli automatically. IP rate limiting, dampening spikes and basic attacks. Forward authentication to an external server (Authentik or similar) for routes requiring login.

Declaring a middleware once and applying it to several routers is what keeps config manageable. The pattern that works best is defining chains grouping common middlewares: a chain-base with redirect, compression, and headers, applied to everything public; a chain-oauth adding forward auth, applied to dashboards and internal tools.
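The chain-base from the text could be declared once, as labels on the Traefik service, roughly like this; the middleware names mirror the text and the values are illustrative.

```yaml
deploy:
  labels:
    - "traefik.http.middlewares.to-https.redirectscheme.scheme=https"
    - "traefik.http.middlewares.to-https.redirectscheme.permanent=true"
    - "traefik.http.middlewares.sec-headers.headers.stsSeconds=31536000"
    - "traefik.http.middlewares.sec-headers.headers.customFrameOptionsValue=SAMEORIGIN"
    - "traefik.http.middlewares.gzip.compress=true"
    - "traefik.http.middlewares.ratelimit.ratelimit.average=100"
    - "traefik.http.middlewares.chain-base.chain.middlewares=to-https,sec-headers,gzip,ratelimit"
```

Any router can then add traefik.http.routers.<name>.middlewares=chain-base and inherit the whole set.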

What you’ll forget the first time

A handful of details that, by experience, almost everyone misses at first.

It’s tempting to run with DEBUG logging while setting things up; drop it back to INFO for production, which reduces noise without losing useful information. The access log, in contrast, is enormously valuable: enable it in JSON format and pipe it to an aggregator like Loki or Elasticsearch. Prometheus metrics are off by default; enabling them is a few lines and lets you later build useful dashboards.
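The three observability settings above fit in a short static-config fragment (the log path is illustrative):

```yaml
log:
  level: INFO
accessLog:
  format: json
  filePath: /var/log/traefik/access.log
metrics:
  prometheus:
    addEntryPointsLabels: true
    addServicesLabels: true
```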

Traefik’s ACME resolver points at Let’s Encrypt’s production endpoint by default, but most guides sensibly recommend testing against the staging caServer, which issues certificates browsers don’t trust. Forgetting to switch back to production produces the classic “everything seems to work but browsers say the cert is fake” flow, while testing directly against production burns quota you may need later.
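The endpoint is selected with the resolver’s caServer field; a fragment for testing, to be removed once everything validates (resolver name as elsewhere in this guide):

```yaml
certificatesResolvers:
  letsencrypt:
    acme:
      # Staging CA: generous rate limits, but browsers reject its certificates.
      caServer: https://acme-staging-v02.api.letsencrypt.org/directory
```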

Finally, restarting the Traefik service: although Swarm can restart the service without apparent interruption, if the cert store is on a non-shared volume, you can lose the cert file during a badly managed restart. Backups of the directory where acme.json lives are basic discipline.

When Traefik isn’t the answer

Not every deployment deserves Traefik. For very simple services with a single static rule, a classic once-configured Nginx is lighter and more predictable. For architectures with very complex routing (header-based, geo-routing, non-trivial transformations), HAProxy or Envoy offer more raw power though less comfort. For very large workloads with extreme performance requirements on limited CPU, Caddy is interesting and more predictable under stress.

Where Traefik wins comfortably is in environments with many small services changing often, which is exactly the situation of most Swarm stacks. Label-based config, native Docker integration, and automatic certificates are the winning combo.

My recommendation

Start with the minimum: overlay network, Traefik service with basic static config, and a test service behind. Don’t activate the dashboard in production until you have the auth middleware resolved. Don’t try to configure all middlewares at once; add them as you actually need them.

And when things grow, take shared storage of the certificate file seriously. It’s the kind of decision that, made badly, haunts you through the first scale-up. Everything else is tunable, but a Let’s Encrypt cert with burnt quota gives you problems for days.

Two weeks living with Traefik and you’ll have the rhythm; after that, you’ll understand why many teams that try it end up unable to go back.
