Building a tiny Cloudflare Zero Trust operator

I’ve been slowly trying to make my homelab less of a collection of “future me can work that out” notes and more of a place where the important bits are declared, reconciled, and generally not held together by one ancient shell script, a vague memory, and vibes. A robust operating model, obviously.

This is, naturally, how you end up writing a Kubernetes operator to avoid maintaining a small amount of YAML by hand. Completely proportionate. Very normal. No notes.

The latest bit of self-inflicted infrastructure therapy is cfzt-operator: a Kubernetes operator for Cloudflare Tunnel, Cloudflare Access, DNS CNAMEs, and private routes.

The moving parts are CloudflareTunnel, CloudflareExposure, CloudflareAccessPolicy, and CloudflareTunnelRoute: tunnel, hostname, policy, route. Thrilling stuff, if your idea of a good time has taken a serious turn.

The real journey was “surely someone has already GitOpsed this”, followed by finding that everything was close enough to waste a weekend on and not close enough to use. Lovely.

The itch

Cloudflare Tunnel and Access are a good fit for a homelab. I don’t have to poke holes through the firewall for every service, I get identity-aware access, and I can keep random internal apps off the public internet without pretending I’m running a bank. Handy, because “old server in the garage, she’ll be right” is probably not ISO-ready.

The manual Cloudflare workflow is fine once or twice: create a tunnel route, add DNS, create an Access app, bind a policy, then hope you remember what the hell you clicked when you move the service later.

That’s the bit I wanted gone.

Most of my cluster is GitOps-managed now: apps, databases, certs, ingress, storage, BGP, the domestic-grade enterprise cosplay. Having Cloudflare off to the side as a clickops island felt wrong. Not morally wrong. I’m not staging a royal commission into Jellyfin. Just wrong enough that I’d rather build a controller than keep a cursed spreadsheet in my skull.

What I wanted was external-dns, but for the whole Zero Trust shape: hostname, tunnel route, DNS, Access app, policy binding, LAN origins, cleanup, status, the lot. Declarative or bugger off, basically.

Why not use an existing thing?

There are plenty of ways to run cloudflared in Kubernetes. Cloudflare has examples. Charts exist. A DaemonSet is easy if you’ve got a token and a healthy tolerance for future archaeology.

That solves the connector. It doesn’t solve the lifecycle.

I didn’t just want pods. I wanted Kubernetes to say: “publish this hostname to this origin through this tunnel, protect it with Access, and clean it up later.” Existing tools were DNS-focused, annotation-first, Ingress-shaped, or happy to leave Access as somebody else’s problem, which is how all good messes start.

Could I have glued together external-dns, a cloudflared chart, Terraform, and a couple of scripts? Sure. Would I have understood it six months later after a long shift and two coffees too few? Absolutely not. Past me is a bit rubbish in the documentation department.

So I built the boring thing I wanted: CRs, reconciliation, status, finalizers, and less dashboard archaeology at stupid o’clock.

The shape

The main user-facing object is CloudflareExposure, because day-to-day I don’t think “manage a tunnel config document”, I think “publish Jellyfin”. A tunnel is infrastructure. An exposure is intent.

An exposure turns into:

hostname -> tunnel route -> DNS CNAME -> Access app -> policy binding

The important design bit is that only the Tunnel controller writes the Cloudflare tunnel config. Exposures don’t all poke at Cloudflare like they’re having a go at the same barbecue. They enqueue the tunnel, and the tunnel controller rebuilds the full ingress document from Kubernetes state.

That’s less clever, and that’s the point. Clever is where the bodies are buried.

The config is derived data: sort hostnames, append the http_status:404 catch-all, PUT the lot. No etags. No per-rule tags. No optimistic-concurrency interpretive dance at 11pm. Leader election is on, the tunnel reconciler handles one reconcile at a time, and the result is dull. Dull is good.
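To make “derived data” concrete, here is a minimal sketch of the rebuild step. It is not the operator’s actual code: Python is used only for brevity, and the Exposure class and build_ingress function are names invented for illustration. The rule shape (a hostname plus a service URL, with a final catch-all) follows Cloudflare’s tunnel ingress format as I understand it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    """Hypothetical in-memory view of a CloudflareExposure spec."""
    hostname: str
    protocol: str
    host: str
    port: int


def build_ingress(exposures):
    """Rebuild the whole ingress document from scratch; nothing is patched in place."""
    # Sorting makes the output deterministic, so an unchanged cluster
    # always produces an identical document.
    rules = [
        {"hostname": e.hostname, "service": f"{e.protocol}://{e.host}:{e.port}"}
        for e in sorted(exposures, key=lambda e: e.hostname)
    ]
    # The last rule has no hostname, so it catches everything else.
    rules.append({"service": "http_status:404"})
    return {"config": {"ingress": rules}}
```

Feed it the same exposures in any order and you get the same document back, which is the whole trick: the PUT is a pure function of cluster state, so there is nothing to merge and nothing to race.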

The actual hard part

The Cloudflare API calls weren’t the hard bit. They were annoying in the normal “real APIs are a bit cooked” way, but the hard part was ownership.

What if a DNS record already exists? What if the Access app was hand-made? What if one Cloudflare write succeeds and the next one shits itself? What can the operator delete without becoming tonight’s domestic incident?

That’s where the project got its spine.

DNS records get comments. Access apps get tags. Tunnel routes get compact comments. Tunnels and Access policies are tracked by status ID with name checks. Before mutating or deleting, the operator verifies ownership. If something looks foreign, it stops and reports ForeignResource, ForeignTunnel, ForeignPolicy, ForeignRoute, or HostnameConflict.
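For the DNS case, the check looks roughly like the sketch below. Treat it as an illustration under assumptions, not the real implementation: the marker format, the function names, and the exact rule for choosing between ForeignResource and HostnameConflict are my guesses at a reasonable shape. The only things taken from above are the idea of a comment-based marker and the reason names.

```python
MARKER_PREFIX = "managed-by=cfzt-operator"  # hypothetical marker format


def owner_marker(namespace, name):
    """Comment the operator would stamp on a DNS record it creates (assumed format)."""
    return f"{MARKER_PREFIX};owner={namespace}/{name}"


def check_dns_ownership(existing_record, namespace, name):
    """Return None when it is safe to mutate or delete, else a reason string."""
    if existing_record is None:
        return None  # nothing there yet, safe to create
    comment = existing_record.get("comment") or ""
    if not comment.startswith(MARKER_PREFIX):
        return "ForeignResource"  # hand-made, or made by some other tool
    if comment != owner_marker(namespace, name):
        return "HostnameConflict"  # operator-made, but claimed by a different exposure
    return None
```

The important property is the default: anything without a recognisable marker is treated as somebody else’s, and the controller reports a reason instead of writing.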

A clunky v1alpha1 API is survivable. An operator that deletes the wrong Cloudflare resource is how you end up explaining that the internet is broken because Dad got creative with controllers again. Character-building, I’m told.

How it looks

Install the chart:

helm install cfzt-operator oci://ghcr.io/andrewreid/charts/cfzt-operator \
  --namespace cfzt-system \
  --create-namespace \
  --version <version>

Create the Cloudflare credentials Secret:

kubectl -n cfzt-system create secret generic cloudflare-credentials \
  --from-literal=accountId='<cloudflare-account-id>' \
  --from-literal=apiToken='<cloudflare-api-token>'

Then declare a tunnel and an exposure:

apiVersion: cfzt.reid.ee/v1alpha1
kind: CloudflareTunnel
metadata:
  name: homelab
spec:
  tunnelName: homelab-rke2
  credentialsSecretRef:
    name: cloudflare-credentials
  cloudflared:
    namespace: cfzt-system
---
apiVersion: cfzt.reid.ee/v1alpha1
kind: CloudflareExposure
metadata:
  name: jellyfin
  namespace: media
spec:
  hostname: jellyfin.example.com
  tunnelRef:
    name: homelab
  origin:
    protocol: http
    host: jellyfin.media.svc.cluster.local
    port: 8096
  access:
    enabled: true
    policyRef:
      uuid: 00000000-0000-4000-8000-000000000001

That exposure produces the tunnel ingress rule, DNS CNAME, Access app, and policy binding. If raw policy UUIDs everywhere feel a bit feral, reference a managed CloudflareAccessPolicy:

spec:
  access:
    enabled: true
    policyRef:
      name: family-only
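The two policyRef forms resolve to the same thing in the end: a Cloudflare policy ID. A rough sketch of that resolution, assuming a managed policy only becomes usable once its status carries an ID (resolve_policy, managed_policies, and the status_id key are hypothetical names, and this is Python for illustration rather than the operator’s code):

```python
def resolve_policy(policy_ref, managed_policies):
    """Return (policy_uuid, reason); exactly one of the two is None."""
    if "uuid" in policy_ref:
        # A raw UUID is passed straight through.
        return policy_ref["uuid"], None
    policy = managed_policies.get(policy_ref["name"])
    # A managed policy is only usable once Cloudflare has given it an ID.
    if policy is None or not policy.get("status_id"):
        return None, "PolicyNotReady"
    return policy["status_id"], None
```

In this reading, an exposure that references a policy which hasn’t reconciled yet waits with a PolicyNotReady reason instead of binding to nothing.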

External origins are first-class too, because not everything in my house is a Kubernetes Service:

spec:
  hostname: ha.example.com
  tunnelRef:
    name: homelab
  origin:
    protocol: http
    host: homeassistant.lan
    port: 8123

Private routes are separate and intentionally small:

apiVersion: cfzt.reid.ee/v1alpha1
kind: CloudflareTunnelRoute
metadata:
  name: homelab-lan
spec:
  tunnelRef:
    name: homelab
  network: 192.168.20.0/24

Apply it and watch status:

kubectl apply -f tunnel.yaml
kubectl apply -f exposure.yaml
kubectl get cloudflaretunnels,cloudflareexposures -A

Conditions are boring: Ready and Progressing, with reasons like HostnameConflict, PolicyNotReady, and BlockedByExposures. A low bar, but Kubernetes gives us plenty of tiny mysteries already.
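If you’d rather not eyeball the table, the conditions are easy to check from a script. This is a small sketch that assumes the resources expose the usual Kubernetes status.conditions list (type, status, reason); not_ready is a name I made up, and the input is whatever sits under "items" in the output of kubectl get with -o json.

```python
def not_ready(items):
    """Return (kind, name, reason) for every object whose Ready condition is not True."""
    problems = []
    for item in items:
        conditions = item.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is not None and ready.get("status") == "True":
            continue
        # No Ready condition at all usually means "not reconciled yet".
        reason = ready.get("reason", "Unknown") if ready else "NoReadyCondition"
        meta = item["metadata"]
        namespace = meta.get("namespace")
        name = f"{namespace}/{meta['name']}" if namespace else meta["name"]
        problems.append((item["kind"], name, reason))
    return problems
```

An empty list means everything is Ready. Anything else comes back with its reason attached, which is most of what you want at stupid o’clock.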

What it doesn’t do

This is not a general Cloudflare operator. It’s not Gateway policy management, an Ingress controller, Crossplane with a hat on, or a multi-tenant pile of pain.

It manages tunnels, cloudflared, hostname routes, DNS, Access, private routes, status, ownership, and cleanup. That’s enough. Anything more can wait until I lose another argument with myself.

Releasing it without lying to myself

The release workflow builds a candidate image and chart, runs lint, tests, chart smoke, and live Cloudflare smoke, then publishes. Publish-last, because publish-first is how you get broken releases and a quiet little shame spiral.

The live smoke is overkill for a homelab operator, which is exactly why it belongs there. Fake clients prove controller semantics. Real Cloudflare proves I haven’t misunderstood the API badly enough to embarrass myself in public. v0.1.0 made it through, so that’s basically enterprise if you squint.

The AI slop production line

This is AI slop. Useful slop, hopefully. Reviewed slop, certainly. Artisanal slop.

I drove it with Claude Code and Codex. Codex did implementation, tests, docs, and release plumbing. Claude did deep repo reviews and produced fairly brutal reports about complexity, dead branches, stale assumptions, missing tests, and reconciliation logic starting to smell. Then Codex worked through the report. A responsible adult might call this process. I call it outsourcing my nitpicking to two expensive autocomplete machines.

It worked better than it had any right to. One model built the shed, the other pointed at the load-bearing mistakes, then the first one went and found a hammer again. I still steered the design, because “make the robots argue about finalizers” is not a governance model, despite being worryingly close to one I’ve used at work.

Really, this exists because my homelab is an overengineered little systems playground. I’m trying to GitOps all of the things, make the important bits declarative, and build enough resilience that I can break one part without spelunking through browser tabs like a mug. Ridiculous? Yes. But at least now it’s ridiculous with state.

Was it worth it?

For me, yes.

The operator gives me one GitOps-shaped object per published thing. It keeps the Cloudflare dashboard from becoming a second source of truth. It handles the boring lifecycle work. It refuses to touch resources it can’t prove it owns. It lets me publish both Kubernetes Services and random LAN-hosted oddities in the same model. That’s the sort of nonsense I apparently find soothing now.

It’s also a reminder that the annoying part of operators is almost never the happy path. The happy path is a demo. The real operator is the pile of decisions around drift, deletion, ownership, partial failure, retries, status, and “what happens when a future version of me does something silly”, which is not so much a risk as a roadmap.

Could this have been a README, a Helm chart, and a firm personal commitment to never forget the manual steps? Maybe. But I’ve met me. This is better.

The code almost certainly contains some questionable moments. Still, it works, it’s tested, and it makes my little corner of Cloudflare feel like part of the cluster instead of a haunted side quest with better branding.

That’s enough for now.