Investigation: cert-manager ACME solver uses the jwk
field instead of kid
in neworder call for non-letsencrypt calls
In the Stackover question 70897574, user1563721 suggests that cert-manager's ACME solver is not behaving as it should with non-let's encrypt servers. More specifically, that new-order
is called using kid
instead of jwk
. In the remainder of this page, I detail how to reproduce this issue using Pebble (a smaller version of Boulder, which is the ACME server Let's Encrypt uses).
Related:
Install cert-manager but turn off the deployment:
kind create cluster
helm upgrade --install cert-manager jetstack/cert-manager --version v1.7.1 --namespace cert-manager --set installCRDs=true --create-namespace
kubectl scale deployment --replicas 0 -n cert-manager cert-manager
Make sure me
resolves to 127.0.0.1. That's due to the fact that Go binaries
do not honor the HTTPS_PROXY variable for the 127.0.0.1 and localhost domains:
sudo perl -ni -e 'print if !/^\d+\.\d+\.\d+\.\d+ me$/d' /etc/hosts
sudo tee -a /etc/hosts <<<"127.0.0.1 me"
Run mitmproxy:
curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/josejson.py >/tmp/josejson.py
curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py
mitmproxy --listen-host 0.0.0.0 -p 9090 --ssl-insecure -s /tmp/watch-stream.py -s /tmp/josejson.py \
--set client_certs=$(kubectl incluster --print-client-cert >/tmp/me.pem && echo /tmp/me.pem)
Run pebble in a separate terminal session:
go install github.com/letsencrypt/pebble/cmd/pebble@latest
go install github.com/jsha/minica@latest
cd /tmp
minica --domains me -ip-addresses 127.0.0.1,0.0.0.0
pebble -config /dev/stdin <<EOF
{
"pebble": {
"listenAddress": "me:14000",
"managementListenAddress": "0.0.0.0:15000",
"certificate": "/tmp/me/cert.pem",
"privateKey": "/tmp/me/key.pem",
"httpPort": 5002,
"tlsPort": 5001,
"externalAccountBindingRequired": false
}
}
EOF
Run the cert-manager controller in yet another terminal session:
HTTPS_PROXY=:9090 go run ./cmd/controller -v=2 --leader-elect=false --kubeconfig=<(kubectl incluster --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem | perl -pe "s/\d+.0.0.\d+/me/")
Finally, create the issuer:
kubectl apply -f- <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: minica
spec:
dnsNames:
- a.b.c
issuerRef:
kind: Issuer
name: minica
secretName: minica
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: minica
spec:
acme:
email: [email protected]
server: https://me:14000/dir
skipTLSVerify: true
privateKeySecretRef:
name: minica
solvers:
- http01:
ingress:
class: traefik
EOF
At this point, there is no issue. The new account is created with no problem, and subsequent new order calls work. Here is the trace using mitmproxy:
The call itself contains the kid
:
🚧 I uploaded the mitmproxy trace in case you want to check it out. You can read it with the following command:
curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/josejson.py >/tmp/josejson.py mitmproxy -s /tmp/josejson.py -r <(curl -L https://gist.github.com/maelvls/74c6584d95017a79e8ab107030b060e6/raw/da6a0c1b39d8cd7623bfef7f4dae80b3db881459/pebble-mitmproxy-traces.tar.gz | tar xzO pebble-happy.mitmproxy)
After re-starting pebble, then renewing the certificate with the following command:
curl -L https://github.com/jetstack/cert-manager/releases/download/v1.7.1/cmctl-linux-amd64.tar.gz | tar xz cmctl
./cmctl renew minica
then things go sideways:
The call to /order-plz
contains the jwk
field instead of kid
:
The response makes it clear, /order-plz wants jwk
, not kid
:
🚧 To see the recorded mitmproxy trace:
curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/josejson.py >/tmp/josejson.py mitmproxy -s /tmp/josejson.py -r <(curl -L https://gist.github.com/maelvls/74c6584d95017a79e8ab107030b060e6/raw/da6a0c1b39d8cd7623bfef7f4dae80b3db881459/pebble-mitmproxy-traces.tar.gz | tar xzO pebble-unhappy.mitmproxy)
It seems like cert-manager is too optimistic regarding 400 errors. The error:
{
"detail": "unable to find existing account for only-return-existing request",
"status": 400,
"type": "urn:ietf:params:acme:error:accountDoesNotExist"
}
should be understood by cert-manager as a non-retriable error and cert-manager should try to register the account again, without the {"returnExistingOnly": true}
part.
I realized this issue was already been documented in 2020:
The acme library first makes an account lookup call using POST /new-acct with a payload of
{"onlyReturnExisting": true}
. If the account lookup fails, then that results inurn:ietf:params:acme:error:accountDoesNotExist
, and we fail to obtain the kid (key id) which is needed for the/new-order
payload. Nevertheless, the code persists and makes the call anyway, resulting in the error message we're seeing here.
After a debugging session with delve, I discovered that the 4 calls (with the red line above) originate from one single function in acmeorders/sync.go
:
acmeOrder, err := cl.AuthorizeOrder(ctx, authzIDs, options...)
if acmeErr, ok := err.(*acmeapi.Error); ok {
if acmeErr.StatusCode >= 400 && acmeErr.StatusCode < 500 {
log.Error(err, "failed to create Order resource due to bad request, marking Order as failed")
c.setOrderState(&o.Status, string(cmacme.Errored))
o.Status.Reason = fmt.Sprintf("Failed to create Order: %v", err)
return nil
}
}
It seems like we decided not to re-check the registration in acme/setup.go, and that is causing this bug.
// If the Host components of the server URL and the account URL match,
// and the cached email matches the registered email, then
// we skip re-checking the account status to save excess calls to the
// ACME api.
Removing this check fixes things, but we get many more calls:
The culprit for all these newAccount calls (/sign-me-up
in Pebble) with {"onlyReturnExisting": true}
is a function, named accountKID
, that gets called every time a POST request is made in the x/crypto/acme library:
// accountKID returns a key ID associated with c.Key, the account identity
// provided by the CA during RFC based registration.
// It assumes c.Discover has already been called.
//
// accountKID requires at most one network roundtrip.
// It caches only successful result.
//
// When in pre-RFC mode or when c.getRegRFC responds with an error, accountKID
// returns noKeyID.
func (c *Client) accountKID(ctx context.Context) keyID {
c.cacheMu.Lock()
defer c.cacheMu.Unlock()
if !c.dir.rfcCompliant() {
return noKeyID
}
if c.kid != noKeyID {
return c.kid
}
a, err := c.getRegRFC(ctx) // 💣️ Returns urn:ietf:params:acme:error:accountDoesNotExist
if err != nil {
return noKeyID
}
c.kid = keyID(a.URI)
return c.kid
}
The above error is ignored, and the "actual" POST request e.g. NewOrder will still happen. But since the Key ID isn't set, the POST request will be performed using the jwk
field instead of kid
.
I came to the conclusion that the x/crypto/acme
library may not be RFC compliant. When the NewOrder function is called without having run Register first, you will end up with a POST request with jwk
instead of kid
.