OK, was able to spend a bit more time on this and found the real issue. It had nothing to do with ECH, fortunately, which I’ve re-enabled. The problem came down to Discourse’s built-in LetsEncrypt SSL/TLS certificate management being unable to auto-renew itself because of our Cloudflare configuration.
The short version: the short-lived LetsEncrypt SSL/TLS certificate expired (as it’s designed to do), and with no working renewal in place, the Discourse forum’s certificate no longer passed Cloudflare’s strict-mode SSL verification, so visitors began seeing errors when trying to load stuff here. (This is also why the downtime only affected the Discourse forum; the main sites were unaffected.)
The longer, more technical version, if desired:
Out of the box, Discourse has an easy-to-use template for enabling a fully managed LetsEncrypt setup. You feed it your LE account email address, and it handles the rest—issuance, installation, renewal, updates, everything. It works great!
However, the Discourse LetsEncrypt template uses the HTTP-01 challenge for domain verification, and that presents a problem with the way we’ve got Cloudflare configured. Specifically, LE with HTTP-01 requires that HTTP port 80 be open on the Discourse host, and in our setup Cloudflare automatically redirects all HTTP traffic to HTTPS. There are a couple of workarounds for this, but they all come with compromises or otherwise kind of suck.
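For anyone curious what HTTP-01 actually asks for, here is a rough Python sketch, not what Discourse runs internally. Per the ACME spec (RFC 8555), the validator fetches a well-known path over plain HTTP on port 80 and expects the token plus the account key thumbprint back. The domain, token, and thumbprint values are placeholders, and `is_https_redirect` is a hypothetical helper for spotting the kind of forced HTTP-to-HTTPS redirect described above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Path prefix defined by RFC 8555 for HTTP-01 challenge responses.
WELL_KNOWN_PREFIX = "/.well-known/acme-challenge/"


def http01_challenge_url(domain: str, token: str) -> str:
    """URL the validator requests first: always plain HTTP on port 80."""
    return f"http://{domain}{WELL_KNOWN_PREFIX}{token}"


def http01_expected_body(token: str, account_thumbprint: str) -> str:
    """Key authorization the host is expected to serve at that URL."""
    return f"{token}.{account_thumbprint}"


def is_https_redirect(status: int, location: Optional[str]) -> bool:
    """Hypothetical helper: True if a response is a redirect to an https:// URL."""
    return (
        status in (301, 302, 307, 308)
        and bool(location)
        and urlsplit(location).scheme == "https"
    )
```

Feeding the status code and `Location` header from a plain-HTTP request for the challenge URL into `is_https_redirect` is a quick way to see whether something in front of the host is rewriting the request before it gets there.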
Rather than workarounds, the “correct” method is to use a different challenge, like DNS-01, which verifies the domain by attempting to create a temporary TXT record in your DNS forward lookup zone, and then later in the process checking that TXT record to make sure it exists as expected. This requires a few more configuration details—like API credentials for your DNS provider to do the TXT record creation—but it also works great with Cloudflare proxying and auto-HTTP-to-HTTPS-redirects because it doesn’t require hosts listen on port 80.
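To make the DNS-01 flow a bit more concrete, here is a small Python sketch of what goes into that temporary TXT record, based on the ACME spec (RFC 8555) rather than on any particular client. The record lives at `_acme-challenge.<domain>`, and its value is the unpadded base64url SHA-256 digest of the key authorization. The function names are made up for illustration; in practice the ACME client and your DNS provider’s API handle all of this.

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Name of the TXT record checked during DNS-01 validation."""
    # Wildcard names are validated at the base domain.
    if domain.startswith("*."):
        domain = domain[2:]
    return f"_acme-challenge.{domain}"


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """TXT value: base64url(SHA-256(key authorization)) without '=' padding."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Nothing here touches port 80 or the web server at all, which is why this challenge type is indifferent to proxying and redirects.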
In our case, I’d initially set Discourse up without being proxied behind Cloudflare, and then switched on Cloudflare proxying a few days before we went live on August 6. Up until then, the built-in LetsEncrypt functionality was working great, and while doing the long list of things necessary to get the site ready to go public, I just flat-out didn’t think about the HTTP-01 validation issue…until the last-issued automatic certificates expired on the evening of September 9 and things began breaking.
The fix was to disable Discourse’s built-in LetsEncrypt functionality and move SSL/TLS certificate issuance and renewal out of the Discourse docker environment and up to the host level instead. With that done and fresh new SSL/TLS certificates issued, I was able to reset Cloudflare to strict SSL and re-enable ECH, and things once again appear to be OK.
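With renewal now living at the host level, a simple expiry check is the kind of thing that would flag a silently failing renewal before visitors do. This is only a sketch of the idea in Python, not the setup described above: the 30-day threshold is an arbitrary assumption, and `not_after` is the `notAfter` string as it appears in the dict returned by `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and the certificate's notAfter timestamp."""
    # cert_time_to_seconds parses strings like "Jan  5 09:34:43 2018 GMT".
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, threshold_days: float = 30) -> bool:
    """True if the certificate expires within the (assumed) threshold."""
    return days_until_expiry(not_after, now) < threshold_days
```

Run from cron with `datetime.now(timezone.utc)` as `now`, something like this could send an alert well before the expiry date.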
Thanks for sticking with us as we nail down what are (hopefully) the few remaining launch issues and settle into (hopefully) stability with Discourse!