Apologies for the downtime on the evening of 9/9

Apologies for the downtime with Discourse this evening—I’m pretty sure it was my fault. I enabled Encrypted Client Hello on the SCW & Eyewall domains in Cloudflare, and while the main sites were fine, Discourse didn’t seem to like having that option turned on. The fix was to turn it back off, and then to verify that the Cloudflare SSL configuration was set to “Full” instead of “Full (strict)”. (Edit, from the future: ECH had nothing to do with the problem. See the next post for the actual root cause.)

I’m honestly not 100% certain why the second step was necessary. Both SCW & E/W have been operating with Cloudflare in full-strict mode for more than two years, and Discourse uses LetsEncrypt for valid on-box SSL/TLS certs. But the forum is back up, and it’s hard to argue with results, I guess!

Will do a bit more digging tomorrow on re-enabling strict SSL mode, since I feel better with it on, and will post more here if I figure it out.


OK, was able to spend a bit more time on this and found the real issue. It had nothing to do with ECH, fortunately, which I’ve re-enabled. The problem came down to Discourse’s built-in LetsEncrypt SSL/TLS certificate management being unable to auto-renew itself because of our Cloudflare configuration.

The short version: The short-lived LetsEncrypt SSL/TLS certificate expired (as it’s designed to do), and with no renewal in place, the Discourse forum’s SSL/TLS setup stopped passing Cloudflare’s strict-mode SSL verification. Visitors then began seeing errors when trying to load pages here. (This is also why the downtime only affected the Discourse forum; the main sites were unaffected.)

The longer, more technical version, if desired:

Out of the box, Discourse has an easy-to-use template for enabling a fully managed LetsEncrypt setup. You feed it your LE account email address, and it handles the rest—issuance, installation, renewal, updates, everything. It works great!

However, the Discourse LetsEncrypt template uses the HTTP-01 challenge for domain verification, and that presents a problem with the way we’ve got Cloudflare configured. Specifically, LE with HTTP-01 requires that HTTP port 80 be reachable on the Discourse host, and for our setup, Cloudflare is automatically redirecting all HTTP traffic to HTTPS. There are a couple of different workarounds to get around this, but they all come with compromises or otherwise kind of suck.
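For anyone curious what HTTP-01 actually asks for, here is a minimal Python sketch of the URL the validation server requests, based on the ACME spec (RFC 8555, section 8.3). The domain and token are made-up placeholders, and the function name is just for illustration; this is not Discourse's or LetsEncrypt's code.

```python
from urllib.parse import urlsplit


def http01_challenge_url(domain: str, token: str) -> str:
    """Build the URL an ACME server fetches for an HTTP-01 challenge.

    Per RFC 8555 section 8.3, the request goes to a fixed well-known path
    over plain HTTP, which is why port 80 matters for this challenge type.
    """
    return f"http://{domain}/.well-known/acme-challenge/{token}"


# Hypothetical values, for illustration only.
url = http01_challenge_url("forum.example.com", "hypothetical-token")
parts = urlsplit(url)

# No explicit port in the URL means the scheme default, which is 80 for http.
print(parts.scheme, parts.port or 80, parts.path)
```

The point is simply that the first request always starts as plain HTTP on port 80, so anything sitting in front of the host that changes how that request is handled can interfere with renewal.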

Rather than workarounds, the “correct” method is to use a different challenge, like DNS-01, which verifies the domain by creating a temporary TXT record in your DNS forward lookup zone, and then later in the process checking that the TXT record exists as expected. This requires a few more configuration details (like API credentials for your DNS provider to do the TXT record creation), but it also works great with Cloudflare proxying and automatic HTTP-to-HTTPS redirects because it doesn’t require the host to listen on port 80.
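As a rough illustration of what goes into that TXT record, here is a small Python sketch following the ACME spec (RFC 8555, section 8.4). In practice an ACME client does all of this for you; the domain, token, and account thumbprint below are hypothetical placeholders, and the function name is my own.

```python
import base64
import hashlib
from typing import Tuple


def dns01_txt_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return the (record name, record value) pair for a DNS-01 challenge.

    The value is the unpadded base64url encoding of the SHA-256 digest of
    the key authorization, which is "<token>.<account key thumbprint>".
    """
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # ACME uses base64url without the trailing "=" padding.
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", value


# Hypothetical inputs; a real token comes from the CA and a real thumbprint
# is derived from the ACME account key.
name, value = dns01_txt_record("forum.example.com", "hypothetical-token", "hypothetical-thumbprint")
print(name, "TXT", value)
```

Nothing here touches port 80 or the web server at all, which is why this challenge type is unaffected by proxying and redirects.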

In our case, I’d initially set Discourse up without being proxied behind Cloudflare, and then switched on Cloudflare proxying a few days before we went live on August 6. Up until then, the built-in LetsEncrypt functionality was working great, and while doing the long list of things necessary to get the site ready to go public, I just flat-out didn’t think about the HTTP-01 validation issue…until the last-issued automatic certificates expired on the evening of September 9 and things began breaking.

The fix was to disable Discourse’s built-in LetsEncrypt functionality and move SSL/TLS certificate issuance and renewal out of the Discourse Docker environment and up to the host level instead. With that done and fresh new SSL/TLS certificates issued, I was able to reset Cloudflare to strict SSL and re-enable ECH, and things once again appear to be OK.
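Since the root problem was a certificate quietly running out, one way to catch this earlier would be a simple expiry check. Below is a minimal Python sketch, not what we actually run. It assumes you already have the certificate's `notAfter` string in the format Python's `ssl` module reports (for example from `getpeercert()` on a connection made directly to the origin host, since a connection through Cloudflare would show Cloudflare's edge certificate instead). The date used is a made-up example.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return how many days remain before a certificate expires.

    `not_after` uses the format reported by the ssl module,
    e.g. "Jan  5 09:34:43 2018 GMT". `now` is a Unix timestamp and
    defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400  # 86400 seconds per day


# Hypothetical example: pretend "now" is exactly ten days before expiry.
example_not_after = "Jan  5 09:34:43 2018 GMT"
pretend_now = ssl.cert_time_to_seconds(example_not_after) - 10 * 86400
remaining = days_until_expiry(example_not_after, now=pretend_now)

if remaining < 14:  # the 14-day threshold is an arbitrary choice
    print(f"Certificate expires in {remaining:.1f} days; check renewal")
```

A negative result means the certificate has already expired.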

Thanks for sticking with us as we nail down what are (hopefully) the few remaining launch issues and settle into (hopefully) stability with Discourse!
