Hey folks—apologies for the ongoing issue where the SCW homepage and story pages showed cached versions of themselves rather than the latest updates. Good news: As of this morning (28 August), I think the problem is resolved.
The short version
An ancient bit of the site’s nginx configuration was the root cause. It’s gone now!
The longer version
Briefly, nginx is the application SCW and the Eyewall use to serve up WordPress pages to visitors like you; it’s the actual “web server” part of the web server. Nginx affixes a set of headers to each response it sends out, and among those headers is one called Cache-Control. This header suggests to visitors’ browsers and to intermediary caches like Cloudflare how long they should hold onto local cached copies of different components of the site (like its images, CSS, and HTML).
Both SCW’s and the Eyewall’s nginx configs had a line in them telling downstream caches to be a bit more aggressive about hanging onto some components of the site (specifically, HTML responses), and that’s what was biting us. This aggressive caching was causing weirdness where some folks would see the latest versions of pages without issue, while others wouldn’t unless they did a hard refresh.
I’ve removed the offending line from each site’s config, and at least so far, the problem behavior seems to have gone away.
The technical bits
The line in question was part of the initial WordPress config I created when Eric had me take over SCW’s hosting in 2017. The configuration has evolved a lot since then, but this particular line hung on long past its useful lifetime. The relevant portion of the nginx vhost file for SCW looks like this:
location / {
    try_files $uri $uri/ /index.php?$args;
    add_header alt-svc 'h3=":443"; ma=86400' always;
    add_header X-Should-Katy-Evacuate 'HELL YES' always;
    add_header Server 'Madred of Cardassia' always;
    add_header X-Hack 'there are 4 lights' always;
    add_header X-How-Many-Lights-Are-There '5' always;
    add_header Referrer-Policy 'strict-origin-when-cross-origin' always;
    add_header X-Content-Type-Options 'nosniff' always;
    add_header X-XSS-Protection '1; mode=block' always;
    add_header X-Frame-Options 'SAMEORIGIN' always;
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
    add_header X-Nginx-Fastcgi-cache $upstream_cache_status always;
    add_header Cache-Control "public, max-age=7200" always;
}
The problem is that final line, where we’re forcing a Cache-Control header on the root (/) location. With max-age=7200, it tells browsers and intermediary caches that they may keep reusing a response for up to 7,200 seconds (two hours) before checking back for a new one. (The same Cache-Control header was being applied in other location blocks, too, but they look mostly the same and I don’t want to clutter up this post with redundant code snippets.) Explicitly setting Cache-Control with nginx was frustrating Cloudflare’s efforts to enforce its own more appropriate cache settings. So, instead of manually setting Cache-Control on things like dynamic HTML at the origin, the correct strategy here is to get out of the way and let Cloudflare + APO do their thing. Done!
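As a simplified sketch of the fix (not a copy of the live vhost file, and with most of the other add_header lines left out for brevity), the root location after the change amounts to something like this:

```nginx
location / {
    try_files $uri $uri/ /index.php?$args;

    # Security and diagnostic headers stay as they were (omitted in this sketch).
    add_header X-Nginx-Fastcgi-cache $upstream_cache_status always;

    # No "add_header Cache-Control ..." line here any more, so nginx no longer
    # stamps "public, max-age=7200" onto responses from this location.
}
```

The only real change is the missing Cache-Control line; everything else in the block is untouched.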
Additionally, I needed to add a few map rules to nginx’s config to be certain nginx wasn’t sneaking stale pages into its own local FastCGI cache (which could then be cached by Cloudflare, compounding the effect). These new map rules read the PHP response headers from WordPress and ensure that nginx’s FastCGI cache is bypassed for requests that WordPress says aren’t cacheable. This is kind of a belt-and-suspenders approach to the problem, but ain’t nothing wrong with being thorough.
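The exact rules aren’t reproduced here, but a minimal sketch of the idea might look like the following. It assumes the signal from PHP is a Cache-Control response header containing no-cache, no-store, or private; the variable name $wp_says_no_cache is made up for illustration, and the real rules may check different headers.

```nginx
# In the http {} context.
# $upstream_http_cache_control holds the Cache-Control header sent back by PHP.
# $wp_says_no_cache is a hypothetical name used only in this sketch.
map $upstream_http_cache_control $wp_says_no_cache {
    default 0;
    "~*(no-cache|no-store|private)" 1;
}

# Alongside the existing fastcgi_cache directives for PHP requests.
# Any value other than empty or "0" tells nginx not to save the response
# in its FastCGI cache.
fastcgi_no_cache $wp_says_no_cache;
```

The sketch uses fastcgi_no_cache rather than fastcgi_cache_bypass because the bypass check happens before the request is handed to PHP, when there are no response headers to inspect yet.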
Why did this take so long to fix?
There’s an old joke that every IT issue is usually caused by DNS; the corollary is that when it isn’t DNS, it’s usually the cache layer. The truth behind the joke is that both DNS lookups and caching are complex problems that when misbehaving can appear to exhibit a frustrating amount of nondeterminism—that is, the systems sometimes seem to do different things in response to identical inputs.
But as with HAL 9000 going nuts, there’s always a logical reason why. In HAL’s case, the issue was that he was given conflicting directions and felt that the best way to resolve them was yeeting Frank Poole off into space and 86’ing the science crew; in the case of our problems, the issue came down to leftover config lines that once served a very real purpose but that should have been removed long ago.
How do we ensure something like this doesn’t happen in the future?
We’ll do our best! I regularly review the configs for both sites, but under the hood SCW and The Eyewall are a stack of disparate technologies and applications that require some amount of ongoing integration glue in order to keep working together. There are a lot of moving parts to the machine, and sometimes a thing breaks. Fortunately, the kinds of problems we tend to have usually aren’t the kind that actually interfere with delivering accurate forecasts to readers like you—at least not so far!
Are we 100% sure this is the fix?
I think we’ve licked it, yes, but we’ll keep a closer-than-usual eye on the site for the next few days’ worth of posts so we can verify that the problem behavior has indeed stopped.
I’m still seeing stale pages showing up!
Please reply below and let me know what exactly you’re seeing (which page, how you’re accessing the site, etc), and I’ll take a look!