
Decrease max object TTL in varnishes
Closed, ResolvedPublic

Description

Currently we set a hard cap on object lifetime at 30 days in our VCL for all clusters (in addition to a few tighter restrictions in certain cases). I think we can and should reduce this lifetime.

Possible Concerns

  1. Obviously, cache hitrate could be negatively impacted. However, I suspect this isn't a big problem in practice. If we end up reducing some long-lived objects from 30 days to, say, 14 days, the effective hitrate if the object is very hot is virtually unchanged. For example, if it's requested once per second and virtually never changes, we've gone from an effective hitrate of 99.9999614% to 99.9999173% (the arithmetic is spelled out just after this list). The less hot an object is, the less it matters for overall perf/hitrate averages anyways.
  2. Long-lived objects help protect us in certain operational corner cases. The principal example is taking a cache cluster offline from live traffic for multiple days (e.g. due to network link risks), and then bringing it back online later without wiping (because the link was never actually down, and purges were flowing fine). In that scenario, the cache will effectively wipe itself anyways if the downtime exceeds the lifetime of most (or all) objects.
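
For reference, the hitrate figures above come from a simple illustrative model (an assumption, not a measurement): one request per second, and exactly one miss each time the TTL expires:

\[ \text{hitrate} \approx 1 - \frac{1}{r \cdot \mathrm{TTL}} \quad\Rightarrow\quad 1 - \frac{1}{86400 \cdot 30} \approx 99.9999614\%, \qquad 1 - \frac{1}{86400 \cdot 14} \approx 99.9999173\% \]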

The upside is that by reducing the maximum cache lifetime, we reduce concerns and headaches related to stale objects (or at least, fears of very-stale objects) from code/asset deployers. In other words, we're able to provide a tighter guarantee of the form "Even if all else goes wrong with invalidation, nothing in this cache can possibly be older than X".

I'd like to propose that we come down first from 30 to 21 days, wait a month to make sure we've seen the effects, and then move down to 14 days, and remain at that value for the foreseeable future.
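
For concreteness, the cap being discussed boils down to a clamp on beresp.ttl at fetch time. A minimal sketch (illustrative only; the production VCL has additional per-cluster logic):

sub vcl_fetch {
    # clamp the object lifetime in this cache, regardless of what the applayer asked for
    if (beresp.ttl > 21d) {
        set beresp.ttl = 21d;
    }
}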

I've taken a few stats samples so far (single cache host, ~10 minute samples) to get some preliminary ideas. On the upload cluster, I'm seeing served Age: headers >= 86400 (1 day) on 0.01% of responses. On the text cluster, the distribution maps out like:

1s+: 99.70% (age < 1s: 0.30%)
1m+: 90.71% (age < 1m: 9.29%)
1h+: 53.85% (age < 1h: 46.15%)
1d+: 37.33% (age < 1d: 62.67%)
7d+: 12.37% (age < 7d: 87.63%)
14d+: 0.70% (age < 14d: 99.30%)
21d+: 0.67% (age < 21d: 99.33%)
[original figures in description here were flawed, these are more-valid numbers]

Related Objects

Event Timeline


I was thinking about all of this last night, and we can probably mitigate some of the low-TTL-cap operational concerns with some modifications to the parameters and usage of grace mode. Grace mode also works "better" in varnish4 in various ways, and varnish4 is coming Soon(tm) (but may not be deployed everywhere until the end of next quarter, and no promises at all yet). To recap a bit: grace-mode is what lets expired objects stick around in the cache storage and even get served to users in limited circumstances (maybe when busy fetching a refresh of that object and/or maybe when backends are dead/unreachable/unhealthy too).

Today in our varnish3 config, we're taking advantage of at most 1 hour of grace-time (sometimes as little as 5 minutes). It's not clear to me that our grace-mode objects work properly or ideally when passed between multiple cache layers/tiers, or that they differentiate in the right ways between the "fetching" and "down" cases; further, we know varnish3 in general isn't good at the grace-while-fetching thing to begin with.

If we could get it working in a reliable and predictably-useful manner (which may or may not depend on varnish4!), we could limit our standard obj.ttl cap to a value that meets the basic needs of traffic reduction and performance, while letting grace-mode extend much longer for operational/outage type scenarios.
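
For reference, the classic varnish3 grace pattern being referred to looks roughly like the sketch below (values chosen to match what we run today: 5 minutes normally, up to an hour when a backend is unhealthy; this is an illustration, not our exact VCL):

sub vcl_fetch {
    # keep expired objects around for up to an hour past their TTL
    set beresp.grace = 1h;
}

sub vcl_recv {
    if (req.backend.healthy) {
        # normally, only serve stale content for up to 5 minutes past expiry
        set req.grace = 5m;
    } else {
        # if the backend is unhealthy, allow serving stale content for the full hour
        set req.grace = 1h;
    }
}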

Change 269967 had a related patch set uploaded (by BBlack):
VCL: ttl fixed/cap params vcl_fetch

https://gerrit.wikimedia.org/r/269967

Change 269968 had a related patch set uploaded (by BBlack):
VCL: drop default ttl_cap to 21 days

https://gerrit.wikimedia.org/r/269968

Change 269967 merged by BBlack:
VCL: ttl fixed/cap params vcl_fetch

https://gerrit.wikimedia.org/r/269967

Change 269968 merged by BBlack:
VCL: drop default ttl_cap to 21 days

https://gerrit.wikimedia.org/r/269968

I took a look at another small sample of data today, over on the cache_upload clusters, which we'd expect to behave very differently. This was a single 10-minute run on an eqiad upload cache. Things to keep in mind:

  1. The upload frontends are (and have been historically) limited to a 1h cache lifetime. This seems "bad" from a design perspective - there's no fundamental reason not to let objects live as long as they're able in the frontends, within the 30d limits at the backends. I've left it alone so far simply because it's probably (and perhaps accidentally) helping to paper over fallout from cache purge race conditions, where a frontend might otherwise keep longer-lived objects past their race-losing purge in the backends.
  2. Due to the above, we can't accurately measure this at the frontend layer like we do with cache_text, as all served objects there have a 1h TTL or less regardless of how long they live in the backends. Therefore the statistics I pulled were from the cache hits of an eqiad backend instance, which backends all datacenters.

All of that said, the results of binning up the Age: values coming out of an eqiad backend, for cache hits only, look like:

Total: 375009
1s+:  99.99%
1m+:  99.77%
1h+:  88.55%
4h+:  54.53%
12h+: 0.01%
1d+:  0.01%
7d+:  0.01%
14d+: 0 (actually 0, not just rounded to 0.00%)

The dropoff somewhere between 4 and 12 hours could be the result of the total set of unique URLs commonly fetched simply not fitting in the total hashed backend storage, leading to a naturally-short cache rollover time. Our total hashed backend storage in eqiad is ballpark 9.36TB, which is further split into subsets for giant objects and regular-sized objects (the split is at a 100MB size limit; ~17% goes to larger objects for ~1.5TB, and ~83% to smaller ones for ~7.7TB).

In the net, we know from previous measurements that the cache object hitrate for the upload cluster is ~98% (counting hits at any layer as a hit), so I know we're not in performance trouble from lifetimes and/or LRU eviction in the general case.

I re-ran the parsing script over the exact same input data as the last results, with finer-grained detail on the 4-12h range (I had captured the output at an intermediate stage of the pipeline just in case):

Total: 375009
1s+:  99.99%
1m+:  99.77%
1h+:  88.55%
2h+:  76.09%
3h+:  64.80%
4h+:  54.53%
5h+:  45.59%
6h+:  37.42%
7h+:  29.80%
8h+:  22.96%
9h+:  16.56%
10h+: 10.76%
11h+: 5.23%
12h+: 0.01%
1d+:  0.01%
7d+:  0.01%
14d+: 0

The time falloff there does seem somewhat "natural" in its pattern, although the fact that the natural pattern winds down at exactly 12h is a little smelly of some other limitation there...

Next upload datapoint: an esams backend instance (it pulls from eqiad, and gets requests from esams frontends):

Total: 704980
1s+:  100.00%
1m+:  99.95%
1h+:  98.87%
2h+:  97.52%
3h+:  95.62%
4h+:  93.46%
5h+:  91.03%
6h+:  88.42%
7h+:  85.60%
8h+:  82.64%
9h+:  79.47%
10h+: 76.23%
11h+: 73.04%
12h+: 69.99%
1d+:  32.63%
7d+:  0.00%
14d+: 0 (truly zero)

Things to note about esams:

  1. This data was for hits where the hit was at eqiad or esams backends.
  2. While esams isn't tier-1 (it backends to eqiad caches), the total hashed storage in esams is 11.5TB vs eqiad's 9.36TB.

The implications I see here are:

  1. The extra 2.2TB of storage moves the natural cache rollover times out a bit, so that they seem to be tapering down to zero-ish at ~2 days instead of 12 hours, but otherwise the pattern is similar.
  2. We're still getting zero hits at 14 days+ ... ?

(note I've edited some of my cache_upload commentary above to remove questions/mysteries that turned out to mostly be my own braindeadness)

So, I've figured out some of the things that were confusing me yesterday. To recap that:

  1. I now question and need to investigate whether our TTL caps are really effective in the first place. In practice not many hits live as long as the caps anyways, but I think the previous thinking (that capping beresp.ttl on fetch at all layers is effective) is wrong. Capping beresp.ttl may affect the TTL in the local cache, but I don't think it actually affects the cacheability headers sent to the next layer up the chain. So we could, in fact, see objects live longer than the TTL cap in total with our current VCL. In other words, capping at 30 days of life at each of 3 layers of varnishd could equate to an effective 90 day cap when we're talking about absolute limits. This is not the first time I've been confused about related things, though - needs more investigation.
  2. swift doesn't send any cacheability info, so the default is going to be the varnishd default_ttl setting, which is currently 3 days.
  3. The beresp.ttl fixed/cap settings at various upload layers/tiers have to consider that effect. That's why the 1h cap on upload-frontend works at all: the object arrives with no TTL (no cacheability headers), defaults to 3 days, then gets capped down to 1h. For the backends, the current ttl_fixed + ttl_capped set it to 30 days independently at each layer (which could theoretically be additive as in (1) above), but we can't just remove the ttl_fixed at the tier-2 backends to fix that, as that would revert to default_ttl of 3 days. A rough sketch of this layering effect follows below.
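
Hypothetical illustration of that layering effect (simplified; not the real per-cluster templates, and the header pass-through behavior is exactly the open question in (1) above):

sub vcl_fetch {
    # If the applayer (e.g. swift) sent no cacheability headers, beresp.ttl arrives
    # here already set to varnishd's default_ttl (currently 3 days).
    # A per-layer cap like this only changes how long *this* varnishd keeps the object:
    if (beresp.ttl > 30d) {
        set beresp.ttl = 30d;
    }
    # Per the open question in (1): capping beresp.ttl here does not appear to rewrite
    # the Cache-Control/Age headers forwarded to the next layer up, so each layer may
    # apply its own cap independently and the caps could add up across tiers.
}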

Andrew triaged this task as Medium priority. Apr 14 2016, 9:05 PM

Change 287109 had a related patch set uploaded (by BBlack):
VCL: cap all TTLs at 14d (or less in existing cases)

https://gerrit.wikimedia.org/r/287109

We're overdue to circle back to this, but there's also a lot of investigating and thinking left to do, and IMHO the varnish4 transition as well as the Surrogate-Control ideas ( T50835 ) play into this. I think we're ultimately going to solve this problem with varnish4 and some custom Surrogate-Control stuff that's initially just inter-cache, and later we can expand that to supporting it from MediaWiki as well. For now, I think further dropping the text TTL cap from 21d to 14d, and dropping the upload cap from 30d to 14d, as in the patch above, will be an improvement and possibly help patch over any current fallout.

Change 287109 merged by BBlack:
VCL: cap all TTLs at 14d (or less in existing cases)

https://gerrit.wikimedia.org/r/287109

Change 291059 had a related patch set uploaded (by BBlack):
VCL: lower TTL caps from 14 to 7 days

https://gerrit.wikimedia.org/r/291059

Change 291059 merged by BBlack:
VCL: lower TTL caps from 14 to 7 days

https://gerrit.wikimedia.org/r/291059

Change 291220 had a related patch set uploaded (by BBlack):
cache_text: cap frontend TTL at 1d

https://gerrit.wikimedia.org/r/291220

Change 291220 merged by BBlack:
cache_text: cap frontend TTL at 1d

https://gerrit.wikimedia.org/r/291220

Change 295007 had a related patch set uploaded (by BBlack):
cache_upload: experiment with 4h fe ttl cap

https://gerrit.wikimedia.org/r/295007

Change 295007 merged by BBlack:
cache_upload: experiment with 4h fe ttl cap

https://gerrit.wikimedia.org/r/295007

How does the cache ttl of Varnish interact with the concept of 304 renewals?

I remember in the past we often had bugs where a cache object had expired (but not yet been garbage collected), at which point Varnish does (and should) make a request to the backend with an If-Modified-Since header. At this point, MediaWiki would respond with 304 Not Modified (since the page wasn't edited since that timestamp), and Varnish would renew the cache object.

This would cause data that is not strictly versioned to go stale indefinitely:

  • Skin html.
  • Links from that html to other static files (e.g. powered-by image).
  • Anchor links in the navigation sidebar (configurable through MediaWiki:Sidebar, and extendable from PHP extensions as well, which can get deployed or undeployed, e.g. WikimediaShopLink).
  • Translated interface messages such as "View history" etc.

I don't know if that problem ever got fixed, but if it hasn't been, then merely lowering the TTL in Varnish is not enough to unblock T127328.

Note, this "304 renewal" behaviour is very much intended and required in general. (The whole point of 304 is that you determine freshness without computing and transferring the whole page again). Max-age (in Cache-Control) isn't about how long a client stores the content. It's about how long the client may blindly use the content without checking with the server (=304).

However, if we want fault tolerance and easy migration, max-age isn't the way to do it. A low max-age does not mean that broken html will roll over after it expires. It also doesn't mean that it's safe to remove "unused" end points 30 days after we no longer emit them. So let's make sure that we understand what this "ttl" means exactly; if needed, we may need a second mechanism that (when reached) would result in Varnish requesting the backend without an If-Modified-Since/If-None-Match header.

Varnish 3 and 4 may differ a bit on 304 basics, and Varnish 4 clearly does a better job of managing grace-mode in general and using it for 304-refreshes. My current recollections may be more Varnish4-tainted and miss something about Varnish3 without digging deeper. All that being said:

  1. Yes, in general varnish will re-use stale objects for 304-refresh from backends.
  2. I don't think it uses any random object that happens to still exist in storage. It re-uses objects that are still in their grace time, and once they're out of grace they're gone for all practical purposes, regardless of low-level storage GC/reuse.

So if an object has 7 days of real TTL and an additional 1 day of grace time, then if it receives a request during the 8th day that could theoretically have used the stale object, it does a conditional request (e.g. IMS, or maybe even ETag) to the backend. If the conditional request gives a 304, it refreshes the life of the stale object, reusing the content, and updates the relevant headers from the ones that came with the 304. If there was no request during the 8th day, a request on the 9th day would be a normal cache miss.

IMHO, if MediaWiki is handing out illegitimate 304s in response to conditional requests (saying something is Not Modified when it was, in fact, modified), then that's the bug to be fixed here.

I should have noted above: our current maximum grace is 1 hour beyond whatever the TTL is. Basically we're really not using grace very effectively today, but it's enough to be sure we handle the overlap well on fairly hot items that need to be refreshed occasionally.

@BBlack I agree that technically "Not Modified" is a lie from MediaWiki in that case, but I'm not convinced that behaviour is wrong or needs changing.

In many cases Not Modified means "not *significantly* changed". For two reasons:

  • Computational overhead to determine exact changes.
  • Impact of global cache invalidation on insignificant changes.

All of the below are examples of things that technically change the HTML output, but are not currently tracked (they are effectively stateless and just happen in whatever way they are currently configured - unlike content revisions, which have a timestamp and a revision ID).

  • Vector skin HTML.
    • e.g. wgReferrerPolicy and other things in wmf-config.
  • Static file references
    • e.g. bits.wikimedia.org > $wikidomain, $wikidomain.org/static/1.28-$version > $wikidomain/w.
  • Sidebar configuration.
    • e.g. installing or disabling WikimediaShopLink.
  • Any interface message.
  • Much more...

The only way to reliably track these is to essentially forego the optimisation for 304 responses, do a full page render, and make a hash digest (and use ETag to communicate it). It also would effectively lead to a full cache invalidation if anything changes anywhere. (Though Varnish and browsers would still be allowed to unconditionally cache for the ttl duration; after that, it would always cause a fresh page render to happen in the backend, though the output wouldn't need to be transferred per se.)

The computational overhead is probably manageable given that the majority of it already happens anyway (overhead of contacting Apache backends, initialising MediaWiki WebStart, making several db queries). Wrapping the output is non-trivial, but manageable.

The impact of cache-invalidation may be undesirable though. But the more I think about it, it may not be that bad actually.

Well, it's certainly legal from some point of view. But if you want to claim Not Modified on what are considered minor non-breaking changes then you have to live with the consequences that old content may live on indefinitely due to 304-refresh.

If there are content updates that affect broad swaths of content non-critically (like the examples you mention), couldn't we simply (a) not PURGE all related things immediately from traffic caches and (b) update the IMS timestamp (or ETag) when the parsercache entry is regenerated for each item affected by the change, and store that timestamp/etag with the parsercache output? I assume that's a slow/throttled process for massive updates, and it would let 304 still work correctly and efficiently. As items affected by such changes naturally fall out of TTL time in the caches, they'll get new data if the throttled parsercache update has already hit those objects. It puts an upper bound on how old things can get: up to $total_traffic_TTL after the slow parsercache update is done for a given change.

For as long as I can remember (at least 6 years), we've made countless breaking changes based on the basic assumption that caches roll over within ttl ("30 days").

For example, earlier today. From https://gerrit.wikimedia.org/r/#/c/295613/2//COMMIT_MSG

@tstarling wrote:

Also, in RaggettWrapper, switch to the new class mw-empty-elt, following
Html5Depurate, instead of mw-empty-li. The old class can be removed once
HTML caches have expired.

In this case, we're changing the parser output (as usual, without explicitly invalidating the parser cache key and purging all Varnish HTML cache). And expecting to safely remove the CSS declaration for the old output once the caches have expired. The url response from which the CSS is served will "modify" when that happens, and thus affect all cached content. Even once parser cache has rolled over (which does truly roll over, given that it isn't HTTP based, but purely TTL/LRU based) - per T124954#2399694, Varnish will happily renew the old parser output from its stale content over 304, and live on. For another ttl period, and again etc. unless the page is edited or otherwise purged.

Yeah it's not great, but what do you expect to happen? That's what we're telling Varnish to do based on the standards. This is the timeline we're talking about (just using a generic integer counter as time moving forward):

  1. Fresh object X is generated in MediaWiki.
  2. Varnish fetches X and gets some positive TTL N for caching.
  3. The underlying object changes in MediaWiki before the TTL is even up.
  4. X's TTL expires, at which point there's a small grace-window for "stale-while-revalidate" type behavior (so that new content can be fetched (or existing re-validated) without stalling out clients).
  5. Varnish asks if X has been modified since it was last fetched in (2)
  6. Mediawiki says "304 - No, it hasn't, and you can cache it again for another TTL N" <- This is a lie, and if you tell this lie Varnish is going to believe you
  7. Varnish happily refreshes headers/timestamps on the existing object for new clients going forward, putting it in the same basic state it had in (2); it can infinitely loop through these steps.

There are mitigating factors that probably make it unlikely that a bad object gets stuck in this cycle repeatedly:

  1. While our maximum grace (from grepping our VCL) is 60 minutes, that only applies on detection of an unhealthy backend; our default grace is actually 5 minutes, so that's what applies most of the time. An object has to be hot enough to be requested during the 5-minute grace window at the end of its natural expiry to have a chance at the above. If it misses the 5-minute window it's gone for good and the first requesting client has to stall on reloading whole new content into the cache.
  2. Objects can be pushed out of cache storage before they naturally expire (by newer objects) - surviving in the face of this depends, again, on hotness.
  3. We do wipe caches over time, irregularly, due to maintenance. The frontends more often than the backends.

The ones to worry about the most are the very hot objects that we know never go 5 minutes without a fetch somewhere.

Why don't we update IMS timestamp or ETag when cached parser output actually-changes from slow rollover?

Why don't we update IMS timestamp or ETag when cached parser output actually-changes [..]

There is no detection of that kind of change. We don't version the Parser right now. And even if we did, we'd have to somehow salt it with all relevant configuration and the list of activated extensions to be precise. Similar to how it's impractical to do a full hash of the skin output (see previous comment), doing so for parser output would equally require a lot of state tracking and/or doing all the computations we're trying to save in the first place.

Alternatively, we could change MediaWiki to enforce that cached responses will not be used beyond the intended max-age. We'd compare to max(revision.timestamp, now - maxage) instead of just revision.timestamp. That effectively means that if the last tracked change was more than (maxage) in the past, we'll return false from the If-Modified-Since check and respond with a regenerated 200 OK.

Edit: Looks like we do that already! (Done for T46570, which is an example of the kind of bug that happens when secondary content goes stale due to 304 renewals.)

https://github.com/wikimedia/mediawiki/blob/212e40d/includes/OutputPage.php#L773-L786
https://github.com/wikimedia/mediawiki/blob/212e40d/includes/api/ApiMain.php#L1165-L1178

$lastMod = $module->getConditionalRequestData( 'last-modified' );
if ( $lastMod !== null ) {
	$modifiedTimes = [
		'page' => $lastMod,
		'user' => $this->getUser()->getTouched(),
		'epoch' => $this->getConfig()->get( 'CacheEpoch' ),
	];
	if ( $this->getConfig()->get( 'UseSquid' ) ) {
		// T46570: Stateless data can still change even if the wiki page did not
		$modifiedTimes['sepoch'] = wfTimestamp(
			TS_MW, time() - $this->getConfig()->get( 'SquidMaxage' )
		);
	}
	$lastMod = max( $modifiedTimes );

So, with that, we just need to decide what to do with $wgSquidMaxage in wmf-config. That is the effective config for how long untracked content may be served to users, not the Varnish ttl. That maxage is what we should look at when removing unused server endpoints, unused styles, etc. For most purposes, the length of this is merely an inconvenience (shorter allows faster iteration, but migration works either way). For T127328 to be unblocked, however, we need it to actually be low enough, since it's not about migration but about freshness of styles across content. Ideally as low as the Varnish ttl (24 hours).

Change 296495 had a related patch set uploaded (by Krinkle):
Lower $wgSquidMaxage to 1 day for test2wiki

https://gerrit.wikimedia.org/r/296495

Change 296495 merged by jenkins-bot:
Lower $wgSquidMaxage to 1 day for test2wiki

https://gerrit.wikimedia.org/r/296495

@Krinkle - the varnish TTL cap is *per layer*, and it's still 7 days in the backend layers (it's only 1 day in the frontend layers). If the test2wiki change is intended to go to production, IMHO it's not a good idea to drop the squid maxage to 1 day. It needs to at least be 7 days, but I'd start higher than that (14?) until we get past Varnish4 transition for text and can make grace-mode behaviors work better.

Re: detecting parser output changes, couldn't we just do a hash over the output to generate an ETag?

Change 296765 had a related patch set uploaded (by Krinkle):
Set $wgSquidMaxage to 14 days on test2wiki

https://gerrit.wikimedia.org/r/296765

Re: detecting parser output changes, couldn't we just do a hash over the output to generate an ETag?

That's a paradox. If we do that, we'd have to validate the ETag on an If-None-Match request by invoking the parser and extension hooks on those backend requests, hashing the output of that, and comparing the hashes. That would be rather expensive.

Most things previously mentioned, and dozens more aspects of a MediaWiki page response, are actually not even in the parser output cache. They're also not versioned in a way accessible to the run-time. To verify nothing changed, one would have to build the whole page. Since that's too expensive, we've essentially decided long ago to instead only track the critical portion (revision content). The rest is still important, but as long as we can be sure that html responses unconditionally expire and regenerate-on-demand after X time - it's fine. Slow deployment is acceptable for those, as long as they do get universally deployed, eventually, and within a predictable timeframe.

That timeframe has historically been 31 days. Until last year we did leak a fair amount beyond 31 days due to 304-renewals, but that was fixed after T46570 by forcing Last-Modified to be max(revision.timestamp, cacheEpoch, now-smaxage).

In most cases, changes of this kind are not very noticeable and okay to roll out gradually over our content (e.g. for up to 30 days, different articles may have either the old or new version). For example, the migration from bits urls to local load.php was okay to roll out slowly. For more user-visible aspects, we tend to use CSS - in which case they do roll out globally at once, since that's a separate request url. However that's changing with T127328, which will move some styles into the html.

A few years ago we made some major changes to the Vector skin - and for a month users perceived an alternating layout from one page to another. I hope to avoid that in the future with T111588 (which, like ESI, applies the skin as a separate cacheable entity at the edge).


Anyway, back to the topic of this task. Let's start by lowering smaxage from MediaWiki to 14 days?

Change 296765 merged by jenkins-bot:
Set $wgSquidMaxage to 14 days on test2wiki

https://gerrit.wikimedia.org/r/296765

@Krinkle - I think 14d for the maximum s-maxage MW advertises to Varnish is fine for now. We'd obviously like to, in the long run, get the effective lifetimes even lower (both enforced in Varnish, and in the s-maxage or similar from MW), but I don't think it's safe to go much lower until we get through the V4 transition and switch to proper use of Surrogate-Control between layers and using grace-mode correctly to handle the datacenter/network outage cases (as in, have the "normal" TTLs down somewhere in the 1d range, but have grace-mode capable of using stale objects in emergencies for a week).

Change 298968 had a related patch set uploaded (by BBlack):
cache_upload: 1d FE TTL cap

https://gerrit.wikimedia.org/r/298968

Change 298970 had a related patch set uploaded (by BBlack):
cache_misc: raise default_ttl to 1h

https://gerrit.wikimedia.org/r/298970

Change 298968 merged by BBlack:
cache_upload: 1d FE TTL cap

https://gerrit.wikimedia.org/r/298968

Change 298970 merged by BBlack:
cache_misc: raise default_ttl to 1h

https://gerrit.wikimedia.org/r/298970

Change 299153 had a related patch set uploaded (by Krinkle):
Lower default $wgSquidMaxage from 31 days to 14 days

https://gerrit.wikimedia.org/r/299153

Change 299153 merged by jenkins-bot:
Lower default $wgSquidMaxage from 31 days to 14 days

https://gerrit.wikimedia.org/r/299153

Change 343845 had a related patch set uploaded (by Ema):
[operations/puppet] varnish: swap around backend ttl cap and keep values [2/2]

https://gerrit.wikimedia.org/r/343845

Change 343844 had a related patch set uploaded (by Ema):
[operations/puppet] varnish: swap around backend ttl cap and keep values [1/2]

https://gerrit.wikimedia.org/r/343844

Change 343845 merged by Ema:
[operations/puppet@production] varnish: swap around backend ttl cap and keep values [2/2]

https://gerrit.wikimedia.org/r/343845

Recap of recent progress: where we're at now is a hard cap of 1 day TTL within each cache layer, regardless of any longer max-age sent by the application layer. Depending on the user's geographic location, there can be anywhere from 2 to 4 cache layers involved in their request. In edge cases with hot items the per-layer TTL cap behavior will have a natural race condition which under uncommon conditions could cause the total TTL of the caching stack to be multiplied by the number of layers, resulting in 2-4 days of total TTL before the object is fully expired for all users.

We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days, assuming there are no application-layer HTTP bugs in play (e.g. the application incorrectly giving a 304 Not Modified response to a conditional request from the cache, for content which has in fact been modified).

Our next step here is to begin using Surrogate-Control headers for inter-cache communication of capped TTLs, which will remove the layer-multiplication issues and give us a hard limit for the total cache stack of 1 full day. There are some interactions between that work and related grace/keep issues (calculating cache-local ttl and grace values as percentages of the total TTL, etc.), so they should probably be tackled in tandem.
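
As a purely hypothetical illustration of that next step (the header format and the exact VCL are not settled), each layer would advertise its already-capped TTL to the layer above, and the receiving layer would honor it rather than stacking another cap on top:

import std;

sub vcl_backend_response {
    if (beresp.http.Surrogate-Control ~ "max-age=[0-9]+") {
        # a lower cache layer already capped this object; trust its advertised TTL
        set beresp.ttl = std.duration(
            regsub(beresp.http.Surrogate-Control, ".*max-age=([0-9]+).*", "\1s"), 1d);
    } elsif (beresp.ttl > 1d) {
        # otherwise apply our own 1d cap as usual
        set beresp.ttl = 1d;
    }
    # advertise the (capped) lifetime to the next layer up (value shown fixed for simplicity)
    set beresp.http.Surrogate-Control = "max-age=86400";
}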

[..] We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days, assuming there are no application-layer HTTP bugs in play (e.g. the application incorrectly giving a 304 Not Modified response to a conditional request from the cache, for content which has in fact been modified).

Is there an upper limit to how long or how often the same cache object can be "304-whitewashed"? (E.g. as long as it keeps being requested in the grace period between ttl expiring and object actually being removed from storage).

I assume that it does allow infinite white-washing, and that that is by design. As I understand it:

  • ttl is how long the object is considered fresh.
  • grace is how long to keep it around past the ttl so that it may be served stale to the user while it is being renewed (by a 304 Not Modified response) or replaced (by a 200 OK response) in the background.

I've recently seen new Varnish configuration for a property called obj.keep / beresp.keep. It's unclear to me how keep fits in with this. If an object is beyond ttl+grace, what purpose will the object serve? I suppose the only remaining use is: if a request is made after ttl+grace but within keep, the object can be used to renew the cache if a conditional request to the backend yields a 304 response.

MediaWiki quite often responds with a 304 Not Modified when the response is in fact different because we only track the internal wiki page content as means for validating If-Modified-Since. Changes to MediaWiki core output format, WMF configuration changes, and changes to the Skin, are not tracked in a way that the application is aware of. And besides, we wouldn't want to reject the entire global cache every time a minor change or configuration change happens. For the most part, the architecture design for large-scale MediaWiki deployments is that all state outside the actual revision history of wiki pages is observed as static. And we rely on cache expiry to base compatibility decisions, such as:

  • How long to keep CSS or JS code around for HTML compatibility? (E.g. when changing something in the HTML output that is styled by CSS or enhanced by JS, we keep both the old and new CSS/JS around until we believe any previously generated HTML has dropped out of the CDN caches.)
  • How long before we remove a file from /static after updating MediaWiki configuration to output references to a different file.

This kind of decision happens almost every week. And for that, we need a high-confidence threshold for how long cache is supposed to take to fully turn over. In extreme cases we'll get real data (e.g. tail varnishlog, query wmf.webrequest in Hive, ad-hoc use of statsv or EventLogging), but doing that every time doesn't scale. (And shouldn't be needed.)

Historically, the upper limit was a month ("31 days"). Last year this was lowered to 14 days. Over the last few months, some people assumed it to be 7 days, 5 days, or 4 days, but I'm holding on to "14 days" until I hear otherwise.

Assuming infinite white-washing, the upper limit is effectively decided by wgSquidMaxage. This is currently 14 days. MediaWiki will always generate a fresh response when the previously stored object is older than this. Precisely to ensure "static" changes (e.g. Skin layout, config changes, core features etc.) will propagate eventually.

Should we lower $wgSquidMaxage to, say, 5 days? That would give it a day breathing room from Varnish perspective (4 days), while still confidently under the deployment frequency (7 days) – which would allow us to reduce HTML-compat to 1 week instead of 3 weeks (rounding up).

[..] We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days, assuming there are no application-layer HTTP bugs in play (e.g. the application incorrectly giving a 304 Not Modified response to a conditional request from the cache, for content which has in fact been modified).

Is there an upper limit to how long or how often the same cache object can be "304-whitewashed"?

As far as I know, there's no upper limit and Varnish will infinitely whitewash via 304 so long as an object is within its total keep time each time it needs to refresh. The infinite cycle would stop if the object ever went un-accessed long enough (e.g. over a week). To recap varnish behavior, the 3 values in play are ttl, grace, and keep, and they add up serially (the timers do not run concurrently). TTL is the basic lifetime of the object. After the TTL is expired, if the object is still within the grace period it can be served stale to a user while the content is refreshed in the background (possibly via conditional request, if applicable). Once the grace period has expired, the object can remain valid in storage for the duration of the keep timer, during which the contents can only be used as the source of a conditional, synchronous verification to the applayer looking for a 304 to refresh the life of the contents (saving transfer bandwidth and storage churn vs a 200). Our current settings are to cap the application-provided TTL to 1-day, fixed grace period of 5 minutes, and cap the keep value to an additional 7 days (if the app-provided TTL is <7d, the keep value currently gets lowered to the app-provided TTL, to help minimize bad-304 fallout with shorter-lived objects).

So, in the standard MediaWiki case of fresh page object with a 14d app-specified TTL, the backendmost cache will end up with ttl=1d + grace=5m + keep=7d for a total of 8d5m duration that the content is considered valid for a conditional refresh via 304, and the 304 cycle can repeat indefinitely AFAIK, keeping stale content alive forever if the application layer always claims it's still unmodified.
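
In VCL terms, the current settings described above amount to something like this varnish4 sketch (an approximation of the logic, not the literal production template):

sub vcl_backend_response {
    # cap keep at 7d, or lower it to the applayer-provided TTL if that is shorter
    # (this runs before the TTL cap below, so beresp.ttl is still the app-provided value)
    if (beresp.ttl < 7d) {
        set beresp.keep = beresp.ttl;
    } else {
        set beresp.keep = 7d;
    }
    # then cap the effective TTL in this cache at one day, with 5 minutes of grace
    if (beresp.ttl > 1d) {
        set beresp.ttl = 1d;
    }
    set beresp.grace = 5m;
}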

MediaWiki quite often responds with a 304 Not Modified when the response is in fact different

We'll obviously have to work with what we have today, but for the record this is Not Ok, and should probably be addressed in the future. It will probably continue to be a pain point with various future cache and/or proxy technologies. It's a real problem with HTTP semantics, and it's hard to ever hack around it in an appropriate way that doesn't introduce other subtle issues. Referencing the justification quoted below (because I'm leaving this argument aside for the remainder): you wouldn't have to invalidate the whole global Varnish cache every time a minor skin change happens in order for the 304 mechanism to work correctly. Skin updates that affect the main page output could correctly change the conditional responses of MediaWiki without sending an explicit purge to Varnish. The existing objects would still get their normal cache lifetimes and refresh correctly to the new Skin as they expire from their normal TTLs.

The other side of the issue is erring on the safe side of the equation, as we do today (effectively invalidating for conditional refresh all objects older than $wgSquidMaxage). While it's not a semantic problem any more than simply never issuing 304s would be, it's also potentially an unnecessary cause of performance meltdown. There could be cases where we'd hope to rely on 304s to avoid transfer bursts to the caches, but we're getting a full 200 on content that didn't happen to change across that artificial barrier in time. The ideal we'd hope for is that conditional-request semantics apply exactly correctly.

because we only track the internal wiki page content as means for validating If-Modified-Since. Changes to MediaWiki core output format, WMF configuration changes, and changes to the Skin, are not tracked in a way that the application is aware of. And besides, we wouldn't want to reject the entire global cache every time a minor change or configuration change happens. For the most part, the architecture design for large-scale MediaWiki deployments is that all state outside the actual revision history of wiki pages is observed as static. And we rely on cache expiry to base compatibility decisions, such as:

  • How long to keep CSS or JS code around for HTML compatibility? (E.g. when changing something in the HTML output that is styled by CSS or enhanced by JS, we keep both the old and new CSS/JS around until we believe any previously generated HTML has dropped out of the CDN caches.)
  • How long before we remove a file from /static after updating MediaWiki configuration to output references to a different file.

This kind of decision happens almost every week. And for that, we need a high-confidence threshold for how long cache is supposed to take to fully turn over. In extreme cases we'll get real data (e.g. tail varnishlog, query wmf.webrequest in Hive, ad-hoc use of statsv or EventLogging), but doing that every time doesn't scale. (And shouldn't be needed.)

Right, because we're versioning these files, and therefore the core page output changes every time they change, to update the versioning hash in the link reference?

Historically, the upper limit was a month ("31 days"). Last year this was lowered to 14 days. Over the last few months, some people assumed it to be 7 days, 5 days, or 4 days, but I'm holding on to "14 days" until I hear otherwise.

Assuming infinite white-washing, the upper limit is effectively decided by wgSquidMaxage. This is currently 14 days. MediaWiki will always generate a fresh response when the previously stored object is older than this. Precisely to ensure "static" changes (e.g. Skin layout, config changes, core features etc.) will propagate eventually.

Should we lower $wgSquidMaxage to, say, 5 days? That would give it a day breathing room from Varnish perspective (4 days), while still confidently under the deployment frequency (7 days) – which would allow us to reduce HTML-compat to 1 week instead of 3 weeks (rounding up).

It's complicated because $wgSquidMaxage is actually controlling a few different things: the max age sent to Varnish as a TTL signal, the maximum age for which MW will continue conditionally-verifying content that may have changed due to meta-level changes (Skin, etc), and thus also the artificial barrier after which MW will no longer conditionally-verify content that hasn't changed. It's also indirectly controlling our keep-reducing hack, which isn't great since we're hoping the keep values save us from cache meltdown when we have our now-short-TTL caches offline for 1-7d periods (by reducing burst transfer on repool).

I'd propose for now to:

  • Change $wgSquidMaxage to 7 days. If you were to go any lower, it would again cause us burst-transfer problems with our 1-week timeline, because MW is going to consider everything older than this value 304-invalid even if it hadn't changed.

We're still going to aim for 1d TTLs in our Varnishes in general, but given the 304 issues and our need for them to work correctly to handle maintenance and outages appropriately, MW's $wgSquidMaxage really shouldn't go under 7d at this time, and it is also the only TTL you can rely on for things like removing old versioned static files.


Separately, I'd like to eliminate (or at least slightly fix) our "cap the keep value to the TTL" hack, since it doesn't work right on a number of levels.

Since MediaWiki is the only complicated case we care a lot about (the reason we went with the paranoid keep-reduction on short TTLs), if we could verify that there aren't other 304 mis-behaviors from MW that matter for other short-lived objects (e.g. RL? cacheable short-TTL MW API output cases?), I'd propose we move forward with just using a fixed 7-day keep value as the simplest answer. Alternatively, if there are other shorter-TTL objects that do have 304 mis-behavior, we could consider trying to use the actual CC:s-maxage value as a cap on the keep value, rather than the current TTL. But this wouldn't work either for outputs I'm observing today, because of another oddity: when serving "old" objects, MW seems to count down the TTL in the CC:s-maxage field, when the more-correct behavior would be to keep the CC:s-maxage field constant at the $wgSquidMaxage value and count up an Age: output header.

In example terms, what we expect is:

[fresh object just parsed for the first time]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209600
Age: 0

[next request for same object, 60s later]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209600
Age: 60

What we seem to get from MW is:

[fresh object just parsed for the first time]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209600
[no Age header]

[next request for same object, 60s later]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209540
[no Age header]

Since Age is implicitly zero, the calculated TTL of the object (CC:s-maxage - Age) is the same, but this denies us the ability to see an object's policy-based max-age, which is useful information when we're trying to do something intelligent with grace and keep behaviors, as there's a big difference between a 2-week-age type of object that has 10 seconds of life left and a freshly generated object that only ever gets to live for 10 seconds.

BBlack claimed this task.

Closing this ticket as it's getting rather long in the tooth. We did reduce our TTL caps down to 1d across the board at all layers, with up to ~7d keep times, and that did accomplish a lot of what was desired here. Further work on rationalizing MediaWiki's output behaviors is complicated to even comprehend fully and not directly related; new tickets should probably be filed about that.