
FileBackendMultiWrite multi-dc and thumbnail handling
Closed, Resolved · Public

Description

FileBackendMultiWrite is a file backend in MediaWiki that can write to multiple destinations; we currently use it in production. It makes a few assumptions, namely:

  • There is one "master" datacenter
  • Objects get unidirectionally synced from the master to the replica
  • We write/delete from both locations
  • We only read from the master datacenter for simplicity

The "master" datacenter is determined by configuration; we're currently following MediaWiki's active datacenter.

When it comes to thumbnail handling, this is problematic:

  • Thumbnails are read directly from object storage when served to the public, with no interaction with MediaWiki; if a thumbnail at the desired size is not present, the object storage calls thumbor via its 404 ("not found") handler. This happens locally in whichever datacenter the CDN directed the request to, which means the two datacenters can end up with different generated thumbnails for the same image
  • Thumbnails are pre-generated via a job at certain given sizes, but only in the datacenter where swift is pooled in discovery DNS (the request is sent to the CDN and geolocated to the nearest available datacenter from where we're calling)
  • When someone reuploads an image, Filerepo::LocalFile::getThumbnails is called to find which thumbnails are present and should be purged. This operation calls FileBackendMultiWrite::getFileList, which lists the thumbnails present in the current "master" datacenter. A delete command to purge those thumbnails is then sent to both datacenters

Making things worse, we used to have some form of syncing of thumbnails from the master DC to the other one; this was stopped in the past as it was deemed useless in our new multi-dc setup.

This means two big inconsistencies are created:

  • If the list of thumbnails in the two datacenters differs, some thumbnails of the old version of the image will be left behind and not invalidated
  • If/when we switch datacenters for swift and/or MediaWiki at the DNS discovery layer, we might end up not cleaning up even the pre-generated thumbnails, resulting in a very confusing UX for the uploader (see the parent task).

What would fix the issue

  1. FileBackendMultiWrite needs to be able to get separate lists of thumbnail files from the different backends, and I suspect the only way to do that cleanly is to add a new method specifically to list files by backend (see the sketch after this list). Anything short of that would not consistently purge thumbnails from all backends.
  2. ThumbnailRenderJob needs to be able to hit the pregeneration in all backends, not just the one closest to the job execution
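
A minimal sketch of what such a per-backend listing method could look like, assuming the multiwrite backend wraps an array of MediaWiki FileBackendStore objects; the class and method names here are invented for illustration and are not a proposed final API:

// Hypothetical sketch only: names invented for illustration. Assumes
// MediaWiki's FileBackendStore, whose getFileList() accepts
// [ 'dir' => <storage path> ] and yields file names (or null on error).
class PerBackendThumbnailLister {
    /** @var FileBackendStore[] keyed by backend name, e.g. 'local-swift-eqiad' */
    private array $backends;

    public function __construct( array $backends ) {
        $this->backends = $backends;
    }

    /**
     * Return one listing per backend, so a purge can also delete
     * thumbnails that exist in only one datacenter.
     *
     * @param string $dir Thumbnail directory of one image
     * @return string[][] Map of backend name => file names under $dir
     */
    public function getFileListsByBackend( string $dir ): array {
        $lists = [];
        foreach ( $this->backends as $name => $backend ) {
            $files = [];
            foreach ( $backend->getFileList( [ 'dir' => $dir ] ) ?? [] as $file ) {
                $files[] = $file;
            }
            $lists[$name] = $files;
        }
        return $lists;
    }
}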

What can we do to reduce the impact while the above is being fixed

  1. Add an additional variable in etcd to indicate to MediaWiki which datacenter is the swift "master" (meaning: the direction in which originals are replicated); a hypothetical configuration sketch follows this list.
  2. We can ensure that swift, unless we're in an emergency, is always pooled in DNS in the datacenter where MediaWiki is master, and that this datacenter is marked as "master" accordingly.
  3. We could perhaps turn thumbnail syncing back on, with the direction of replication carefully determined.
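
For illustration only, mitigation 1 could look roughly like this in configuration; $wmgSwiftMasterDatacenter is an invented name for the proposed etcd-driven variable, and the exact backend config keys may differ from our real wmf-config:

// Hypothetical configuration sketch, not actual wmf-config.
// $wmgSwiftMasterDatacenter is an invented etcd-driven variable.
$swiftMasterDC = $wmgSwiftMasterDatacenter ?? 'eqiad';
$swiftReplicaDC = ( $swiftMasterDC === 'eqiad' ) ? 'codfw' : 'eqiad';

$wgFileBackends[] = [
    'name' => 'local-multiwrite',
    'class' => FileBackendMultiWrite::class,
    'backends' => [
        // 'isMultiMaster' marks the wrapped backend that reads go to
        [ 'template' => "local-swift-{$swiftMasterDC}", 'isMultiMaster' => true ],
        [ 'template' => "local-swift-{$swiftReplicaDC}" ],
    ],
];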

Event Timeline


From a Data Persistence POV, thumbs are ephemeral / cached - we are going to start expiring them at some point, and the large size of the thumb containers makes running e.g. rclone on them unattractive at best. I'm not very happy about the idea of starting to consider them Real Objects, if you see what I mean.

Even if they're ephemeral, which I think is an ok assumption (thumbnails are a cache), you need to be able to consistently invalidate them when an image is deleted/replaced, which we don't do. I think that is more important than adding a TTL to this cache.

Just think of the fact we might not delete thumbnails of images that were deleted for legal reasons.

Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) then it should happen in all datacenters, not just the one where swift is "master".

@thcipriani what does the Unstewarded-production-error tag mean, in practice? Is there a process to get someone to look into this?

I'm not sure I was able to explain how serious this problem is, and how legally dangerous it is not to fix it.

In T331138#8675244, @Joe wrote:

From a Data Persistence POV, thumbs are ephemeral / cached - we are going to start expiring them at some point, and the large size of the thumb containers makes running e.g. rclone on them unattractive at best. I'm not very happy about the idea of starting to consider them Real Objects, if you see what I mean.

Even if they're ephemeral, which I think is an ok assumption (thumbnails are a cache), you need to be able to consistently invalidate them when an image is deleted/replaced, which we don't do. I think that is more important than adding a TTL to this cache.

I don't disagree that we need to be able to consistently invalidate thumbs.

Just think of the fact we might not delete thumbnails of images that were deleted for legal reasons.

T&S already have somewhat-separate processes for images deleted for legal reasons (that e.g. involve clearing them out of backups); I don't know the details (or what it currently does about thumbs).

In T331138#8675245, @Joe wrote:

Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) then it should happen in all datacenters, not just the one where swift is "master".

I have a feeling @Ladsgroup [on leave all month] did look at this when we adjusted the sizes of some of the thumbs we were pre-generating and using (to make them a bit more standard).

One further thought - it would be nice if we could take swift's special 404-handler out of the equation, and have our object store be more "vanilla" in operation[0]. So e.g. have thumbnail requests be served from thumbor (or equivalent), and have that service worry about the caching/invalidation/etc. This task might not be the place to address that, but it seemed worth flagging if we're going to expend effort on rethinking how thumbnails are dealt with.

[0] the wmf-specific rewrite middleware in the midst of swift is a source of bugs, and makes it harder for us to consider alternative object storage solutions.

Wrt pre-rendering: I'm assuming that is wgUploadThumbnailRenderMap at work. There are details on the various sizes in use at: https://www.mediawiki.org/wiki/Common_thumbnail_sizes. It is indeed a bit of a mess.
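
For reference, that setting is just a list of widths that ThumbnailRenderJob pre-renders after an upload; a sketch using the sizes that show up in the swift listings later in this task (the actual production value may differ):

// Sketch only: widths (px) pre-rendered on upload via ThumbnailRenderJob.
// These match the sizes visible in the codfw listing further down this
// task; the real production list may differ.
$wgUploadThumbnailRenderMap = [ 320, 640, 800, 1024, 1280 ];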

In T331138#8675245, @Joe wrote:

Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) then it should happen in all datacenters, not just the one where swift is "master".

I have a feeling @Ladsgroup [on leave all month] did look at this when we adjusted the sizes of some of the thumbs we were pre-generating and using (to make them a bit more standard).

(I'm not here, don't tell my manager 🙈)
I did take some numbers and provided some suggestions in https://phabricator.wikimedia.org/T211661#8377883

It is way more complicated than it looks. We have multiple buckets for standard thumb sizes. On top of that, some wikis have different buckets too (though there is only a small number of them). The actual default needs to change as it's too small, but we need to find a way to do that without bringing down the upload cluster and swift. On top of that, the bucket sizes don't support the very small sizes that search thumbnails need, leading to each product picking a random number (I overhauled and unified those, but as a policy/coding convention; they are not in the list of pre-generated thumb sizes).

As I mentioned in that comment, the list of pre-generated thumb sizes is way too permissive and we can get rid of at least three sizes, but removing a size shifts everyone's preferences, as preferences are stored as index values of the array.

I have a detailed list of to-dos we can try to tackle; I'll write them up once I'm properly back. HTH for now.

My bad, the pre-generated sizes are different from the user-preference ones. Two sizes can be dropped from the pre-gen sizes with negligible impact on users, but don't touch the user-preference thumb sizes.
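
To illustrate why the user-preference sizes can't be touched: the thumbsize user option stores an index into $wgThumbLimits rather than a pixel value, so removing an entry silently changes what existing users see. The values below are core's defaults, used here purely for illustration:

// The user preference is an index into this array, not a pixel value.
$wgThumbLimits = [ 120, 150, 180, 200, 250, 300 ];
$thumbsize = 4;                    // a user who picked 250px
echo $wgThumbLimits[$thumbsize];   // 250

// If 180 were removed, the array shifts and the same stored preference
// silently becomes 300px:
$wgThumbLimits = [ 120, 150, 200, 250, 300 ];
echo $wgThumbLimits[$thumbsize];   // 300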

One further thought - it would be nice if we could take swift's special 404-handler out of the equation, and have our object store be more "vanilla" in operation[0]. So e.g. have thumbnail requests be served from thumbor (or equivalent), and have that service worry about the caching/invalidation/etc. This task might not be the place to address that, but it seemed worth flagging if we're going to expend effort on rethinking how thumbnails are dealt with.

[0] the wmf-specific rewrite middleware in the midst of swift is a source of bugs, and makes it harder for us to consider alternative object storage solutions.

I'm not really sure having thumbor front swift would work in any reasonable way. We'd have to write a service that would more or less do what the 404 handler does today, or am I missing something?

In T331138#8675246, @Joe wrote:

@thcipriani what does the Unstewarded-production-error tag mean, in practice? Is there a process to get someone to look into this?

I'm not sure I was able to explain how serious this problem is, and how legally dangerous it is not to fix it.

When trying to find someone to look into production issues I try to get a team tag and a component tag on the task.

I checked the maintainers page on MediaWiki to find a good team tag and came up empty, which is what Unstewarded-production-error indicates: a problem in production that has no official team assigned.

Next step: digging through the changelog/blame to find anyone who may be familiar and tagging specific likely people (which I have not done yet).

Krinkle renamed this task from `Filebackend::Multiwrite`, multi-dc and thumbnail handling to FileBackendMultiWrite multi-dc and thumbnail handling. Mar 10 2023, 5:18 AM
Krinkle updated the task description.

No mention of a production error in this task.

No mention of a production error in this task.

Just to be sure I understood correctly: is the fact that we're not cleaning up stale thumbnails and are still serving them to the public not a production error? Can you suggest a better tag, then?

In T331138#8688891, @Joe wrote:

… is the fact that we're not cleaning up stale thumbnails and are still serving them to the public not a production error? Can you suggest a better tag, then?

Unstewarded-production-error is used to track tasks on Wikimedia-production-error that have no steward. The latter tracks cases where applications respond with HTTP 500 or otherwise produce exceptions and runtime error logs. That tag exists because we use it to monitor site health before and during deployments.

This task sounds to me like a bug. It's observed independently and reported as such. Arguably an important bug, but we have neither a special process nor a special category or tag for "important" bugs. There is the priority field, I suppose, but that's more documentation than process; that is, raising the priority won't change who sees it or who acts on it. If there is no team associated with this bug, or if the team is unresponsive, it'll need to be brought to their attention through other means, e.g. pinging by handle or reaching out on other communication channels.

In T331138#8675245, @Joe wrote:

Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) …

Yes.

Wikipedia probably doesn't really need pre-generated thumbnails, because an article there has few images in total and most of them are outside the viewport on page load (and they can be async / lazy-loaded in most scenarios).

But things like Proofread Page have image "thumbnails" (think 1536px) as part of a core workflow, and extracting a thumbnail from a DjVu or PDF file can take up to ~25 seconds. Page images from the DjVu/PDF are accessed sequentially, so those 25 seconds are effectively added to the page load times (where otherwise the perf team cares about single-digit milliseconds). Pre-generating all the thumbnails for a given PDF/DjVu file would be a massive (several orders of magnitude) improvement for Wikisource, and would also obviate the need for all the in-extension and client-side hacks currently used to prefetch the next image in the sequence, which are causing weirdness like unvisited pages being marked as read on the watchlist.

You also have use cases like "gallery" type pages (including special pages that list files) that try to display a few hundred thumbnails in one go, and currently rely on client-side throttling to avoid killing the server. These rarely manage to load without two thirds showing as broken image icons (think the web circa the late nineties). Being able to statically serve these thumbs might make it possible to actually display such pages.

Proofread Page has the property that once transcription of the pages is done, the thumbnails will probably not be accessed again within any time horizon we care about, and as such they can and should expire. But something gallery-like (Special:UnusedFiles or an actual gallery page on Commons) can be high-traffic and frequently accessed, so those thumbs should be persisted / regenerated in a way comparable to how the wikitext is rendered.

In both of these it's probably some other component that needs to request pre-generation of thumbs at the relevant sizes, rather than the multimedia stack pre-generating for every media file in general, but the facility for it needs to live somewhere related to Swift/Thumbor/etc.

PS. Found my way to this task because it is presumably the reason I was getting 404s for page thumbs on a newly uploaded DjVu file, where adding 1px to the requested size made it load fine. So... end-user visible and with actual impact is what I'm saying.

I checked maintainer's page on MediaWiki to find a good team tag and came up empty

So far as I know, after the Multimedia team was disbanded a few years ago, the entire multimedia stack (including everything being discussed in this task) became entirely unsupported. Nobody owns multimedia, despite it being core to every single project the WMF hosts and to our GLAM partnerships, and despite it being a right impressive part of the infrastructure costs (bandwidth, storage, even CPU I would guess).

This fact has been noted in incident reports, code stewardship reviews, and other venues with no apparent reaction or corrective measure afterwards.

In T331138#8698560, @Xover wrote:

I checked maintainer's page on MediaWiki to find a good team tag and came up empty

So far as I know, after the Multimedia team was disbanded a few years ago, the entire multimedia stack (including everything being discussed in this task) became entirely unsupported. Nobody owns multimedia, despite it being core to every single project the WMF hosts and to our GLAM partnerships, and despite it being a right impressive part of the infrastructure costs (bandwidth, storage, even CPU I would guess).

This fact has been noted in incident reports, code stewardship reviews, and other venues with no apparent reaction or corrective measure afterwards.

At the risk of going off-topic, but posting this so that this particular thread can continue productively: @Xover, I would suggest commenting on https://commons.wikimedia.org/wiki/Commons_talk:Think_big_-_open_letter_about_Wikimedia_Commons#WMF_response_to_the_Wikimedia_Commons_open_letter or https://commons.wikimedia.org/wiki/Commons_talk:Product_and_technical_support_for_Commons_2022-23 and pinging people listed as involved staff to discuss the overall support level for the stack.

@KOfori FYI, this (and T330942) is the latest of the usual weekly MediaWiki file workflow bugs (read above), as you inquired about them recently.

There are still ongoing issues around thumbs; I noticed this morning that purging an image on commons (i.e. adding ?action=purge to the commons page) regenerated some of the thumbs in eqiad and strangely just one size of thumb in codfw...

The issue is exactly the problem outlined above.

People have been complaining that T330942 is not resolved, but that one is fixed: e.g. if you go to eqiad, the "master" file backend is indeed swift in the primary dc (codfw), whereas in the earlier broken case it was the local swift (eqiad, T330942#8660782).

Here is an example. I tried using https://test.wikipedia.org/wiki/File:Wikitech-2021-blue-large-icon_(copy).png to debug (note that mwdebug1002 below is in eqiad):

ladsgroup@mwdebug1002:~$ mwscript eval.php --wiki=testwiki --d 3
[debug] [memcached] MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MainWANObjectCache using store MemcachedPeclBagOStuff
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 0/testwiki
[debug] [rdbms] Wikimedia\Rdbms\LBFactory::getChronologyProtector: request info {
    "IPAddress": "127.0.0.1",
    "UserAgent": false,
    "ChronologyProtection": false,
    "ChronologyPositionIndex": 0,
    "ChronologyClientId": false
}
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::loadSessionPrimaryPos: executed chronology callback.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: connecting to db1112...
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 4/
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getReaderIndex: using server db1112 for group ''
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 4/testwiki
> $backend = MediaWiki\MediaWikiServices::getInstance()->getFileBackendGroup()->get( 'local-multiwrite' );

> $iterator = $backend->getFileList( [ 'dir' => "mwstore://local-multiwrite/local-thumb/2/2a/Wikitech-2021-blue-large-icon_(copy).png" ] );

> foreach ( $iterator as $file ) { var_dump( $file ); }
[debug] [FileOperation] HTTP start: GET https://ms-fe.svc.codfw.wmnet/auth/v1.0
[debug] [FileOperation] HTTP complete: GET https://ms-fe.svc.codfw.wmnet/auth/v1.0 code=200 size=0 total=0.108473 connect=0.033915
[debug] [FileOperation] HTTP start: GET https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-test-local-thumb
[debug] [FileOperation] HTTP complete: GET https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-test-local-thumb code=200 size=xxx total=0.065867 connect=0.000053
string(47) "1024px-Wikitech-2021-blue-large-icon_(copy).png"
string(47) "1280px-Wikitech-2021-blue-large-icon_(copy).png"
string(46) "320px-Wikitech-2021-blue-large-icon_(copy).png"
string(46) "640px-Wikitech-2021-blue-large-icon_(copy).png"
string(46) "800px-Wikitech-2021-blue-large-icon_(copy).png"

You can see it's hitting codfw (the primary dc right now).

But the thumb that is not updating is 120px and only exists in eqiad: https://upload.wikimedia.org/wikipedia/test/thumb/2/2a/Wikitech-2021-blue-large-icon_%28copy%29.png/120px-Wikitech-2021-blue-large-icon_%28copy%29.png?20230413110558

root@ms-fe1009:~# swift list --prefix "2/2a/Wikitech-2021-blue-large-icon_(copy).png" wikipedia-test-local-thumb
2/2a/Wikitech-2021-blue-large-icon_(copy).png/1200px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/120px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/600px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/800px-Wikitech-2021-blue-large-icon_(copy).png

vs.

root@ms-fe2009:~# swift list --prefix "2/2a/Wikitech-2021-blue-large-icon_(copy).png" wikipedia-test-local-thumb
2/2a/Wikitech-2021-blue-large-icon_(copy).png/1024px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/1280px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/320px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/640px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/800px-Wikitech-2021-blue-large-icon_(copy).png

Basically what happens is this: a user hits a thumbnail in eqiad but not codfw (by uploading/loading/etc. via the secondary datacenter). On reupload, MediaWiki fetches a list of the image's thumbnails in order to delete them, but that list only comes from the primary dc's swift (it used to come from the dc-local swift, which was worse). I assume the deletions do get propagated, but if a thumbnail doesn't show up in the listing of the primary dc's thumb container, it won't be deleted at all (because it doesn't exist there), and the outdated copy stays in the secondary dc. A simplified sketch of this pre-fix behavior follows.
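
The member names below follow the general shape of FileBackendMultiWrite, but this is a sketch, not the real class body:

// Simplified sketch: reads, including getFileList(), consult only the
// master backend, so a thumbnail that exists only in the secondary DC
// is invisible to the purge logic and never gets deleted there.
class SimplifiedMultiWrite {
    /** @var FileBackendStore[] */
    protected array $backends;
    /** @var int Index of the "master" backend within $backends */
    protected int $masterIndex;

    public function getFileList( array $params ) {
        // Only the primary-DC backend is ever asked for the listing
        return $this->backends[$this->masterIndex]->getFileList( $params );
    }
}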

Solutions:

  • Make thumb generation replicate in both directions. That can be done in swift and shouldn't be too hard to implement, I think (I hope). (Databases do this for some specific cases, e.g. ParserCache and x2.)
  • Add a job to delete thumbs in the secondary dc. This would be complicated: I'm not sure we run jobs in the secondary dc, and even if we run it in the primary dc hitting the remote swift, there is always the chance of a race condition with users or with ThumbnailRenderJob. So I prefer the first option.
  • Delegate deletion of thumbs to swift and replicate that: instead of discovering all thumbnails (which it is clearly not doing a good job of), make mw issue a command to swift to delete any thumb it has under the given file name, and have swift replicate that command to the secondary dc.
  • ?

Also, we really should get rid of three of the pre-gen thumb sizes.

So, there still is no way to fix an existing thumb?

So, there still is no way to fix an existing thumb?

Until the actual issue is resolved, the community can request deletion of specific outdated thumbnails somewhere (maybe a new ticket?) and I can delete them manually, but this is clearly not sustainable.

Ouch. But if just deleting old thumbs works, does this mean one can just add a pixel to the image and reupload it now, and it will work this time?

You can do something simpler: if after a reupload e.g. the 200px thumb size is not updated, just use a 205px thumb size instead; that should trigger generation of a new thumbnail.

That does not help; the bottom image still has labels on the stations instead of near them, as above, at specific thumb sizes.

Indeed, but how do you know which sizes work and which don't?

Trial and error, but again: if a size hasn't been generated already, it will be generated correctly, so any non-obvious size should work. If you have 180px, then 181px has very likely never been used before.

In the meantime, I'll try to make a patch to fix the underlying problem.

It works, thanks a lot. I think we should explain this trick in the Tech News.

I also think it's important to know. Please publish it in Tech News.

Something like

There is a problem with the creation of thumbnails for SVG files. If you see an old version of an image instead of the current one, change the size to a fresh one, for example 250px -> 251px. [[phab:T331138]]

maybe? My English is pretty bad, so maybe it should be edited.

But you can't change the size if the thumbnail is listed on category pages or in the Commons file version list. And you can't change all the thumbnails if the newly uploaded file version is already linked in hundreds of articles globally.

Changing the thumbnail size alone is not really a solution.

It's not just SVGs, it's all file types; the example I gave was a PNG file. And it only happens to half of the traffic, depending on geographical location.

Regardless, I'm hoping to fix this ASAP. Give me a bit.

Change 908632 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] filebackend: Find thumbnails from all backends in FileBackendMultiWrite

https://gerrit.wikimedia.org/r/908632

The patch above fixes the problem, tested on mwdebug2001. Now I need someone to review and merge it; I'll deploy it on Monday if it gets merged by then. Conceptually, the fix boils down to the sketch below.
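
Simplified here, continuing the SimplifiedMultiWrite sketch from an earlier comment rather than quoting the literal patch: the fix unions the listings from every backend so a thumbnail that exists in only one datacenter still shows up and gets purged everywhere.

// Simplified sketch of the fixed behavior: merge file listings across
// all wrapped backends instead of asking only the master.
public function getFileList( array $params ): array {
    $files = [];
    foreach ( $this->backends as $backend ) {
        foreach ( $backend->getFileList( $params ) ?? [] as $file ) {
            $files[$file] = true; // de-duplicates names present in both DCs
        }
    }
    return array_keys( $files );
}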

@Ladsgroup - Is there any way that folks can manually purge thumbnails that didn't get regenerated (besides reuploading the images)? Apparently action=purge doesn't do the trick. I imagine giving users blanket thumbnail purging capabilities would be a security concern, but maybe if it were limited to their own uploads it would be reasonable.

action=purge on the file description page is *supposed* to work; it just sometimes doesn't, for reasons that are impossible to determine from the outside.

From what I could determine, action=purge on a file description page has a visible result in at most 40% of all usages, which is very little. And purging more than once doesn't help either.

I think the reason is that the European server cluster gets stuck, but I don't know for sure.

Note that the fix hasn't been deployed yet (we need someone to review and merge it and then I can deploy the fix), once it's deployed, we can check if action=purge would help or not. But right now, you'd hit the same bug.

Change 908632 merged by jenkins-bot:

[mediawiki/core@master] filebackend: Find thumbnails from all backends in FileBackendMultiWrite

https://gerrit.wikimedia.org/r/908632

Change 908959 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.4] filebackend: Find thumbnails from all backends in FileBackendMultiWrite

https://gerrit.wikimedia.org/r/908959

Change 908959 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.4] filebackend: Find thumbnails from all backends in FileBackendMultiWrite

https://gerrit.wikimedia.org/r/908959

Mentioned in SAL (#wikimedia-operations) [2023-04-17T10:51:49Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:908959|filebackend: Find thumbnails from all backends in FileBackendMultiWrite (T331138)]]

Mentioned in SAL (#wikimedia-operations) [2023-04-17T10:53:06Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:908959|filebackend: Find thumbnails from all backends in FileBackendMultiWrite (T331138)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-04-17T10:59:06Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:908959|filebackend: Find thumbnails from all backends in FileBackendMultiWrite (T331138)]] (duration: 07m 16s)

Ladsgroup claimed this task.

This should be fixed now for new reuploads. Old reuploads might be fixed via action=purge (emphasis on "might"); if not, try a new reupload. If that doesn't work either, you can send us the file name and size and we'll try to purge it directly from swift.

Also, if even a reupload doesn't seem to fix it, try purging your browser cache (Ctrl+Shift+R).

Yep, purge helped for File:Druhý Kirchhoffův zákon.png which I had reported at T333679. Thanks!

That file hadn't had a reupload since 2009. My guess is that this is the rsvg problem (T193352: Update librsvg to ≥2.42.3 (2.44.10)), which hopefully will be fixed soon by the thumbor upgrade.

@Ladsgroup I know that it hasn't been reuploaded since 2009. And it is really strange, because the main issue with this file is not that it is generated poorly, but that for some reason it has two different versions at the same size. Here, for example, at 120 pixels:

Wikitext-ru.svg.png (40×120 px, 1 KB)
Wikitext-ru.svg (1).png (40×120 px, 918 B)

I checked, and that's not an issue related to this ticket: the thumbnails are properly cleaned up after action=purge but incorrectly re-thumbnailed by thumbor. Thumbor has a lot of issues with thumbnails at different sizes, my favorite being T325940, which is still not fixed to this day (https://en.wikipedia.org/w/index.php?title=User:Ladsgroup/sandbox&oldid=1135228452).

Please file a bug against thumbor. The new version will be properly pooled and will replace the old one soon; that might change things.