
Allow to store files between 4 and 5 GB
Closed, ResolvedPublic

Description

Currently, MediaWiki can use a Swift backend to store files. Out of the box, without any provision for large objects, Swift is able to store files up to 5 GB.

A file size limit currently exists at 2^32 bytes (4 GiB).

It would be convenient to allow MediaWiki to store files up to 5 GB when the Swift backend is used.

For that, we should ensure file sizes are stored as 64-bit integers, not 32-bit ones.
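
For illustration only, the required widening looks roughly like this (MariaDB syntax; image.img_size from MediaWiki core is used here just as an example, and the full set of affected tables is what the actual schema change covers):

  -- Sketch: store the file size as a 64-bit integer so values above
  -- 2^32 - 1 bytes (just under 4 GiB) can be represented.
  ALTER TABLE image
    MODIFY img_size BIGINT UNSIGNED NOT NULL DEFAULT 0;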

See also enwiki: Village pump (proposals) - RfC: Increasing the maximum size for uploaded files

Event Timeline

Restricted Application added a subscriber: Aklapper.

@MatthewVernon I think you're the right person to ask. With work being done to make MediaWiki no longer limited to 4 GB files, I was wondering what SRE's opinion is on increasing the Commons file size limit to 5 GB. My understanding is that this is the limit for non-split-up files in Swift. If the MediaWiki limitations were removed, do you think it would be reasonable to increase the limit at Wikimedia? Having mildly larger files would presumably put more pressure on our media storage/serving infrastructure, although I would expect the number of 5 GB files to be fairly limited.

Long term, it would be great to support really big files with Swift large objects, but 5 GB seems like a small improvement we can make almost today with very little work.

There is also the middle-ground option of not allowing users to upload 5 GB files in general, but still allowing server-side uploads of such files via importImages.php.

Bugreporter renamed this task from Allow to store files between 4 and 5 Gb to Allow to store files between 4 and 5 GB.Oct 15 2023, 3:03 PM
Bugreporter updated the task description.

As a note on current status, as part of T191805, MediaWiki will now accept files up to 5 GB with Swift. $wgMaxUploadSize is 4 GiB, so this only affects files uploaded from the command line via importImages.php. Additionally, the schema change (T348183) is not deployed yet.

Please loop me in on the progress. While this doesn't affect production, I may have assumed in some cases that files were always smaller than 4 GB for backups, and I may need to review storage compatibility, even if it is just applying the same schema change to the backup metadata.

@AlexisJazz While we are happy that you are excited about this, it is far from ready for discussion. Developers have just landed the code, but it still requires a lot of preparation and discussion before it can be implemented at WMF by system administrators, due to the scale of operations: there are many open questions regarding the extra Swift space needed, backup compatibility, schema change deployment, and other work.

While community input is very much welcome, I feel that asking for it is not OK at the moment: no matter how strongly a vote came out in favour, this cannot be enabled right now. I would suggest that you pause any discussion on wiki to avoid disappointment. No amount of support will make technical problems get solved faster, and I feel you should wait for the option to be available on our servers first (it is not right now, and it won't be until all engineers agree it is ready). For example: storing more data requires more disk space, which requires a budget to buy more servers, and that is usually approved at the end of the fiscal year, if approved at all. Sorry, but things take time.

Please understand that the title "Allow to store files between 4 and 5 GB" refers to the technical ability (and the preparation needed for that), not community consensus.

@jcrespo thanks for letting me know. I misunderstood Bawolff's comment.

Well, I can partially answer one of your open questions. You won't really need extra space for English Wikipedia. The largest file on enwiki is currently just shy of 300 MB. I'd be surprised if even 20 feature films transcoded to >4 GB were uploaded over the course of a whole year. 20 GB (say 30 GB including MediaWiki transcodes) is probably a rounding error for you anyway. If I had believed my proposal would result in a substantial increase in the storage space needed, I would have asked you first, but the use case for >4 GB files on enwiki, while there certainly is one, is limited in quantity. It could be more substantial if enwiki suddenly decided to mass-upload all the PD-USonly feature films that can be found, but that would be more substantial regardless of whether the limit is 4 GB or 5 GB.

Even on Commons I doubt you'll notice the impact. Commons currently has about 660 files over 3 GB; about 435 of those are over 3.5 GB. The oldest one is from 2012: https://commons.wikimedia.org/wiki/File:2012_State_Of_The_Union_Address_(720p).ogv and the second oldest is from 2016. I'd estimate that 500 GB/year extra will probably cover it for the next few years. 500 GB (say 750 GB including MW transcodes) is less than 1% of the storage I have at my disposal. I'd be severely worried if you couldn't handle it. (I know your storage costs way more per GB, with backups, redundancy, cache, etc., but still!)

Drop me a note on my talk page on enwiki when this is available on betacommons if you need help testing.

Please loop me in on the progress. While this doesn't affect production, I may have assumed in some cases that files were always smaller than 4 GB for backups, and I may need to review storage compatibility, even if it is just applying the same schema change to the backup metadata.

To clarify, do you just want to be informed, or is backup support blocking this? The reason I ask is that one potential implementation path (nothing has been decided or even really talked about yet) is to initially allow uploading a few large files on special request before allowing it in general. If so, that might happen quicker than you think for a small number of files, as there are fewer capacity issues if it's just special cases. To emphasize: nothing has been decided; that is just one potential path.


Re Alexis: yeah, this is too early for community consensus. However, in the event that any community members have objections to increasing the limit, please let me know. I don't expect this to be a controversial change when it eventually happens, but if there are any problems/objections I would like to know early.

Thank you, @AlexisJazz, that's useful feedback that will without doubt make our media storage happy; still, there are additional technical operations and challenges to overcome. Cost is not so much the concern (especially for enwiki's needs); we are more concerned about Commons, with its almost half a PB of storage. Servers still have to be purchased, racked and installed, data resharded, and everything planned, and that takes some time; it is not just a question of "buying larger disks". :-D

Please stay subscribed for more updates.

is backup support blocking this

It is not a hard blocker, as I am guessing not a lot of files will be uploaded soon; they may just fail to be backed up, and we can retry them later, but I would like to fix that ASAP, and as I didn't know this was ongoing it was a bit of sudden news. Nothing that cannot be solved: I believe the MinIO maximum object size is 50 TB, so only the metadata storage has to be reviewed. If file sizes later grow beyond that, it may impact storage decisions in the future.

Indeed, the same schema change as for production has to be applied to the backup metadata, as we mirrored the size from MediaWiki as an unsigned int:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/mediabackups/+/refs/heads/master/sql/mediabackups.sql#99

Please give me until next week to apply that schema change; it should be easy.
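
As a rough sketch, the migration is the same kind of column widening (the table and column names below are illustrative only; the authoritative definitions are in the linked mediabackups.sql):

  -- Illustrative only: widen the mirrored size column from a 32-bit to a
  -- 64-bit unsigned integer, matching the MediaWiki-side change.
  ALTER TABLE files
    MODIFY size BIGINT UNSIGNED;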

Change 973364 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/mediabackups@master] sql: Migrate mediabackups metadata size from int to bigint

https://gerrit.wikimedia.org/r/973364

Change 973364 merged by Jcrespo:

[operations/software/mediabackups@master] sql: Migrate mediabackups metadata size from int to bigint

https://gerrit.wikimedia.org/r/973364

Mentioned in SAL (#wikimedia-operations) [2023-11-17T11:10:08Z] <jynus> running schema change on backup1-eqiad (mediabackups) T191804

Mentioned in SAL (#wikimedia-operations) [2023-11-17T11:20:46Z] <jynus> running schema change on backup1-codfw (mediabackups) T191804

This is now deployed and the media backups schema is up to date. Media backups are flowing as usual. I am no longer a blocker here.

As a note

My expectation is that the primary usage for a higher limit would be:

  • Long (or HD) videos which previously had to use more aggressive compression to fit in 4GB
  • Videos which previously had to be split up into multiple files

Right now we are averaging 12 files >3 GB a month on Commons (overwritten and deleted files are negligible), and 3 files a month >3.9 GB. [enwiki has 0]

It seems for the most part large files are special cases, and I'm doubtful that increasing the limit will affect much in terms of capacity.

As a note

My expectation is that the primary usage for a higher limit would be:

  • Long (or HD) videos which previously had to use more aggressive compression to fit in 4GB
  • Videos which previously had to be split up into multiple files

Right now we are averaging 12 files >3 GB a month on Commons (overwritten and deleted files are negligible), and 3 files a month >3.9 GB. [enwiki has 0]

It seems for the most part large files are special cases, and I'm doubtful that increasing the limit will affect much in terms of capacity.

There's one more use, I think: the 4K transcode of https://commons.wikimedia.org/wiki/File:Politparade.webm failed, presumably due to the 4 GB limit.

Unrelated: note https://commons.wikimedia.org/wiki/File:Gameplay_0_A.D._Alpha_26_Gefecht_gegen_KI_20221106_Teil_01_von_10.webm, which is under 4 minutes yet is hugging the 4 GB limit. Could it be lossless?

There's one more use, I think: the 4K transcode of https://commons.wikimedia.org/wiki/File:Politparade.webm failed, presumably due to the 4 GB limit.

More likely it hit a timeout; the lower-resolution transcodes were already taking 9 hours. However, if the transcode was >4 GB, then the FileStoreRepo changes would fix it. Anyway, this task won't fix everything about large video files; there are still aspects that are going to be shaky for very large files. [Edit: after resetting the transcode it worked. The transcode is 3.88 GB, which is right on the edge. Perhaps it would have been over in an earlier version of the transcoding software.]

Unrelated: note https://commons.wikimedia.org/wiki/File:Gameplay_0_A.D._Alpha_26_Gefecht_gegen_KI_20221106_Teil_01_von_10.webm, which is under 4 minutes yet is hugging the 4 GB limit. Could it be lossless?

It's 60 fps 4K video. If it were lossless I'd expect it to be a lot more than 4 GB. Given it's from a video game, maybe some sort of streaming setup was used where the video was encoded live; those setups often trade latency for less efficient compression. The bitrate is 151 Mbps. I believe the normal bitrate for 60 fps 4K video is usually around 60 Mbps, so it's only about triple normal.

There's one more use, I think: the 4K transcode of https://commons.wikimedia.org/wiki/File:Politparade.webm failed, presumably due to the 4 GB limit.

More likely it hit a timeout; the lower-resolution transcodes were already taking 9 hours. However, if the transcode was >4 GB, then the FileStoreRepo changes would fix it. Anyway, this task won't fix everything about large video files; there are still aspects that are going to be shaky for very large files.

How about both? The 1440p VP9 transcode is already 3.9 GB.

Some transcodes that would result in a >4 GB but <5 GB file are probably failing now but should succeed once the limit is raised. This might result in a small sudden bump in storage use (though given the number of existing large videos it's probably still a drop in the bucket).

Unrelated: note https://commons.wikimedia.org/wiki/File:Gameplay_0_A.D._Alpha_26_Gefecht_gegen_KI_20221106_Teil_01_von_10.webm, which is under 4 minutes yet is hugging the 4 GB limit. Could it be lossless?

It's 60 fps 4K video. If it were lossless I'd expect it to be a lot more than 4 GB.

I'm unsure how efficient VP9 lossless encoding is; I've never tried it. I worked with lossless video in the past; IIRC that was ~1 GB/minute for SD video using Huffyuv.

But since it's footage of a video game that's not in constant motion, if the codec only stores image data that changed compared to the previous frame and/or compresses multiple frames using the same dictionary, I suspect 1 GB/minute might be possible. Also, video game footage in this particular case possibly compresses better than live-action footage.

Given it's from a video game, maybe some sort of streaming setup was used where the video was encoded live; those setups often trade latency for less efficient compression. The bitrate is 151 Mbps. I believe the normal bitrate for 60 fps 4K video is usually around 60 Mbps, so it's only about triple normal.

I've analyzed some screenshots and you're right: there are some compression artifacts visible around the text when zoomed in. Maybe the high bitrate is on purpose after all; this game has a lot of detail and sharp edges (and also includes some text), which would be quite susceptible to degradation from lossy compression.

I think it's fair to say twelve 5 GB files a month would not be overwhelming (about 2 TB of raw capacity per cluster per year given 3x replication, cf. our current growth rate of very approximately 120 TB/year), and the underlying filesystems that Swift sits upon could cope with some 5 GB objects.
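
As a back-of-the-envelope check of that figure: 12 files/month × 12 months × 5 GB ≈ 720 GB of new originals per year, which at 3x replication comes to roughly 2.2 TB of raw capacity per cluster.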

Server-side uploads should now work up to 5 GB.

I think we should upload a few >4 GB files via server-side upload just to make sure it all works, before enabling it for chunked upload.
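
One way to spot-check afterwards, as a sketch (column names per MediaWiki core's image table; assumes access to a wiki replica):

  -- Sketch: list stored originals larger than 4 GiB to confirm the new
  -- 64-bit sizes round-trip correctly.
  SELECT img_name, img_size
  FROM image
  WHERE img_size > 4294967295  -- 2^32 - 1 bytes
  ORDER BY img_size DESC;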

Change 1002813 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[operations/mediawiki-config@master] Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB).

https://gerrit.wikimedia.org/r/1002813

Change 1002813 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB).

https://gerrit.wikimedia.org/r/1002813

Mentioned in SAL (#wikimedia-operations) [2024-02-13T08:19:58Z] <hashar@deploy2002> Started scap: Backport for [[gerrit:1002813|Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). (T191804)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-13T08:21:41Z] <hashar@deploy2002> hashar and bawolff: Backport for [[gerrit:1002813|Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). (T191804)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-13T08:28:55Z] <hashar@deploy2002> Finished scap: Backport for [[gerrit:1002813|Increase $wgMaxUploadSize to 5 GiB (previously was 4GiB). (T191804)]] (duration: 08m 57s)

Bawolff claimed this task.

@Bawolff
Re: Tech News - What wording would you suggest as the content, and when should it be included? Thanks!

@Bawolff
Re: Tech News - What wording would you suggest as the content, and when should it be included? Thanks!

"The maximum file size when using Upload Wizard is now 5 GiB."

The change has been deployed to all wikis; it can go out anytime.