
Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment
Closed, Resolved · Public · PRODUCTION ERROR

Description

Following promotion of 1.32.0-wmf.10 to group1, an increase in error rate was noticed, specifically errors due to lock wait timeout:

From logstash:

@timestamp	2018-06-27T19:26:01
@version	1
_id	AWRCtbW88VFZOIjH7yZi
_index	logstash-2018.06.27
_score	1
_type	mediawiki
caught_by	mwe_handler
channel	exception
exception.class	Wikimedia\Rdbms\DBQueryError
exception.code	0
exception.file	/srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php:1443
exception.message	A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
Query: INSERT  INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('703487226','22051637')
Function: CommentStore::insertInternal
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.26)
exception.trace	#0 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(1413): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(1186): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(2010): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.10/includes/CommentStore.php(549): Wikimedia\Rdbms\Database->insert(string, array, string)
#4 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(668): Closure$CommentStore::insertInternal(integer)
#5 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(476): MediaWiki\Storage\RevisionStore->insertRevisionRowOn(Wikimedia\Rdbms\DatabaseMysqli, MediaWiki\Storage\MutableRevisionRecord, Title, integer)
#6 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(417): MediaWiki\Storage\RevisionStore->insertRevisionInternal(MediaWiki\Storage\MutableRevisionRecord, Wikimedia\Rdbms\DatabaseMysqli, User, CommentStoreComment, Title, integer, integer)
#7 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(3726): Closure$MediaWiki\Storage\RevisionStore::insertRevisionOn(Wikimedia\Rdbms\DatabaseMysqli, string)
#8 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(49): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$MediaWiki\Storage\RevisionStore::insertRevisionOn;3701)
#9 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(523): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#10 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(419): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure$MediaWiki\Storage\RevisionStore::insertRevisionOn;3701)
#11 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/PageUpdater.php(953): MediaWiki\Storage\RevisionStore->insertRevisionOn(MediaWiki\Storage\MutableRevisionRecord, Wikimedia\Rdbms\DBConnRef)
#12 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/PageUpdater.php(724): MediaWiki\Storage\PageUpdater->doModify(CommentStoreComment, User, integer)
#13 /srv/mediawiki/php-1.32.0-wmf.10/includes/page/WikiPage.php(1845): MediaWiki\Storage\PageUpdater->saveRevision(CommentStoreComment, integer)
#14 /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/repo/includes/Store/Sql/WikiPageEntityStore.php(294): WikiPage->doEditContent(Wikibase\ItemContent, CommentStoreComment, integer, integer, User)
#15 /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/repo/includes/Store/Sql/WikiPageEntityStore.php(191): Wikibase\Repo\Store\WikiPageEntityStore->saveEntityContent(Wikibase\ItemContent, string, User, integer, integer)
#16 /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/lib/includes/Store/TypeDispatchingEntityStore.php(88): Wikibase\Repo\Store\WikiPageEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, integer)
#17 /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/repo/includes/EditEntity.php(733): Wikibase\Lib\Store\TypeDispatchingEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, integer)
#18 /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/repo/includes/Api/EntitySavingHelper.php(351): Wikibase\EditEntity->attemptSave(Wikibase\DataModel\Entity\Item, string, integer, string)
#19 /srv/mediawiki/php-1.32.0-wmf.10/extensions/Wikibase/repo/includes/Api/ModifyEntity.php(309): Wikibase\Repo\Api\EntitySavingHelper->attemptSaveEntity(Wikibase\DataModel\Entity\Item, string)
#20 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiMain.php(1584): Wikibase\Repo\Api\ModifyEntity->execute()
#21 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiMain.php(535): ApiMain->executeAction()
#22 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiMain.php(506): ApiMain->executeActionWithErrorHandling()
#23 /srv/mediawiki/php-1.32.0-wmf.10/api.php(83): ApiMain->execute()
#24 /srv/mediawiki/w/api.php(3): include(string)
#25 {main}
exception_id	WzPkuQpAIDgAACAhg1QAAADO
exception_url	/w/api.php?origin=https%3A%2F%2Fko%2Ewikipedia%2Eorg
host	mw1344
http_method	POST
ip	10.64.0.104
level	ERROR
message	[WzPkuQpAIDgAACAhg1QAAADO] /w/api.php?origin=https%3A%2F%2Fko%2Ewikipedia%2Eorg   Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
Query: INSERT  INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('703487226','22051637')
Function: CommentStore::insertInternal
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.26)
message_checksum	4b76b8a0b1adfffdc7502b77b4b1fae5
mwversion	1.32.0-wmf.10
normalized_message	[{exception_id}] {exception_url}   Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema upd
referrer	https://ko.wikipedia.org/
reqId	WzPkuQpAIDgAACAhg1QAAADO
server	www.wikidata.org
shard	s8
tags	syslog, es, es
type	mediawiki
unique_id	WzPkuQpAIDgAACAhg1QAAADO
url	/w/api.php?origin=https%3A%2F%2Fko%2Ewikipedia%2Eorg
wiki	wikidatawiki

Event Timeline


There were 1770 errors in that one hour; only 240 of them have Wikibase in the stack trace, so this is not really Wikidata-related. They do all target revision_comment_temp.

It would be nice to know where the other leg of that lock is. In theory there should only be a conflict when two processes try to add a comment for the same revision, which should not be possible.

Assuming this was not a case of "It compiles, ship it!" I am curious as to why this wasn't noticed when testing.

Is that stack trace representative? Is it always INSERT INTO revision_comment_temp?

It's quite possible that this was caused by the MCR patches, but so far, I don't have any clue as to how or why.

From what I can see, it was not related only to that table and that write; there are lots of others, but INSERT INTO revision_comment_temp (revcomment_rev,revcomment_comment_id) VALUES ('xx','x') appears quite a lot. There are around 1000 errors for that table in that specific timeframe: https://logstash.wikimedia.org/goto/43b11908e6569dd05fc09fe450c9ed06
Examples of other complaints:

[WzQlFwpAEDUAABFY@OYAAACB] /w/api.php   Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: UPDATE  `category` SET cat_pages = cat_pages - 1,cat_files = cat_files - 1 WHERE cat_title IN ('xx',xx) 
Function: WikiPage::updateCategoryCounts
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.77)

Or:

[WzP3@QpAAEoAAJNLTYUAAAAW] /w/api.php   Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: SELECT  page_latest  FROM `page`    WHERE page_id = 'x' AND page_namespace = '2' AND page_title = 'xx'  LIMIT 1   FOR UPDATE
Function: WikiPage::lockAndGetLatest
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.77)

And:

Function: User::loadFromDatabase
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.0.110)
Query: SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount  FROM `user`    WHERE user_id = 'xx'  LIMIT 1   FOR UPDATE

There has always been a slow but steady stream of lock timeouts:
https://logstash.wikimedia.org/goto/3ee85bb6a8d464d99bcb7edc2d024bc0

logstash.wikimedia.org_app_kibana.png (200×1 px, 27 KB)

revision_comment_temp is the only one that spiked after the deploy(*), so presumably the others are unrelated.

(*) Probably. Same graph with revision_comment_temp filtered out:

logstash.wikimedia.org_app_kibana (1).png (200×1 px, 25 KB)

Our logging could be better, though.

From what I can see, it was not related only to that table and that write; there are lots of others, but INSERT INTO revision_comment_temp (revcomment_rev,revcomment_comment_id) VALUES ('xx','x') appears quite a lot. There are around 1000 errors for that table in that specific timeframe: https://logstash.wikimedia.org/goto/43b11908e6569dd05fc09fe450c9ed06

There are even more lock-related errors during this period:

Could not acquire lock 'CategoryMembershipUpdates:3673061'
Function: Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages
Error: 1213 Deadlock found when trying to get lock; try restarting transaction

And:

Function: WikiPage::insertOn
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.48.23)
Query: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES ('6','The_Parkville_in_May_2018.jpg','','0','1','0.604298787837','20180627195904','0','0')

And some timeouts:

Function: IndexPager::buildQueryInfo (LogPager)
Error: 2062 Read timeout is reached (10.64.0.164:3314)
Query: SELECT  log_id,log_type,log_action,log_timestamp,log_namespace,log_title,log_params,log_deleted,user_id,user_name,user_editcount,COALESCE( comment_log_comment.comment_text, log_comment ) AS `log_comment_text`,comment_log_comment.comment_data AS `log_comment_data`,comment_log_comment.comment_id AS `log_comment_cid`,log_user,log_user_text,NULL AS `log_actor`,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`    WHERE ct_log_id=log_id  ) AS `ts_tags`  FROM `logging` LEFT JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id)) LEFT JOIN `user` ON ((user_id=log_user))   WHERE (log_type NOT IN ('spamblacklist','titleblacklist','abusefilterprivatedetails','suppress')) AND ((log_user = '654')) AND ((log_deleted & 4) = 0) AND (log_type != 'thanks') AND (log_type != 'patrol') AND (log_type != 'tag')  ORDER BY log_timestamp DESC LIMIT 51

@Addshore Most of those you point happen all the time, unlike the ones @Marostegui pointed, which are new.

Here's a stack trace for a Commons error, which is more typical (probably; we really need a view of the most frequent stack traces):

#0 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(1413): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(1186): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(2010): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.10/includes/CommentStore.php(549): Wikimedia\Rdbms\Database->insert(string, array, string)
#4 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(668): Closure$CommentStore::insertInternal(integer)
#5 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(476): MediaWiki\Storage\RevisionStore->insertRevisionRowOn(Wikimedia\Rdbms\DatabaseMysqli, MediaWiki\Storage\MutableRevisionRecord, Title, integer)
#6 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(417): MediaWiki\Storage\RevisionStore->insertRevisionInternal(MediaWiki\Storage\MutableRevisionRecord, Wikimedia\Rdbms\DatabaseMysqli, User, CommentStoreComment, Title, integer, integer)
#7 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php(3726): Closure$MediaWiki\Storage\RevisionStore::insertRevisionOn(Wikimedia\Rdbms\DatabaseMysqli, string)
#8 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(49): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$MediaWiki\Storage\RevisionStore::insertRevisionOn;3861)
#9 /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(523): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#10 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/RevisionStore.php(419): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure$MediaWiki\Storage\RevisionStore::insertRevisionOn;3861)
#11 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/PageUpdater.php(953): MediaWiki\Storage\RevisionStore->insertRevisionOn(MediaWiki\Storage\MutableRevisionRecord, Wikimedia\Rdbms\DBConnRef)
#12 /srv/mediawiki/php-1.32.0-wmf.10/includes/Storage/PageUpdater.php(724): MediaWiki\Storage\PageUpdater->doModify(CommentStoreComment, User, integer)
#13 /srv/mediawiki/php-1.32.0-wmf.10/includes/page/WikiPage.php(1845): MediaWiki\Storage\PageUpdater->saveRevision(CommentStoreComment, integer)
#14 /srv/mediawiki/php-1.32.0-wmf.10/includes/EditPage.php(2241): WikiPage->doEditContent(WikitextContent, CommentStoreComment, integer, boolean, User, string, array, integer)
#15 /srv/mediawiki/php-1.32.0-wmf.10/includes/EditPage.php(1529): EditPage->internalAttemptSave(array, boolean)
#16 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiEditPage.php(392): EditPage->attemptSave(array)
#17 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiMain.php(1584): ApiEditPage->execute()
#18 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiMain.php(535): ApiMain->executeAction()
#19 /srv/mediawiki/php-1.32.0-wmf.10/includes/api/ApiMain.php(506): ApiMain->executeActionWithErrorHandling()
#20 /srv/mediawiki/php-1.32.0-wmf.10/api.php(83): ApiMain->execute()
#21 /srv/mediawiki/w/api.php(3): include(string)
#22 {main}

(logstash)
Per the investigation @jcrespo did in P7314, this seems like a conflict between INSERT INTO revision_comment_temp (revcomment_rev, revcomment_comment_id) VALUES (<revision id>, <comment id>) (in CommentStore::insertInternal, on page edit or upload) and SELECT 1 FROM revision, revision_comment_temp WHERE rev_page = <page id> FOR UPDATE (in WikiPage::doDeleteArticleReal, in preparation for page deletion).

The edit process looks like this:

  • lock (FOR UPDATE) the page row
  • insert a comment row
  • insert a revision row
  • insert a revision_comment_temp row with the IDs from the two previous rows
  • ...some more stuff normally, but this is where the exception above was thrown

The delete process looks like this:

  • lock (FOR UPDATE) the page row
  • lock all revision, revision_comment_temp, revision_actor_temp rows belonging to the page
  • copy all those revision rows to archive. (It seems the comments are inserted again in the process. That could be improved but is irrelevant to this task.)
  • delete the page row
  • delete all the revision rows
  • delete all the revision_comment_temp rows
  • delete all the revision_actor_temp rows
  • delete related ip_changes rows
  • insert logging row
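
The edit steps above run as a single transaction, which is why a lock wait timeout on the final revision_comment_temp insert (as in the exception at the top of this task) also discards the already-inserted comment and revision rows. A toy sketch of that atomicity using Python's sqlite3 (heavily simplified schema; SQLite has no FOR UPDATE, so the page-lock steps are omitted):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, comment_text TEXT);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    CREATE TABLE revision_comment_temp (
        revcomment_rev INTEGER PRIMARY KEY,
        revcomment_comment_id INTEGER NOT NULL
    );
""")

try:
    with con:  # one transaction: commit on success, rollback on error
        cur = con.execute("INSERT INTO comment (comment_text) VALUES ('edit summary')")
        comment_id = cur.lastrowid
        cur = con.execute("INSERT INTO revision (rev_page) VALUES (1)")
        rev_id = cur.lastrowid
        con.execute(
            "INSERT INTO revision_comment_temp VALUES (?, ?)", (rev_id, comment_id)
        )
        # Pretend the transaction then failed, e.g. with a lock wait timeout:
        raise sqlite3.OperationalError("1205 Lock wait timeout exceeded")
except sqlite3.OperationalError:
    pass

# Nothing survives: the comment, revision, and revision_comment_temp
# inserts were all rolled back together.
rows = con.execute(
    "SELECT (SELECT COUNT(*) FROM revision) + (SELECT COUNT(*) FROM revision_comment_temp)"
).fetchone()[0]
print(rows)  # 0
```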

In theory this seems safe: both actions lock page (and nothing else, initially) which acts as a semaphore.

The errors in T197464#4321254 are less strange; they all fail in the page lock step. However, starting the process with page locks is not new in wmf.10, so those should be consequences, not causes.

I got the first error when I used Special:Nuke on Wikimedia Commons. Is it possible that something has not been updated with this old tool?

Assuming this was not a case of "It compiles, ship it!" I am curious as to why this wasn't noticed when testing.

It appears to require multiple users performing actions on a wiki simultaneously, and usually when you test a change you're the only user doing anything on your local testing wiki. The errors also did not appear (or at least not in notable numbers) when the changes were deployed to the first set of production wikis on Tuesday (mediawiki.org, test.wp, and test2.wp).

Change 442834 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.32.0-wmf.10] Revert MCR RevisionStore changes

https://gerrit.wikimedia.org/r/442834

I created a patch that reverts the MCR patches related to RevisionStore, but keeps the change that introduces PageUpdater.

We could deploy the branch with the RevisionStore stuff reverted, and see if it still blows up. Whether or not it does, we'll know more about the cause.

Bisecting by potentially bringing the site down is not a great option, but I honestly don't have a better idea. Unless perhaps we had some way to generate sufficient load on a test system to trigger the error.

Assuming this was not a case of "It compiles, ship it!" I am curious as to why this wasn't noticed when testing.

It appears to require multiple users performing actions on a wiki simultaneously, and usually when you test a change you're the only user doing anything on your local testing wiki.

Not just some concurrent users. Massive concurrent load.

It's a performance problem. These are pretty much impossible to test for. We don't have a duplicate of our server infrastructure under full emulated load.

(As this is related to revisions, could the latest examples in T179884, T197464#4321254, T198177 be somehow related consequences of this problem?)

@Aklapper As far as I can see, recent instances of the first two issues (T179884 and T197464#4321254) were probably caused by this. The "overwriting image" one (T198177) seems off. It may still be a consequence somehow, but that issue is about updates to the image table, not about page revisions.

The errors also did not appear (or at least not in notable numbers) when the changes were deployed to the first set of production wikis on Tuesday (mediawiki.org, test.wp, and test2.wp).

If this is really MCR-related, that was on the testwikis for two weeks as wmf.999. There is not a single error from that branch. (Then again there is not a single error from any wiki other than Commons / Wikidata, so all that says is that group0 is not really useful for catching load-related problems.)

The DBPerformance log shows a spike during the time wmf-10 was deployed on group1: https://logstash.wikimedia.org/goto/7c86a7d63a305c220a37a3a49844ef2c. The vast majority of entries are for commonswiki. Here are a few examples:

Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#26770d)]: 
0	0.000436	query-m: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES ('X') [TRX#26770d]
1	0.001476	query-m: INSERT INTO `blobs_cluster2N` (blob_text) VALUES ('X')
2	0.000313	query-m: INSERT INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'X') [TRX#26770d]
3	0.000497	query-m: INSERT INTO `comment` (comment_hash,comment_text,comment_data) VALUES ('X',NULL) [TRX#26770d]
4	0.000425	query-m: INSERT INTO `revision` (rev_page,rev_parent_id,rev_text_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_comment,rev_user,rev_user_text,rev_content_model,rev_content_format) VALUES ('X',NULL,NULL) [TRX#26770d]
5	15.847574	query-m: INSERT INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('X') [TRX#26770d]

This is from https://commons.wikimedia.org/w/index.php?title=File:Portrait_of_Maria_van_Rijswijk_Dutch_School_Rijksdienst_voor_het_Cultureel_Erfgoed_B670.jpg&action=edit

I'm surprised to see blobs_cluster2N in there - ExternalStore's blob tables are generally on a different DB server, no? I'll check whether this may be the cause of the problem somehow.

Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#e7bddd)]: 
0	0.000614	query-m: UPDATE `image` SET img_name = 'X' [TRX#e7bddd]
1	0.000412	query-m: UPDATE `image_comment_temp` SET imgcomment_name = 'X' [TRX#e7bddd]
2	0.000372	query-m: UPDATE `oldimage` SET oi_name = 'X' [TRX#e7bddd]
3	0.000363	query-m: INSERT INTO `comment` (comment_hash,comment_text,comment_data) VALUES ('X',NULL) [TRX#e7bddd]
4	0.000396	query-m: INSERT INTO `revision` (rev_page,rev_parent_id,rev_text_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_comment,rev_user,rev_user_text,rev_content_model,rev_content_format) VALUES ('X',NULL,NULL) [TRX#e7bddd]
5	15.724425	query-m: INSERT INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('X') [TRX#e7bddd]

This was a POST to /w/api.php. Presumably an upload.

Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#06688b)]: 
0	0.031600	query-m: INSERT INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'X') [TRX#06688b]
1	0.000452	query-m: INSERT INTO `revision` (rev_page,rev_parent_id,rev_text_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_comment,rev_user,rev_user_text,rev_content_model,rev_content_format) VALUES ('X',NULL,NULL) [TRX#06688b]
2	16.074089	query-m: INSERT INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('X') [TRX#06688b]

From </w/index.php?title=File:T-Shirt_von_Adrew.jpg&action=submit>. Not sure why this is not also touching the page table. Perhaps it's from the code that creates null-revisions for uploads? But that doesn't fit the request URL.

There are about 1000 errors (in the deploy window, on Commons/Wikidata) where the failing query is on the revision_comment_temp table, and only about 50 where it is not. So I think it is fair to assume that is the primary cause.

After staring at the code a bit, my best guess is:

The MCR refactoring introduced doAtomicSection() in RevisionStore::insertRevisionOn(), to preserve consistency between the revision, slots, and content tables. The atomic section also includes the code for writing the blobs and the code for inserting into the comment table. Previously, insertRevisionOn() did not explicitly start any kind of transaction, though in the context of WikiPage::doEditContent() it has always been used after startAtomic(), so all of this was already covered by a transaction. But this was not necessarily the case when insertRevisionOn() (or Revision::insertOn()) was called directly, e.g. when inserting a null revision during file upload.
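
The practical consequence of widening the atomic section is that locks taken early in it stay held across all the intervening work. That failure mode can be reproduced in miniature with two SQLite connections (an illustrative sketch only: SQLite uses database-level rather than InnoDB-style row-level locks, and its "database is locked" error stands in for MySQL's error 1205):

```python
import os
import sqlite3
import tempfile

# Two connections to the same on-disk database (":memory:" is per-connection).
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
writer = sqlite3.connect(path, isolation_level=None)
other = sqlite3.connect(path, isolation_level=None, timeout=0.1)
writer.execute("CREATE TABLE t (x INTEGER)")

# The first connection opens a transaction and writes, then "stalls",
# standing in for slow work inside a wide atomic section.
writer.execute("BEGIN")
writer.execute("INSERT INTO t VALUES (1)")

# While that transaction is open, a second writer times out waiting for
# the lock: SQLite's analogue of "Lock wait timeout exceeded".
try:
    other.execute("INSERT INTO t VALUES (2)")
    timed_out = False
except sqlite3.OperationalError:  # "database is locked"
    timed_out = True

writer.execute("COMMIT")  # locks are released only here
other.execute("INSERT INTO t VALUES (2)")  # now succeeds
total = other.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(timed_out, total)  # True 2
```

The longer the work between BEGIN and COMMIT, the larger the window in which other writers pile up and time out.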

Here are a few things I poked at, without finding anything relevant:

@Tgr suspects that something is grabbing a FOR UPDATE lock on revision_comment_temp. But the only code that seems to do that is in WikiPage::doDeleteArticleReal(), which shouldn't be called often enough to cause trouble.

Another suspicion was that some code that calls getQueryInfo() on Revision or RevisionStore locks the tables returned by that method. However, getQueryInfo() has always returned revision_comment_temp, and I found no new usages of getQueryInfo() in the diff between wmf8 and wmf10.

Also, it seems to me that in wmf8 revision_comment_temp was already being updated while holding a FOR UPDATE lock on the page table, at least during edits: WikiPage::doModify() calls startAtomic() and lockAndGetLatest(), grabbing a FOR UPDATE lock on the page table. It then calls Revision::insertOn(), which calls RevisionStore->insertRevisionOn(), which calls CommentStore->insertWithTempTable(), which modifies revision_comment_temp.

greg removed dduvall as the assignee of this task. Jun 28 2018, 3:43 PM

Change 442882 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Minimize the work done within atomic section in insertRevisionon().

https://gerrit.wikimedia.org/r/442882

Change 442882 abandoned by Daniel Kinzler:
Minimize the work done within atomic section in insertRevisionon().

Reason:
should not be on master

https://gerrit.wikimedia.org/r/442882

Change 442883 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.32.0-wmf.10] Minimize the work done within atomic section in insertRevisionon().

https://gerrit.wikimedia.org/r/442883

The above patch is an alternative attempt to fix the lock retention issue. It's the best I can think of. Not sure this will fix the problem, though.

Change 442889 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Fix table locking in WikiPage::doDeleteArticleReal

https://gerrit.wikimedia.org/r/442889

$ mwscript shell.php mediawikiwiki
>>> $mainPage = Title::newFromText( 'Main Page' );
>>> $pageId = $mainPage->getArticleId();
>>> $info = Revision::newFromTitle( $mainPage )->getQueryInfo();

>>> $tablesFlat = [];
>>> array_walk_recursive( $info['tables'], function ( $a ) use ( &$tablesFlat ) { $tablesFlat[] = $a; } );

>>> $old = array_intersect( $info['tables'], [ 'revision', 'revision_comment_temp', 'revision_actor_temp' ] );
>>> $new = array_intersect( $tablesFlat, [ 'revision', 'revision_comment_temp', 'revision_actor_temp' ] );

>>> wfGetDb(DB_REPLICA)->select($old, '1', [ 'rev_page' => $pageId ], '', [], $info["joins"])->numRows();
=> 3
>>> wfGetDb(DB_REPLICA)->select($new, '1', [ 'rev_page' => $pageId ], '', [], $info["joins"])->numRows();
=> 298530
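
The jump from 3 rows to 298530 is a Cartesian product at work: once the table list is flattened, the alias-keyed join conditions in $info["joins"] no longer match, so the lock query degenerates into SELECT 1 FROM revision, revision_comment_temp WHERE rev_page = X FOR UPDATE, and every revision_comment_temp row matches (and would be locked). A minimal reproduction of the row-count blowup with Python's sqlite3 (toy schema and sizes; SQLite lacks FOR UPDATE, so only the row counts are shown):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER)")
cur.execute(
    "CREATE TABLE revision_comment_temp "
    "(revcomment_rev INTEGER, revcomment_comment_id INTEGER)"
)

# One revision (with a matching comment row) per page, 100 pages.
for rev_id in range(1, 101):
    cur.execute("INSERT INTO revision VALUES (?, ?)", (rev_id, rev_id))
    cur.execute("INSERT INTO revision_comment_temp VALUES (?, ?)", (rev_id, rev_id))

# With the join condition applied, only the target page's rows are selected.
joined = cur.execute(
    "SELECT 1 FROM revision JOIN revision_comment_temp "
    "ON revcomment_rev = rev_id WHERE rev_page = 1"
).fetchall()

# Comma join with the join condition lost: Cartesian product, so every
# revision_comment_temp row is pulled in.
cartesian = cur.execute(
    "SELECT 1 FROM revision, revision_comment_temp WHERE rev_page = 1"
).fetchall()

print(len(joined), len(cartesian))  # 1 100
```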

Change 442893 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Minimize the work done within atomic section in insertRevisionon().

https://gerrit.wikimedia.org/r/442893

It seems there are two competing patches here, and given the resulting data loss/corruption from yesterday I'm hesitant to roll the train today with anything that only theoretically addresses the root cause.

@Tgr, your patch seems functionally verifiable (and logically addresses the root cause) and you've already shown your work there, but can we also verify the fix somehow under conditions similar to those that surfaced the problem (e.g. delete article load on testwiki)? Or would that even surface it? More generally, do you think we need more time to test your patch?

@daniel, the "Experimental:" in your commit msg gives me pause about rolling it out on 1.32.0-wmf.10. :) How confident are you that it addresses the root cause of yesterday's outage?

The DBPerformance log shows a spike during the time wmf-10 was deployed on group1: https://logstash.wikimedia.org/goto/7c86a7d63a305c220a37a3a49844ef2c. The vast majority of entries are for commonswiki. Here are a few examples:

Sub-optimal transaction on DB(s) [10.64.48.23 (commonswiki) (TRX#26770d)]: 
0	0.000436	query-m: INSERT IGNORE INTO `page` (page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES ('X') [TRX#26770d]
1	0.001476	query-m: INSERT INTO `blobs_cluster2N` (blob_text) VALUES ('X')
2	0.000313	query-m: INSERT INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'X') [TRX#26770d]
3	0.000497	query-m: INSERT INTO `comment` (comment_hash,comment_text,comment_data) VALUES ('X',NULL) [TRX#26770d]
4	0.000425	query-m: INSERT INTO `revision` (rev_page,rev_parent_id,rev_text_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_comment,rev_user,rev_user_text,rev_content_model,rev_content_format) VALUES ('X',NULL,NULL) [TRX#26770d]
5	15.847574	query-m: INSERT INTO `revision_comment_temp` (revcomment_rev,revcomment_comment_id) VALUES ('X') [TRX#26770d]

This is from https://commons.wikimedia.org/w/index.php?title=File:Portrait_of_Maria_van_Rijswijk_Dutch_School_Rijksdienst_voor_het_Cultureel_Erfgoed_B670.jpg&action=edit

I'm surprised to see blobs_cluster2N in there - ExternalStore's blob tables are generally on a different DB server, no? I'll check whether this may be the cause of the problem somehow.

Any DB writes during (not necessarily *in*) the transaction round show up. Note that different handles have different TRX transaction numbers. Handles in auto-commit mode have no TRX number assigned (by MW); note that the blob query has no such number. The list of DBs at the top covers only servers in the transaction round proper (i.e. not in auto-commit mode).

Change 442898 had a related patch set uploaded (by Daniel Kinzler; owner: Gergő Tisza):
[mediawiki/core@wmf/1.32.0-wmf.10] Fix table locking in WikiPage::doDeleteArticleReal

https://gerrit.wikimedia.org/r/442898

@dduvall never mind my patches. The second one won't fix the issue, and the first one is overkill ("revert half the mcr stuff"). It looks like @Tgr got it!

Mentioning for reference: T191892: Reduce locking contention on deletion of pages and T191875#4120050. This is where the code that tgr restored came from.

Change 442893 abandoned by Daniel Kinzler:
Minimize the work done within atomic section in insertRevisionon().

Reason:
Not needed

https://gerrit.wikimedia.org/r/442893

Change 442883 abandoned by Daniel Kinzler:
Minimize the work done within atomic section in insertRevisionon().

Reason:
not needed afterall

https://gerrit.wikimedia.org/r/442883

Change 442834 abandoned by Daniel Kinzler:
Revert MCR RevisionStore changes

Reason:
not needed, luckily

https://gerrit.wikimedia.org/r/442834

Thanks again for mobilizing! And thanks for the fix, @Tgr!

Per our conversation in #wikimedia-releng, I'll plan on rolling this out in two stages today, to (group1 - commonswiki), and then to commonswiki if all goes well. We'll have to play group2 by ear and see how much time we have for thorough verification.

Change 442898 merged by Dduvall:
[mediawiki/core@wmf/1.32.0-wmf.10] Fix table locking in WikiPage::doDeleteArticleReal

https://gerrit.wikimedia.org/r/442898

Change 442889 merged by jenkins-bot:
[mediawiki/core@master] Fix table locking in WikiPage::doDeleteArticleReal

https://gerrit.wikimedia.org/r/442889

For the record, here is when and why I added the offending code:
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/406595/44/includes/page/WikiPage.php#2880

This became necessary because of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/406595/44/includes/Storage/RevisionStore.php#1788, which was removed again as the change matured. Technically, though, doDeleteArticleReal() should not rely on getQueryInfo() returning a flat list of table names; there is no guarantee that it will continue doing that.

I remember discussing how the flattened array would interact with the joins, I think with @Anomie, but I can't find that discussion right now. Apparently we got it wrong in the end.

Another side effect of this issue is a couple hundred broken uploads; see Category:Files_with_no_description_pages_(uploaded_June_27,_2018) for files uploaded without creation of their description pages. The list was compiled with https://quarry.wmflabs.org/query/27897.

I just got [WzU7SwpAME8AAGa@HYEAAAAC] 2018-06-28 19:47:39: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" trying to delete a file.

#metoo: [WzU7GQpAICoAAIy18VQAAABW] 2018-06-28 19:46:51: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

Mentioned in SAL (#wikimedia-operations) [2018-06-28T20:07:08Z] <dduvall@deploy1001> Synchronized php-1.32.0-wmf.10/includes/page/WikiPage.php: Syncing table locking fix (T198350) (duration: 00m 57s)

Another side effect of this issue is a couple of hundred broken uploads; see Category:Files_with_no_description_pages_(uploaded_June_27,_2018) for files uploaded without creation of their description pages. The list was compiled with https://quarry.wmflabs.org/query/27897 .

@Tgr @daniel others: There are some related tasks (e.g. T198177) for this issue mentioned here ^; who will follow up on these and address them?

@Tgr's fix worked. Thanks again, @Tgr! I'll remove this as a blocker but leave it open for tracking related cleanup work.

In T198350#4323192, @greg wrote:

@Tgr @daniel others: There are some related tasks (e.g. T198177) for this issue mentioned here ^; who will follow up on these and address them?

I'm not sure how much can be done about those - usually the issue is that the files do not have a description page because the page edit that would have created it timed out and the transaction got rolled back. AFAIK there is no way to recover the contents of those queries which did not get committed in the end. Where possible, just delete and reupload?

The generic issue is T153565: MediaWiki file operations are fragile, causing occasional data loss - MediaWiki doesn't properly roll back file operations on failure.

As a possible followup, DB error logging could be improved. It would have been convenient if we could have queried the exact DB error or the method where the error got triggered; these things are included in the logstash record, but not in a structured form.
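For illustration, a structured record along these lines would make the error number and calling method directly queryable in logstash. This is only a sketch in Python with hypothetical field names; MediaWiki's actual logging pipeline is PHP-based and may structure things differently:

```python
import json

def structured_db_error(message, sql, func, errno, server):
    """Build a structured log record so the DB error number and the
    calling method are queryable fields rather than being buried in
    the free-form exception message. Field names are illustrative."""
    return {
        "channel": "exception",
        "exception_class": "Wikimedia\\Rdbms\\DBQueryError",
        "db_errno": errno,      # e.g. 1205: lock wait timeout exceeded
        "db_server": server,
        "db_function": func,    # e.g. "CommentStore::insertInternal"
        "db_query": sql,
        "message": message,
    }

record = structured_db_error(
    "Lock wait timeout exceeded; try restarting transaction",
    "INSERT INTO revision_comment_temp ...",
    "CommentStore::insertInternal",
    1205,
    "10.64.48.26",
)
print(json.dumps(record, indent=2))
```

With such fields, "all lock wait timeouts from CommentStore" becomes a simple filter instead of a free-text search over exception messages.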

As another possible followup, I wonder how much sense the current group1 / group2 split makes. Commons and Wikidata are huge wikis, on par with the largest Wikipedias, with highly specialized extensions. Commons problems tend to take a lot of effort to clean up (MediaWiki is not very good at file-related error handling) and Wikidata problems can spill over to the Wikipedias. Maybe those two should be group2, and one or more other large Wikipedias should be moved to group1 instead?

In T198350#4323348, @Tgr wrote:

As a possible followup, DB error logging could be improved. It would have been convenient if we could have queried the exact DB error or the method where the error got triggered; these things are included in the logstash record, but not in a structured form.

As another possible followup, I wonder how much sense the current group1 / group2 split makes. Commons and Wikidata are huge wikis, on par with the largest Wikipedias, with highly specialized extensions. Commons problems tend to take a lot of effort to clean up (MediaWiki is not very good at file-related error handling) and Wikidata problems can spill over to the Wikipedias. Maybe those two should be group2, and one or more other large Wikipedias should be moved to group1 instead?

Both good points, Gergo. I'll file a task regarding the second; can you do the same for the first?

@Tgr's fix worked. Thanks again, @Tgr! I'll remove this as a blocker but leave it open for tracking related cleanup work.

Should we close this and create individual tickets for the follow up tasks instead?

I'm not sure how much can be done about those

May I suggest an actionable? I agree that in the case of an outage, things like new uploads and new content can be lost. I don't think there is much we can do about that, aside from giving clear error messages, encouraging users to retry the uploads after some time, etc. If there is a failure in the software, especially one that we cannot predict, there is certainly not much that can be done by the software itself (I guess browsers could do some offline, local JS trickery, but I would consider that out of scope).

The thing that actually worries me, and which apparently happened in the last two Commons outages, is that 10-year-old images are lost when they are overwritten. I think something can be done in those cases, not to avoid the failure, but to avoid the data loss. It is my understanding that the files have not been lost, only renamed on Swift, but the database failed to be updated (or is in some anomalous in-between state). I think we could find those cases and fix them in the database, and also change the process so that things are done in a different order (e.g. the file is copied first, then the database is updated, then the old file is removed) without redoing the whole file upload process. We can test detection with the two examples at T198177, and then find past cases, so even if we don't make code changes, at least we can manually restore things in case of a data loss.
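The reordering suggested above (copy first, then update the database, then remove) can be sketched as follows. This is a minimal Python stand-in, not MediaWiki's actual upload code; the file copies and the `update_db` callback are placeholders for the Swift and database operations:

```python
import os
import shutil

def safe_overwrite(path, new_path, update_db):
    """Overwrite `path` with the file at `new_path` without ever being
    in a state where the original is gone but the new state is not
    committed:
      1. copy the original aside,
      2. install the new file,
      3. update the database,
      4. only then discard the backup.
    On any failure, the original file is restored."""
    backup = path + ".bak"
    shutil.copy2(path, backup)          # step 1: preserve the original
    try:
        shutil.copy2(new_path, path)    # step 2: install the new file
        update_db()                     # step 3: commit the metadata
    except Exception:
        shutil.move(backup, path)       # failure: roll the file back
        raise
    os.remove(backup)                   # step 4: cleanup
```

The point of the ordering is that a crash at any single step leaves either the old file or both files on disk, never neither, so recovery is a matter of bookkeeping rather than data loss.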

Another 50 files with no metadata pages were uploaded yesterday. See Category:Files with no description pages (uploaded June 28, 2018).

Change 443083 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Warn about unused join conditions

https://gerrit.wikimedia.org/r/443083

The thing that actually worries me, and which apparently happened in the last two Commons outages, is that 10-year-old images are lost when they are overwritten.

That certainly is a sad state of affairs and could be improved by some sort of journaling strategy, where a list of pending file operations is persisted on disk first and can be used for recovery in case of an error (IIRC something like that already exists for the FileBackend but is not used by the higher levels for recovery).
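The journaling idea can be reduced to a small sketch: write the pending operation list durably before executing it, and remove the journal only once every operation has succeeded. This Python sketch is purely illustrative; FileBackend's actual journal format and recovery semantics differ:

```python
import json
import os

JOURNAL = "fileops.journal"  # illustrative path; a real journal would live with the backend

def run_with_journal(ops, execute):
    """Persist the pending operation list before running it, so that
    after a crash the journal reveals which batch was in flight.
    The journal is removed only after every op has succeeded."""
    with open(JOURNAL, "w") as f:
        json.dump(ops, f)
        f.flush()
        os.fsync(f.fileno())   # make the journal durable before acting
    for op in ops:
        execute(op)
    os.remove(JOURNAL)

def recover():
    """A surviving journal means the last batch may be incomplete
    and should be inspected or replayed; returns it, or None."""
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            return json.load(f)
    return None
```

The key property is that the journal outlives a failed batch, so cleanup becomes "look at the journal and reconcile" instead of "guess which files are in an in-between state".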

As for recovery, the last such attempt was in T71311. It's a massive time sink for relatively little benefit, though.

I remember discussing how the flattened array would interact with the joins, I think with @Anomie, but I can't find that discussion right now. Apparently we got it wrong in the end.

It looks like that was at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/406595/47/includes/page/WikiPage.php#2884.

You later added a comment to check on that in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/406595/62/includes/page/WikiPage.php#2889 (oddly, the comment was on PS62 but PS72 was current). I don't see any other comments on the code before it was merged.

In T198350#4323364, @greg wrote:

both good points Gergo. I'll file a task regarding the second, can you do similar for the first?

Done: T198750: Improve logging of MediaWiki DB errors
Also: T198755: Log the query that caused a lock timeout although I suspect this is unrealistic.

Can this be closed? The issue at hand is fixed, as far as I know.

Follow-ups should get their own tickets.

Just do it, folks :)

mmodell changed the subtype of this task from "Release" to "Production Error".Aug 28 2019, 11:09 PM

Change #443083 abandoned by Daniel Kinzler:

[mediawiki/core@master] Warn about unused join conditions

Reason:

https://gerrit.wikimedia.org/r/443083