
Database primary master failover on s8 (wikidatawiki)
Closed, Resolved, Public

Description

We need to replace the current primary database master for wikidatawiki.
This host is old and out of warranty, so it needs to be decommissioned. In addition, we need a host with a bigger disk to be able to continue with the wb_terms table redesign (T221764).

We would need a 30-minute read-only window for Wikidatawiki.

Date: Tue 30th July
Time: 05:00 AM UTC - 05:30 AM UTC (if everything goes as planned, we would not use the full 30-minute window)

Impact: All Wikidatawiki will go read-only. No edits will be allowed. Reads will not be impacted.

Event Timeline

Marostegui triaged this task as Medium priority. Jul 4 2019, 9:18 AM

Thank you!

Johan moved this task from Backlog to Started on the MoveComms-Support (Jul-Sep-2019) board.

Is there an existing procedure for reflecting the page moves and deletions that will occur during the period that Wikidata is read-only?

Items are updated for moves via the job queue, so unless I’m mistaken the job should fail while the wiki is read-only and be retried automatically at a later time.

Could someone confirm what @Lucas_Werkmeister_WMDE is saying about the job retry? (Asking since he's hedging with "unless I'm mistaken".) Would prefer this to be clear when we communicate this.

In T227063#5320325, @Johan wrote:

Could someone confirm what @Lucas_Werkmeister_WMDE is saying about the job retry? (Asking since he's hedging with "unless I'm mistaken".) Would prefer this to be clear when we communicate this.

Lucas is right: technically, any job that fails for any reason gets retried until it passes or reaches the max number of failures (it's 30, I think). But strangely UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit, it just sends a debug message and acts like nothing happened in saveChanges. Is that intentional, @hoo, or do you think we should change the behavior?

Thanks! And since UpdateRepoOnMoveJob doesn't fail and won't try again, the page move wouldn't actually be reflected in Wikidata? Or am I missing what function it has?

In T227063#5323883, @Johan wrote:

Thanks! And since UpdateRepoOnMoveJob doesn't fail and won't try again, the page move wouldn't actually be reflected in Wikidata? Or am I missing what function it has?

Yes, we might fix that or we might say it's a bearable loss. I don't know enough context to say which one is better.

@Ladsgroup @hoo Either way, it'd be great if we could come to a conclusion on which, so we know what to tell the communities.

(I mean, I'd certainly prefer if pages weren't lost for years in the language links because of a move during these minutes, but I don't know what the cost of making sure that wouldn't happen would be.)

but strangely UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit, it just sends a debug message and acts like nothing happened in saveChanges

More specifically, saveChanges() does return true/false to indicate if everything’s okay, but run() ignores the return value and unconditionally returns true itself. I don’t know if this was intentional at the time, but I also think it would probably be better to retry such cases (i. e. return $this->saveChanges( $item, $user );).
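To make the difference concrete, here is a minimal sketch in Python of the two behaviours. It is a toy model, not the Wikibase PHP code: `ToyWiki`, `run_ignoring_result`, `run_propagating_result`, `execute_with_retries` and the cap of 3 attempts are all hypothetical names and values chosen for illustration, and it only assumes that the job queue retries a job whose run returns false.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # hypothetical cap for the sketch; the real limit is guessed at ~30 above


@dataclass
class ToyWiki:
    """Stand-in for the repo: read-only for the first `read_only_runs` save attempts."""
    read_only_runs: int
    saves: int = 0

    def save_changes(self) -> bool:
        # Like the saveChanges() described above: signals failure via its return value.
        if self.read_only_runs > 0:
            self.read_only_runs -= 1
            return False
        self.saves += 1
        return True


def run_ignoring_result(wiki: ToyWiki) -> bool:
    """Models the reported behaviour: the save result is dropped, the job reports success."""
    wiki.save_changes()
    return True


def run_propagating_result(wiki: ToyWiki) -> bool:
    """Models the suggested change: the job reports whatever the save reported."""
    return wiki.save_changes()


def execute_with_retries(run, wiki: ToyWiki, max_attempts: int = MAX_ATTEMPTS) -> int:
    """Toy job runner: re-run until the job reports success or attempts run out.

    Returns the number of attempts that were made.
    """
    for attempt in range(1, max_attempts + 1):
        if run(wiki):
            return attempt
    return max_attempts
```

With a wiki that is read-only for the first two attempts, the first variant stops after one attempt with nothing saved, while the second keeps retrying and saves on the third attempt.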

Can we see those debug messages anywhere, by the way? I assume the X-Wikimedia-Debug header doesn’t help us with jobs, and I’m not sure if debug-level messages are usually saved elsewhere (except on testwiki and test2wiki, apparently, which aren’t Wikibase repositories).

but strangely UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit, it just sends a debug message and acts like nothing happened in saveChanges

More specifically, saveChanges() does return true/false to indicate if everything’s okay, but run() ignores the return value and unconditionally returns true itself. I don’t know if this was intentional at the time, but I also think it would probably be better to retry such cases (i. e. return $this->saveChanges( $item, $user );).

But that would also retry on, for example, sitelink conflicts (another item already has that sitelink), but I guess that's bearable.

Why are jobs even being run during read-only time? After quickly skimming the job queue code, this shouldn't happen AFAICT. Is there a task or documentation about that?

@hoo Shouldn't happen, as in "won't happen, we're worrying unnecessarily"?

So far posted in:

I'll figure out what to do about banners and include it in the issue of Tech News the week of the read-only period, and then we should be done with this part of the preparations.

In T227063#5332149, @Johan wrote:

@hoo Shouldn't happen, as in "won't happen, we're worrying unnecessarily"?

It seems to me, yes.

I talked to @Trizek-WMF about banners. Since edits coming from other wikis will be run after the read-only period, which hopefully will be rather short, I don't think we need to do a banner for all Wikimedia wikis, but we'll set up one for Wikidata.

An information banner will be displayed on Wikidata, between 04:30 UTC and 05:30 UTC on the 30th of July.

If some messages are planned to be left on the wikis, here is the link for that banner's translations: https://meta.wikimedia.org/w/index.php?title=Special:Translate&group=Centralnotice-tgroup-read_only_banner&task=view&filter=%21translated&action=translate Main languages are translated (at least most of them), but more languages are always welcome.

The failover was done successfully.
read-only start: 05:00:50
read-only stop: 05:02:21

Total read-only time: 01:31 (1 minute 31 seconds)
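
For anyone who wants to check the figure, the total can be recomputed from the two timestamps above. This is just a small sketch; the helper name `read_only_duration` is made up for illustration and it assumes both times fall on the same day.

```python
from datetime import datetime


def read_only_duration(start: str, stop: str) -> str:
    """Return the elapsed time between two same-day HH:MM:SS stamps as MM:SS."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes:02d}:{seconds:02d}"


print(read_only_duration("05:00:50", "05:02:21"))  # prints 01:31
```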

Thanks for helping with the communication with the community!