Database primary master failover on s8 (wikidatawiki)
Closed, Resolved (Public)

Authored By

	Marostegui
	Jul 2 2019, 8:17 AM

Description

We need to replace the current primary database master for wikidatawiki.
This host is old and out of warranty, so it needs to be decommissioned. In addition, we need a host with a bigger disk to be able to continue with the wb_terms table redesign (T221764).

We would need a 30-minute read-only window for Wikidatawiki.

Date: Tue 30th July
Time: 05:00 AM UTC - 05:30 AM UTC (if everything goes as planned, we will not need the full 30-minute window)

Impact: All of Wikidatawiki will go read-only. No edits will be allowed. Reads will not be impacted.

Related Objects

Status	Assigned	Task
Resolved	RLazarus	T243314 FY2020-2021 Q1 DC switchover and switchback
Resolved	RLazarus	T243316 FY2020-2021 Q1 eqiad -> codfw switchover
Resolved	Marostegui	T186188 Failover DB masters in row D
Resolved	Addshore	T208425 [EPIC] Kill the wb_terms table
Resolved	Addshore	T221764 Overview of wb_terms redesign
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	Marostegui	T220170 Address Database hardware infrastructure blockers on datacenter switchover & multi-dc deployment
Resolved	Marostegui	T217396 Decommission db1061-db1073
Resolved	Marostegui	T227062 Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required)
Resolved	Johan	T227063 Database primary master failover on s8 (wikidatawiki)

Event Timeline

Marostegui created this task. Jul 2 2019, 8:17 AM

Lucas_Werkmeister_WMDE subscribed. Jul 2 2019, 10:14 AM

Keegan added a subscriber: Lea_Lacroix_WMDE. Jul 2 2019, 5:03 PM

Early announcement at https://discuss-space.wmflabs.org/t/very-early-announcement-wikidata-read-only-time-at-the-end-of-the-month/413

Thank you!

Quiddity moved this task from To Triage to In current Tech/News draft on the User-notice board. Jul 4 2019, 6:51 PM

alaa_wmde mentioned this in T219123: Migrate to and read from new store for item terms. Jul 5 2019, 1:02 PM

Johan claimed this task. Jul 9 2019, 11:56 AM

Restricted Application added a project: User-Johan. Jul 9 2019, 11:56 AM

Johan moved this task from Backlog to Do now on the User-Johan board. Jul 9 2019, 12:00 PM

Johan moved this task from Backlog to Started on the MoveComms-Support (Jul-Sep-2019) board.

Elitre awarded a token. Jul 9 2019, 12:38 PM

Johan moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board. Jul 9 2019, 1:08 PM

Is there an existing procedure for reflecting the page moves and deletions that will occur during the period that Wikidata is read-only?

Items are updated for moves via the job queue, so unless I’m mistaken the job should fail while the wiki is read-only and be retried automatically at a later time.

Could someone confirm what @Lucas_Werkmeister_WMDE is saying about the job retry? (Asking since he's hedging with "unless I'm mistaken".) Would prefer this to be clear when we communicate this.

In T227063#5320325, @Johan wrote:

Could someone confirm what @Lucas_Werkmeister_WMDE is saying about the job retry? (Asking since he's hedging with "unless I'm mistaken".) Would prefer this to be clear when we communicate this.

Lucas is right: technically, any job that fails for any reason gets retried until it passes or reaches the maximum number of failures (it's 30, I think). Strangely, though, UpdateRepoOnMoveJob (more precisely, UpdateRepoJob) doesn't fail if it can't edit; it just sends a debug message and acts like nothing happened in saveChanges. Is that intentional, @hoo, or do you think we should change the behavior?
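
The retry behaviour described here can be sketched with a small Python model. This is not the MediaWiki job queue code: `run_with_retries`, `MAX_ATTEMPTS`, and the job callables are hypothetical names, and the limit of 30 is only the figure guessed at in the comment above.

```python
from typing import Callable

# Assumed retry limit; the comment above only says "it's 30 I think".
MAX_ATTEMPTS = 30


def run_with_retries(job: Callable[[], bool], max_attempts: int = MAX_ATTEMPTS) -> int:
    """Run a job until it reports success or the attempt limit is reached.

    Returns the number of attempts that were made.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if job():  # True means "job reports success", so stop retrying
            break
    return attempts


def make_flaky_job(failures_before_success: int) -> Callable[[], bool]:
    """Build a job that reports failure a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def job() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return job


def always_reports_success() -> bool:
    """A job that could not do its work but still reports success."""
    return True
```

In this model, a job that reports success on its first run, as `always_reports_success` does, is attempted once and never retried, whatever actually happened inside it.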

Thanks! And since UpdateRepoOnMoveJob doesn't fail and won't try again, the page move wouldn't actually be reflected in Wikidata? Or am I missing what function it has?

In T227063#5323883, @Johan wrote:

Thanks! And since UpdateRepoOnMoveJob doesn't fail and won't try again, the page move wouldn't actually be reflected in Wikidata? Or am I missing what function it has?

Yes, we might fix that or we might say it's a bearable loss. I don't know enough context to say which one is better.

@Ladsgroup @hoo Either way, it'd be great if we could come to a conclusion on which, so we know what to tell the communities.

(I mean, I'd certainly prefer if pages weren't lost for years in the language links because of a move during these minutes, but I don't know what the cost of making sure that wouldn't happen would be.)

but strangely UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit, it just sends a debug message and acts like nothing happened in saveChanges

More specifically, saveChanges() does return true/false to indicate if everything’s okay, but run() ignores the return value and unconditionally returns true itself. I don’t know if this was intentional at the time, but I also think it would probably be better to retry such cases (i.e. return $this->saveChanges( $item, $user );).

Can we see those debug messages anywhere, by the way? I assume the X-Wikimedia-Debug header doesn’t help us with jobs, and I’m not sure if debug-level messages are usually saved elsewhere (except on testwiki and test2wiki, apparently, which aren’t Wikibase repositories).
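
The difference between the two run() variants discussed above can be sketched as follows. This is a simplified Python model, not the Wikibase PHP source: `UpdateRepoJobModel`, `read_only`, `run_current`, and `run_proposed` are hypothetical names, and `save_changes` only stands in for saveChanges().

```python
class UpdateRepoJobModel:
    """Toy stand-in for a job whose edit can fail while the wiki is read-only."""

    def __init__(self, read_only: bool) -> None:
        # Assumed flag that models the read-only window.
        self.read_only = read_only

    def save_changes(self) -> bool:
        """Return True if the edit was saved, False otherwise."""
        return not self.read_only

    def run_current(self) -> bool:
        """Behaviour as described above: the result is ignored and success is reported."""
        self.save_changes()
        return True

    def run_proposed(self) -> bool:
        """Suggested change: propagate the result so a failed save can be retried."""
        return self.save_changes()
```

With `read_only=True`, `run_current()` still reports success, while `run_proposed()` reports failure and would leave the job eligible for a retry.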

In T227063#5328194, @Lucas_Werkmeister_WMDE wrote:

but strangely UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit, it just sends a debug message and acts like nothing happened in saveChanges

More specifically, saveChanges() does return true/false to indicate if everything’s okay, but run() ignores the return value and unconditionally returns true itself. I don’t know if this was intentional at the time, but I also think it would probably be better to retry such cases (i.e. return $this->saveChanges( $item, $user );).

But that would also retry on, for example, sitelink conflicts (another item already has that sitelink), but I guess that's bearable.

Why are jobs even being run during read-only time? After quickly skimming the job queue code, this shouldn't happen AFAICT. Is there a task/documentation about that?

@hoo Shouldn't happen, as in "won't happen, we're worrying unnecessarily"?

So far posted in:

I'll figure out what to do about banners and include it in the issue of Tech News the week of the read-only period, and then we should be done with this part of the preparations.

In T227063#5332149, @Johan wrote:

@hoo Shouldn't happen, as in "won't happen, we're worrying unnecessarily"?

It seems to me, yes.

I talked to @Trizek-WMF about banners. Since edits coming from other wikis will be run after the read-only period, which hopefully will be rather short, I don't think we need to do a banner for all Wikimedia wikis, but we'll set up one for Wikidata.

An information banner will be displayed on Wikidata, between 04:30 UTC and 05:30 UTC on the 30th of July.

If some messages are planned to be left on the wikis, here is the link for that banner's translations: https://meta.wikimedia.org/w/index.php?title=Special:Translate&group=Centralnotice-tgroup-read_only_banner&task=view&filter=%21translated&action=translate Main languages are translated (at least most of them), but more languages are always welcome.

The failover was done successfully.
read-only start: 05:00:50
read-only stop: 05:02:21

Total read-only time: 1 minute 31 seconds
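
As a quick check on the total, the duration can be recomputed from the two timestamps above. A minimal Python sketch; `read_only_seconds` is a hypothetical helper, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def read_only_seconds(start: str, stop: str) -> int:
    """Return the number of seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# 05:00:50 to 05:02:21 is 91 seconds, i.e. 1 minute 31 seconds.
print(read_only_seconds("05:00:50", "05:02:21"))
```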

Thanks for helping with the communication with the community!

Johan moved this task from Do now to Archive on the User-Johan board. Jul 30 2019, 8:02 AM

Elitre moved this task from Started to Evaluated on the MoveComms-Support (Jul-Sep-2019) board. Jul 31 2019, 9:07 AM

Ixocactus subscribed. Sep 17 2019, 4:35 AM

Veracious awarded a token. Sep 17 2019, 5:11 AM

Ladsgroup edited projects, added User-notice-archive; removed User-notice. Aug 13 2022, 1:54 PM
