
March 2023 Datacenter Switchover
Closed, Resolved (Public)

Description

This is the meta task for the March 2023 Datacenter switchover (eqiad -> codfw).

Switchover

Schedule

Services: Tuesday, February 28th, 2023 14:00 UTC
Traffic: Tuesday, February 28th, 2023 15:00 UTC
MediaWiki: Wednesday, March 1st, 2023 14:00 UTC

Repooling

Traffic repooling of eqiad: Wednesday, March 8th, 2023
restbase-async eqiad pooling: Wednesday, March 8th, 2023
Services/mediawiki-RO eqiad pooling: Tuesday, March 14th, 2023

Checklist

See also:

Switchback

Services: Tuesday, April 25th, 2023 14:00 UTC
Switching back: Wednesday, April 26th, 2023 14:00 UTC

Checklist

Related Objects

Status Subtype Assigned Task
Resolved Clement_Goubert
Resolved Trizek-WMF
Resolved Clement_Goubert
Resolved RLazarus
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved BUG REPORT Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Open Clement_Goubert
Open None
Open None
Open None
Resolved Marostegui
Resolved Andrew
Resolved Marostegui
Resolved Andrew
Declined Andrew
Resolved Andrew
Resolved Andrew
Resolved Ladsgroup
Duplicate None
Resolved Bstorm
Declined None
Resolved taavi
Resolved Jdforrester-WMF
Declined None
Open jijiki
Open None
Resolved jbond
Open None
Open None
Resolved BUG REPORT Clement_Goubert
In Progress Clement_Goubert
Open None
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved eoghan
Resolved eoghan
Resolved jbond
Resolved Dzahn
Resolved Dzahn
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Marostegui
Resolved Clement_Goubert
Declined Dzahn
Resolved ayounsi
Invalid Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Clement_Goubert
Resolved Clement_Goubert
Resolved Clement_Goubert
Open None
Resolved cmooney
Resolved cmooney
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved Marostegui
Resolved ayounsi
Resolved Ladsgroup
Resolved herron
Resolved herron
Declined herron
Open herron
Resolved Jclark-ctr
Resolved Jclark-ctr
Resolved Joe
Resolved Cmjohnson
Resolved Jclark-ctr
Resolved Request Jclark-ctr
Resolved sgrabarczuk
Resolved Clement_Goubert
Resolved Marostegui
Resolved Marostegui

Event Timeline

In T327920#8647481, @Tgr wrote:

MwHttpRequest (that is, Guzzle/php-curl) and the URLs from https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews. I don't know if RESTBase is involved in that in some way. If you mean the VirtualRESTServiceClient in MediaWiki, that's not used (so there is no parallelism). See WikimediaPageViewService for the code.

Those URLs are RESTBase, all right. And if they are being used as-is from that page, that is:

  • e.g. GET https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

vs the internal service-mesh

  • e.g. http://localhost:6011/wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

then MediaWiki goes via the edge caches, incurring some extra latency since there are another 3 hops before getting to the actual serving cluster. In any case, that is something to fix after the switchover. Thanks for the heads up, we'll keep it in mind in case we have big issues.
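To make the difference between the two URL forms above concrete, here is a minimal Python sketch that builds the same per-article pageviews request against either base. The two base URLs are copied from the examples in this comment; the function name, parameter names, and defaults are hypothetical and only for illustration. This is not the code in WikimediaPageViewService.

```python
from urllib.parse import quote

# Base URLs taken from the two examples in the comment above.
PUBLIC_BASE = "https://wikimedia.org/api/rest_v1"       # goes through the edge caches
MESH_BASE = "http://localhost:6011/wikimedia.org/v1"    # internal service-mesh listener


def pageviews_per_article_url(base, project, article, start, end,
                              access="all-access", agent="all-agents",
                              granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    `start` and `end` are timestamps in the form used by the examples
    above, e.g. "2015100100".
    """
    # Percent-encode the title so it is safe as a single path segment.
    title = quote(article, safe="")
    return (
        f"{base}/metrics/pageviews/per-article/"
        f"{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"
    )


public_url = pageviews_per_article_url(
    PUBLIC_BASE, "en.wikipedia", "Albert_Einstein", "2015100100", "2015103100")
mesh_url = pageviews_per_article_url(
    MESH_BASE, "en.wikipedia", "Albert_Einstein", "2015100100", "2015103100")
```

Only the base changes between the two calls, which is consistent with the later remark that switching away from the public URL should be a small fix.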

Mentioned in SAL (#wikimedia-operations) [2023-03-01T13:10:04Z] <claime> Adding scheduled maintenance for switchover to statuspage - T327920

Mentioned in SAL (#wikimedia-operations) [2023-03-01T13:31:04Z] <claime> Locking scap deployments for datacenter switchover - T327920

Mentioned in SAL (#wikimedia-operations) [2023-03-01T13:40:10Z] <claime> Starting mediawiki datacenter switchover step 0 - T327920

Change 891552 merged by Clément Goubert:

[operations/dns@master] db: Switch dns master alias to codfw

https://gerrit.wikimedia.org/r/891552

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:08:18Z] <claime> Phase 9.5 Update DNS records for new database masters - T327920

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:09:56Z] <claime> Phase 9.5 DNS records for new database masters updated - T327920

Change 893479 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update pcX DNS

https://gerrit.wikimedia.org/r/893479

Change 892428 merged by jenkins-bot:

[operations/mediawiki-config@master] debug.json: List primary DC servers first

https://gerrit.wikimedia.org/r/892428

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:18:24Z] <cgoubert@deploy2002> Started scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]]

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:20:30Z] <cgoubert@deploy2002> cgoubert: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet

Change 893479 merged by Marostegui:

[operations/dns@master] wmnet: Update pcX DNS

https://gerrit.wikimedia.org/r/893479

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:26:18Z] <cgoubert@deploy2002> Finished scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] (duration: 07m 54s)

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:27:02Z] <claime> End mediawiki datacenter switchover - T327920

Change 893675 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/dns@master] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002

https://gerrit.wikimedia.org/r/893675

Change 893675 merged by Clément Goubert:

[operations/dns@master] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002

https://gerrit.wikimedia.org/r/893675

And if they are being used as-is from that page, that is:

  • e.g. GET https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

vs the internal service-mesh

  • e.g. http://localhost:6011/wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

then MediaWiki goes via the edge caches, incurring some extra latency since there are another 3 hops before getting to the actual serving cluster.

Yeah, it uses the public URL. Ping me when it is a good time to fix that, it should be trivial.

Mentioned in SAL (#wikimedia-operations) [2023-04-24T10:27:40Z] <claime> Datacenter switchover live testing setting db to read-only and back in eqiad - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-24T10:29:58Z] <claime> Datacenter switchover live testing setting db to read-only and back in eqiad successful - T327920

Change 911780 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.switchdc.mediawiki: Add mw-api-int to mediawiki services

https://gerrit.wikimedia.org/r/911780

Change 911780 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Add mw-api-int to mediawiki services

https://gerrit.wikimedia.org/r/911780

Change 912171 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update parsercache CNAME

https://gerrit.wikimedia.org/r/912171

Change 912235 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] db: Switch dns master alias to eqiad

https://gerrit.wikimedia.org/r/912235

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:13:59Z] <claime> Locking scap for datacenter switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:15:43Z] <cgoubert@deploy1002> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:23:27Z] <claime> Starting mediawiki datacenter switchback preparation - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:45:16Z] <claime> Stopping maintenance scripts for datacenter switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:59:49Z] <claime> Going to read-only for mediawiki datacenter switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:05:00Z] <claime> Restarting maintenance jobs - T327920

Change 912235 merged by Clément Goubert:

[operations/dns@master] db: Switch dns master alias to eqiad

https://gerrit.wikimedia.org/r/912235

Change 912302 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update parsercache CNAME

https://gerrit.wikimedia.org/r/912302

Change 912171 abandoned by Marostegui:

[operations/dns@master] wmnet: Update parsercache CNAME

Reason:

I messed up the rebase, so abandoning in favour of 912302

https://gerrit.wikimedia.org/r/912171

Change 912302 merged by Marostegui:

[operations/dns@master] wmnet: Update parsercache CNAME

https://gerrit.wikimedia.org/r/912302

Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:16:00Z] <marostegui> Update dns for parsercache T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:24:46Z] <cgoubert@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchback - T327920 (duration: 69m 03s)
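The durations reported in the SAL entries can be cross-checked against the logged timestamps. The sketch below parses the timestamp, author, and message out of a SAL line as it appears in this timeline and computes the elapsed time between two entries. The regular expression and the helper names are assumptions made for illustration, not part of any SAL tooling, and logged timestamps need not always match a tool's own reported duration to the second.

```python
import re
from datetime import datetime, timezone

# Matches "[2023-04-26T13:15:43Z] <who> message" as seen in the entries above.
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\] <(?P<who>[^>]+)> (?P<msg>.*)"
)


def parse_sal(line):
    """Return (timestamp, author, message) for one SAL line (hypothetical helper)."""
    m = SAL_RE.search(line)
    if m is None:
        raise ValueError(f"not a SAL entry: {line!r}")
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return ts, m["who"], m["msg"]


def format_duration(delta):
    """Format a timedelta as 'MMm SSs', the style used in the log entries."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes:02d}m {seconds:02d}s"


locked = ("Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:15:43Z] "
          "<cgoubert@deploy1002> Locking from deployment [ALL REPOSITORIES]: "
          "Datacenter Switchback - T327920")
unlocked = ("Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:24:46Z] "
            "<cgoubert@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: "
            "Datacenter Switchback - T327920 (duration: 69m 03s)")

lock_start, who, _ = parse_sal(locked)
lock_end, _, _ = parse_sal(unlocked)
lock_duration = format_duration(lock_end - lock_start)
print(who, lock_duration)
```

For the two lock entries above, the computed value agrees with the duration that the unlock entry itself reports.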

Clement_Goubert updated the task description.