
Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC
Closed, Resolved, Public

Description

db1069 is currently the x1 master and needs to be failed over to db1120 (a newer and more powerful host).
db1069 has been suffering intermittent memory issues (T201133), and its disks are in predictive failure.
This is a very old host that needs decommissioning (T217396), as it has been out of warranty for a long time.

x1 cannot be set to read-only at the MediaWiki level, so it will need to be done in MySQL itself.
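
For illustration only, setting read-only at the MySQL level could look roughly like the sketch below. The host name is taken from this task; `DRY_RUN` and `run_sql` are hypothetical helpers introduced here, not part of any existing tooling. In practice the failover script takes care of this step.

```shell
#!/usr/bin/env bash
# Sketch only. OLD_MASTER comes from this task; DRY_RUN and run_sql are
# hypothetical names introduced for this example.
OLD_MASTER="db1069.eqiad.wmnet"
DRY_RUN="${DRY_RUN:-1}"

run_sql() {
  # In dry-run mode (the default) only print the statement;
  # otherwise send it to the given host with the mysql client.
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $1: $2"
  else
    mysql -h "$1" -e "$2"
  fi
}

# Block writes from non-privileged users on the current master,
# then read the variable back to confirm the change.
run_sql "$OLD_MASTER" "SET GLOBAL read_only = 1"
run_sql "$OLD_MASTER" "SELECT @@global.read_only"
```

With the default `DRY_RUN=1` nothing is sent to the server, so the statements can be reviewed before the maintenance window.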

I am tagging Growth-Team, Language-Team, Release-Engineering-Team and Cognate, like we did last time.
The expected downtime is around 1 minute.

Apart from the flowdb database, the tables on the x1 wikis are:

+-------------------+
| Tables_in_enwiki  |
+-------------------+
| aft_feedback      |
| echo_email_batch  |
| echo_event        |
| echo_notification |
| echo_target_page  |
+-------------------+

The procedure will be:

  • Move all the slaves under db1120 (new master)
  • Run the failover script (this takes around 3 seconds and will set MySQL to read-only before switching the master)
  • Deploy the MediaWiki config with the new master in place (the change will already be merged and will be deployed with --force, so I expect it to take around 30 seconds)

When: 3rd July at 06:00 AM UTC
Impact: Writes will be blocked for around 1 minute, reads WILL NOT be affected

Event Timeline

Marostegui triaged this task as Medium priority. Jun 24 2019, 7:59 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 518651 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1120

https://gerrit.wikimedia.org/r/518651

Change 518651 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1120

https://gerrit.wikimedia.org/r/518651

Mentioned in SAL (#wikimedia-operations) [2019-06-24T08:08:48Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1120 for upgrade T226358 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-06-24T08:09:18Z] <marostegui> Stop MySQL on db1120 for upgrade - T226358

Mentioned in SAL (#wikimedia-operations) [2019-06-24T08:24:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1120 after upgrade T226358 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-06-24T08:52:19Z] <marostegui> Upgrade Mysql on db1140 (checked that all snapshots backups are done) - T226358

A new thing that is also affected is the URL shortener. FYI.

Thanks @Ladsgroup!
We have always talked about documenting who and which teams to tag when planning x1 switchovers, so I have created this https://wikitech.wikimedia.org/wiki/MariaDB#Special_section:_x1_master_switchover (based on this task and the previous ones).

Updated for the URL shortener.

tgr@stat1006:~$ analytics-mysql enwiki --use-x1

mysql:research@dbstore1005.eqiad.wmnet [enwiki]> select distinct table_name from information_schema.tables where table_schema in (select table_schema from information_schema.tables where table_name = 'echo_target_page');
+-----------------------+
| table_name            |
+-----------------------+
| _echo_target_page_new |
| echo_email_batch      |
| echo_event            |
| echo_notification     |
| echo_target_page      |
| aft_feedback          |
+-----------------------+

mysql:research@dbstore1005.eqiad.wmnet [enwiki]> select distinct table_schema from information_schema.tables where table_schema not in (select table_schema from information_schema.tables where table_name = 'echo_target_page');
+--------------------+
| table_schema       |
+--------------------+
| cognate_wiktionary |
| flowdb             |
| information_schema |
| votewiki           |
| wikishared         |
+--------------------+
5 rows in set (0.06 sec)

mysql:research@dbstore1005.eqiad.wmnet [enwiki]> show tables from wikishared;
+---------------------------------------+
| Tables_in_wikishared                  |
+---------------------------------------+
| bounce_records                        |
| cx_corpora                            |
| cx_lists                              |
| cx_suggestions                        |
| cx_translations                       |
| cx_translators                        |
| echo_unread_wikis                     |
| reading_list                          |
| reading_list_entry                    |
| reading_list_project                  |
| urlshortcodes                         |
| wikimedia_editor_tasks_counts         |
| wikimedia_editor_tasks_keys           |
| wikimedia_editor_tasks_targets_passed |
+---------------------------------------+

So the list of impacted components is Cognate, StructuredDiscussions (Flow), MediaWiki-extensions-BounceHandler, ContentTranslation, Reading List Service, MediaWiki-extensions-UrlShortener, WikimediaEditorTasks. (aft_feedback is ArticleFeedbackv5, but I'm pretty sure that's just lack of cleanup for an undeployed extension.) Might want to add @Mholloway to the notification list, for Wikimedia Editor Tasks. (@Legoktm is marked as maintainer of BounceHandler so that's already covered.)

Thanks a lot @Tgr, I will tag those (better to tag them, and they can remove themselves if it no longer applies) and update the documentation accordingly.
Thanks again, very useful!

Change 519185 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1120 to x1 master

https://gerrit.wikimedia.org/r/519185

Change 519186 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Change x1-master to the new master

https://gerrit.wikimedia.org/r/519186

Change 519187 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1120 to x1 master

https://gerrit.wikimedia.org/r/519187

Regarding Cognate going read-only, I want to point to T187960#4998807 (I can run the maintenance script after the switchover is done, just ping me) and T194141 (it can be problematic, but I'm not sure). CC @Lydia_Pintscher @WMDE-leszek @darthmon_wmde

@Ladsgroup I believe that last time it wasn't necessary, but I am not 100% sure

I can run it, it's fine. Just drop me a ping

Will do - thanks!

Trizek-WMF subscribed.

@Marostegui, which wikis are affected? Only English Wikipedia?
Do you need to display a banner too?

All the wikis really, as x1 holds the echo_* tables (described above) for all the wikis.

Banner set. It will be displayed starting at 05:00 UTC July 3 on all wikis. End at 06:20 UTC.

Mentioned in SAL (#wikimedia-operations) [2019-07-03T05:02:01Z] <marostegui> Start pre-failover steps for x1 - T226358

Change 519185 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1120 to x1 master

https://gerrit.wikimedia.org/r/519185

Change 519187 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1120 to x1 master

https://gerrit.wikimedia.org/r/519187

Mentioned in SAL (#wikimedia-operations) [2019-07-03T06:00:24Z] <marostegui> Starting x1 failover from db1069 to db1120 - T226358

Mentioned in SAL (#wikimedia-operations) [2019-07-03T06:01:57Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover x1 master eqiad from db1069 to db1120 T226358 (duration: 00m 27s)

Change 519186 merged by Marostegui:
[operations/dns@master] wmnet: Change x1-master to point to the new master

https://gerrit.wikimedia.org/r/519186

This was done.
Read only start: 06:00:36 UTC
Read only stop: 06:01:56 UTC
Total read only time: 01:20 min
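
As a quick cross-check of the figure above, the read-only window can be recomputed from the two timestamps. This is a minimal sketch that assumes GNU `date` (for the `-d` option); the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Timestamps copied from the read-only report in this task (UTC).
start="2019-07-03 06:00:36"
stop="2019-07-03 06:01:56"

# Convert both timestamps to epoch seconds (GNU date syntax).
start_s=$(date -u -d "$start" +%s)
stop_s=$(date -u -d "$stop" +%s)

# Elapsed seconds, printed as MM:SS.
elapsed=$((stop_s - start_s))
printf 'Total read only time: %02d:%02d min\n' $((elapsed / 60)) $((elapsed % 60))
```

This prints `Total read only time: 01:20 min`, i.e. 80 seconds, slightly above the roughly 1 minute that was announced.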

Mentioned in SAL (#wikimedia-operations) [2019-07-03T10:36:34Z] <Amir1> start of ladsgroup@mwmaint1002:~$ foreachwikiindblist wiktionary extensions/Cognate/maintenance/populateCognatePages.php (T226358)