⚓ T337446 Rebuild sanitarium hosts

Subject	Repo	Branch	Lines +/-
wikireplicas: restore pybal monitoring	operations/puppet	production	+8 -12
db1155: Enable notifications	operations/puppet	production	+0 -1
db1154: Enable notifications	operations/puppet	production	+0 -1
service: Disable monitors for wikireplicas	operations/puppet	production	+15 -9
wiki-replicas.sql: Add meta_p GRANT	operations/puppet	production	+1 -0
wiki-replicas.sql: Add heartbeat_p	operations/puppet	production	+1 -0
db1156,db1161,db1196,db1212: Disable notifications	operations/puppet	production	+4 -0
wiki-replicas.sql: Create role	operations/puppet	production	+1 -0

Status	Assigned	Task
Resolved	• Marostegui	T337446 Rebuild sanitarium hosts
Invalid	None	T337721 Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact
Resolved	Ladsgroup	T337734 Investigate if maintain-replica-indexes is still needed
Resolved	• Marostegui	T337811 Check and enable GTID across sanitarium and clouddb* hosts

In T337446#8890117, @Marostegui wrote:

Thanks for the report. It was only on clouddb1021 but not on the others (as I did the transfer) before we found this issue. I have fixed it on the other two, sorry for the inconveniences. Lots of moving pieces on all this.

No apologies necessary, and thank you :-) confirmed working for me/the tool in question.

Is there an estimate for when things'll be fully restored? Y'all are great.

Nintendofan885 subscribed.May 30 2023, 7:59 PM

In T337446#8890288, @SWinxy wrote:

Is there an estimate for when things'll be fully restored? Y'all are great.

If nothing happens, tomorrow everything should be back.
However I will probably rebuild s4 tomorrow (and it will take probably 2-3 days) as I don't fully trust its data anymore since it broke earlier today - I fixed the row manually but there could be more stuff under the hood.

• Marostegui updated the task description. (Show Details)May 30 2023, 8:05 PM

• Marostegui updated the task description. (Show Details)

JJMC89 merged a task: T337805: Global user contributions failed.May 31 2023, 2:40 AM

JJMC89 added a subscriber: Lemonaka.

MJL awarded a token.May 31 2023, 2:41 AM

L3X1 subscribed.May 31 2023, 4:09 AM

• Marostegui updated the task description. (Show Details)May 31 2023, 4:55 AM

s1 is fully recloned, and catching up.

I am going to start with s4 to be on the safe side.

Mentioned in SAL (#wikimedia-operations) [2023-05-31T04:59:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1221 (sanitarium s4 master) T337446', diff saved to https://phabricator.wikimedia.org/P48640 and previous config saved to /var/cache/conftool/dbconfig/20230531-045927-root.json

• Marostegui updated the task description. (Show Details)May 31 2023, 5:02 AM

• Marostegui claimed this task.May 31 2023, 5:11 AM

FatalFit unsubscribed.May 31 2023, 5:55 AM

• Marostegui updated the task description. (Show Details)May 31 2023, 6:05 AM

Change 924772 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1154: Enable notifications

https://gerrit.wikimedia.org/r/924772

Change 924772 merged by Marostegui:

[operations/puppet@production] db1154: Enable notifications

https://gerrit.wikimedia.org/r/924772

• Marostegui updated the task description. (Show Details)May 31 2023, 7:24 AM

1AmNobody24 subscribed.May 31 2023, 7:33 AM

TheresNoTime mentioned this in T337829: Requesting access to ops (or wmcs-roots) for TheresNoTime.May 31 2023, 9:23 AM

• Marostegui updated the task description. (Show Details)May 31 2023, 10:09 AM

Ladsgroup awarded a token.May 31 2023, 10:11 AM

• Marostegui updated the task description. (Show Details)May 31 2023, 10:40 AM

• Marostegui updated the task description. (Show Details)May 31 2023, 11:01 AM

s4 on clouddb1021 has been recloned, added views, heartbeat, grants etc. Once it has caught up I will reclone the other two hosts from it.

• Marostegui updated the task description. (Show Details)May 31 2023, 1:50 PM

• Marostegui updated the task description. (Show Details)May 31 2023, 4:36 PM

• Marostegui updated the task description. (Show Details)May 31 2023, 4:38 PM

Ayoub_ subscribed.May 31 2023, 7:23 PM

TheresNoTime merged a task: T337888: guc.toolforge with database error s4.May 31 2023, 7:27 PM

TheresNoTime added a subscriber: doctaxon.

doctaxon added a project: TaxonBot.May 31 2023, 7:35 PM

doctaxon moved this task from Backlog to Webservice/DB on the TaxonBot board.May 31 2023, 7:38 PM

s4 has been fully recloned, clouddb1019:3314 is now catching up with its master

• Marostegui mentioned this in T337888: guc.toolforge with database error s4.May 31 2023, 8:24 PM

I'm not sure what's causing it (regarding s1), but I'm finding some bots are not returning up-to-date reports. With s1 down for 5 days, there should be a backlog of lengthy reports but I'm seeing short reports or none at all. Did every new edit since May 25th get restored and integrated? Sorry that I don't know the correct terminology.

In T337446#8894111, @Liz wrote:

I'm not sure what's causing it (regarding s1), but I'm finding some bots are not returning up-to-date reports. With s1 down for 5 days, there should be a backlog of lengthy reports but I'm seeing short reports or none at all. Did every new edit since May 25th get restored and integrated? Sorry that I don't know the correct terminology.

Can you give us more details about how to debug this? s1 data is up to date now so the reports should be providing the right data unless there's a queue and/or a cache layer somewhere

• Marostegui updated the task description. (Show Details)Jun 1 2023, 3:58 AM

Change 925286 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1155: Enable notifications

https://gerrit.wikimedia.org/r/925286

Change 925286 merged by Marostegui:

[operations/puppet@production] db1155: Enable notifications

https://gerrit.wikimedia.org/r/925286

LilianaUwU subscribed.Jun 1 2023, 5:42 AM

I am reducing the priority of this as all the hosts have been recloned now and data should be up to date.
We shouldn't be surprised if s6 and s8 (the sections that never break) end up breaking on the sanitarium hosts, as if the problem was 10.4.29, data might have been corrupted there and simply didn't show up yet.
I am going to do some data checking now on the recloned versions before closing this task, hopefully by Monday if everything goes fine in the next few days.

Things might still be slow on some of the tools as we are adding the special indexes used in wikireplicas, that can be tracked at T337734

Just posted on wikitech-l https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/WR36DST3Z6NV4YDZ4ODVPKHEK7VEGM4B/

Don-vip subscribed.Jun 1 2023, 11:24 AM

• Marostegui closed subtask T337811: Check and enable GTID across sanitarium and clouddb* hosts as Resolved.Jun 1 2023, 11:41 AM

In T337446#8886722, @MusikAnimal wrote:

Something should be in Tech News

Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).
From a skim of all the above, the best I can guess at (probably very inaccurate!) is:

For a few days last week, [readers/editors] in some regions experienced delays seeing edits being visible, which also caused problems for some tools. This was caused by problems in the secondary databases. This should now be fixed.

Production databases didn't have lag. Only cloud replicas but lags in order of days. Basically it stopped getting any updates for around a week due to data,integrity issues.

Hope that clears it a bit. (On phone, otherwise I would have made an exact phrase to change)

In T337446#8897087, @Quiddity wrote:

In T337446#8886722, @MusikAnimal wrote:

Something should be in Tech News

Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).
From a skim of all the above, the best I can guess at (probably very inaccurate!) is:

For a few days last week, [readers/editors] in some regions experienced delays seeing edits being visible, which also caused problems for some tools. This was caused by problems in the secondary databases. This should now be fixed.

Wikireplicas had outdated data and were unavailable for around 1 week. There were periods where not even old data was available.
Tools have most likely experienced intermittent unavailability since Wednesday past week until today. We are still adding indexes, so even though everything is up, slowness in certain tools can be experienced.

This outage didn't affect production.

The issue of slow down is resolved by now. There are still some replicas that don't have the index yet but all of them are depooled so no user-facing slowdown anymore

In T337446#8897087, @Quiddity wrote:

Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).

Some tools and bots returned outdated information due to database breakage, and may have been down entirely while it was being fixed. These issues have now been fixed.

Possibly could link to https://en.wikipedia.org/wiki/Wikipedia:Replication_lag but that's English-only.

Thank you immensely @Legoktm that's exactly what I needed. :) Now added. If anyone has changes, please make them directly there, within the next ~23 hours. Thanks.

Quiddity moved this task from To Triage to In current Tech/News draft on the User-notice board.Jun 1 2023, 8:18 PM

Guycn2 unsubscribed.Jun 1 2023, 10:01 PM

Lemonaka awarded a token.Jun 2 2023, 3:25 AM

Ladsgroup closed subtask T337734: Investigate if maintain-replica-indexes is still needed as Resolved.Jun 2 2023, 10:08 AM

The hosts have been fully rebuilt and working as expected without any major replag anymore. The indexes have been added too. So I'm closing this. Some follow ups are needed (like T337961) but the user-facing parts are done. Sorry for the inconvenience and a major wikilove to @Marostegui who worked day and night in the last week and weekend to get everything back to normal.

MusikAnimal awarded a token.Jun 2 2023, 1:29 PM

In T337446#8898231, @Ladsgroup wrote:

The hosts have been fully rebuilt and working as expected without any major replag anymore. The indexes have been added too. So I'm closing this. Some follow ups are needed (like T337961) but the user-facing parts are done. Sorry for the inconvenience and a major wikilove to Marostegui who worked day and night in the last week and weekend to get everything back to normal.

Agreed, immense thanks to @Marostegui and also you, Ladsgroup!

I wanted to ask something I've genuinely been curious about for years -- since the wiki replicas are relied upon so heavily by the editing communities (and to some degree, readers), should we as an org treat their health with more scrutiny? This of course is insignificant compared to the production replicas going down, but nonetheless the effects were surely felt all across the movement (editathons don't have live tracking, stewards can't query for global contribs, important bots stop working, etc.). I.e. I wonder if there's any appetite to file an incident report, especially if we feel there are lessons to be learned to prevent similar future outages? I noticed other comparatively low-impact incidents have been documented, such as PAWS outages.

Also much thanks to especially @Marostegui from my side.

In T337446#8898661, @MusikAnimal wrote:

I wanted to ask something I've genuinely been curious about for years -- since the wiki replicas are relied upon so heavily by the editing communities (and to some degree, readers), should we as an org treat their health with more scrutiny? This of course is insignificant compared to the production replicas going down, but nonetheless the effects were surely felt all across the movement (editathons don't have live tracking, stewards can't query for global contribs, important bots stop working, etc.). I.e. I wonder if there's any appetite to file an incident report, especially if we feel there are lessons to be learned to prevent similar future outages? I noticed other comparatively low-impact incidents have been documented, such as PAWS outages.

I do think that at the very least we should have some way to recover from severe incidents like these a whole lot faster. Maybe having a delayed replica that we can use as a data source to speed up recovery, or something like a puppet run that preps a 'fresh' replica instance every single day, to make sure all the parts needed for that are known to be good ?

I think this one required too much learning on the job for something this critical, and the sole reason is that it luckily doesnt happen too often, but I think the whole process was too involved for everyone. It was affecting and disrupting too many ppl, and teams, which is I think the point of reference we should be using instead of "it's not production".

s1 looks to be down again. (Edit: Now tracked at T338172)

sd@tools-sgebastion-10:~$ sql enwiki
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

DB-wise things are good:

I think something is broken on network side of things. Please file a separate ticket.

Quiddity moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board.Jun 7 2023, 5:42 PM

Clement_Goubert mentioned this in T339243: ServiceLVS without monitor breaks spicerack.Jun 15 2023, 2:13 PM

fyi: I started a incident doc at https://wikitech.wikimedia.org/wiki/Incidents/2023-05-28_wikireplicas_lag because it was requested to have this incident in the next incident review ritual on Tuesday. I'll add some more information tomorrow and Monday, but feel free to add anything I missed.

Maintenance_bot removed a project: Patch-For-Review.Jun 15 2023, 4:10 PM

You might want to sync up with @KOfori because he's also started one IR. And I have captured a lot more detailed timeline, so maybe we need to merge both.

ClydeFranklin unsubscribed.Jun 19 2023, 7:19 PM

Maintenance_bot edited projects, added User-notice-archive; removed User-notice.Jun 29 2023, 7:31 PM

• nskaggs mentioned this in T337848: WMCS-roots wiki replica access.Aug 16 2023, 3:35 PM

• Marostegui mentioned this in T344608: WMCS-roots paging responsibilities .Aug 21 2023, 3:16 PM

Lemonaka unsubscribed.Sep 6 2023, 10:27 PM

There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924508/1/hieradata/common/service.yaml

Is it safe to assume we're back in a sane state and can turn this back on?

SammiBrie unsubscribed.Sep 15 2023, 7:52 PM

In T337446#9170734, @BBlack wrote:

There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924508/1/hieradata/common/service.yaml

Is it safe to assume we're back in a sane state and can turn this back on?

Let's go for it Brandon!

Change 924508 merged by BBlack:

[operations/puppet@production] wikireplicas: restore pybal monitoring

https://gerrit.wikimedia.org/r/924508

Mentioned in SAL (#wikimedia-operations) [2023-09-18T14:04:17Z] <bblack> lvs1020, lvs1018: restarting pybal to re-enable healthchecks for wikireplicas ( T337446 -> https://gerrit.wikimedia.org/r/924508 )

SWinxy unsubscribed.Oct 3 2023, 11:25 PM

taavi closed subtask T337721: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact as Invalid.Jan 22 2024, 9:03 AM

Rebuild sanitarium hosts
Closed, ResolvedPublic

Actions

Description

Details

Related Objects

Search...

Event Timeline

	F37094416: image.png
	Jun 5 2023, 5:49 PM

	F37084024: image.png
	May 30 2023, 10:22 AM

Rebuild sanitarium hosts Closed, ResolvedPublic Actions

Description

Details

Related Objects Search...

Event Timeline

Rebuild sanitarium hosts
Closed, ResolvedPublic

Actions

Related Objects

Search...