Page MenuHomePhabricator

Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need
Closed, ResolvedPublic21 Estimated Story Points

Description

The analytics team is going to use MediaWiki eventbus data to incrementally build up its edit dataset.
This task is about making sure the events contain all the needed data for the first version of the datasets.

Event Timeline

JAllemandou triaged this task as Medium priority.May 5 2016, 4:27 PM
JAllemandou set the point value for this task to 13.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
mforns changed the point value for this task from 13 to 21.May 10 2016, 4:08 PM

Change 288210 had a related patch set uploaded (by Joal):
Update existing schemas for analytics need

https://gerrit.wikimedia.org/r/288210

CR in progress, use code to talk, no need to task.

3 schemas being altered in https://gerrit.wikimedia.org/r/#/c/288210 require knowledge of the previous state of the object being altered.

Q: Can Mediawiki give us this information, and if not, can we modify Mediawiki to do so?

page_move

I think this one should be easy. The TitleMoveComplete hook that is used by EventBus extension provides us with both old and new Revision objects. I think all of the info needed here is in each of the Revision objects, so we should be able to just set this in the extension with no Mediawiki core modifications.

revision_visibility_change

I don't fully understand how the Mediawiki code for this one works. [[ https://github.com/wikimedia/mediawiki/blob/master/includes/revisiondelete/RevDelList.php#L106 | setVisibility() ]] in RevDelList.php is responsible for this. It seems to loop through a list of IDs, [[ https://github.com/wikimedia/mediawiki/blob/master/includes/revisiondelete/RevDelList.php#L144 |merge the $oldBits with the $bitPars ]] from the form to get the new visibility bitmap. This is then saved to the db.

What I don't understand is that $this->updateLog() is called after the for loop finishes, with both $oldBits, $newBits, and $idsForLog. If the for loop is merging an id's $oldBits with the ones from the form, how are $oldBits or $newBits going to mean anything real here? They are both set within the for loop. It seems to me that these will just have the values of the last id that was looked at in the for loop.

Anyway, if I am misunderstanding something, and both $oldBits and $newBits are useful after the for loop, and consistent across all of the ids that are being changed, then it should be possible to pass these to doPostCommitUpdates(), which eventually runs the ArticleRevisionVisibilitySet hook.

user_blocks_change

The hook we would be able to use now to get this info is [[ https://www.mediawiki.org/wiki/Manual:Hooks/BlockIpComplete | BlockIpComplete ]]. As is, this hook does not provide us with the previous state. We'd have to make modifications to SpecialBlock.php to get this. The code in [[ https://github.com/wikimedia/mediawiki/blob/master/includes/specials/SpecialBlock.php#L255 | maybeAlterFormDefaults() ]] does look up previously saved block info from the db before submitting the form. We could save the $block object returned by Block::newFromTarget( $this->target ); on the SpecialBlock object, and then pass it to the BlockIpComplete hook later.

@aaron can we find some mins to discuss above? I'm not sure if you are the right person, but if not you can you suggest someone?

Talked with @aaron about this in IRC.

Apparently I am not misunderstanding $oldBits and $newBits in that code. As they are passed to $this->updateLog, they are not useful. In order to get something useful to ArticleRevisionVisibilitySet, Aaron suggested we maintain a map of id -> ($oldBits, $newBits) as we iterate through the loop, and then either pass this to doPostCommitUpdates(), or save it on $this, and then pass it to the ArticleRevisionVisibilitySet hook inside of doPostCommitUpdates().

For all of these ideas, Aaron says

seems fine

Sounds like an emphatic go head if I've ever heard one! ;)

Hm, @Eevans and @mobrovac, got a Q.

Over at https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/EventBus.hooks.php#L247:

'hidden' => [
    'text' => $revision->isDeleted( Revision::DELETED_TEXT ),
    'sha1' => $revision->isDeleted( Revision::DELETED_TEXT ),
    ...

The text and sha1 fields in revision_visibility_set are both set to the same value. As far as I can tell, the form doesn't allow specifically setting the visibility of the sha1 of a revision. Is this useful at all? Can we remove sha1 from the new revision_visibility_change schema?

@Ottomata these two fields are redundant in a sense, but they exist because when you are viewing the history you can see the SHA1. Alas, there seems to be a rule that says that if the revision text itself is hidden, then its SHA1 should be too, which makes sense as it minimises information leakage potential. Whether these two fields are ever used in contents other than the presentation level, that I don't know.

What would be the benefit in removing it?

The sha1 field is very useful. I use it to detect reverts -- edits which fully restore a page to the previous version and thus have the same sha1. Also, +1 to hiding it when the text is suppressed.

If it has meaning, I'm all for keeping it. In the current EventBus hook though, it is set to the same value as Revision::DELETED_TEXT, so it will never have a different value. I haven't seen in the code or in a form a place where the sha1 is ever selectively hidden, so I think this field doesn't have any meaning.

Note that we are talking about a boolean that indicates whether or not the sha1 of a revision is visible, not the actual sha1 value of the revision.

If it has meaning, I'm all for keeping it. In the current EventBus hook though, it is set to the same value as Revision::DELETED_TEXT, so it will never have a different value. I haven't seen in the code or in a form a place where the sha1 is ever selectively hidden, so I think this field doesn't have any meaning.

Right, because the logic is: if the text can't be seen, the SHA1 shouldn't be visible either. So, no, it is not possible for it to have a different value. But, as @aaron points out, it can prove to be useful in different contexts even if semantically it provides no added value to the event itself.

Patch updated, see my comments in gerrit.

In EventBus meeting yesterday we decided to remove the sha1 visibility boolean. It doesn't provide any meaning beyond text. We can always add it later if it turns out we need it for some crazy reason.

Ah! We need another MW Core change. We'd like to know the number of revisions archived during a page_delete. We'll need to somehow pass in $dbw->affectedRows() from the archive insert to the ArticleDeleteComplete hook. Patch coming soon...

Change 298548 had a related patch set uploaded (by Ottomata):
Pass $archivedRevisionCount to the AritcleDeleteComplete hook

https://gerrit.wikimedia.org/r/298548

Ah, we will need a change for page_is_redirect too.

OK, another change to MW Core needed too. In order to get overwritten_redirect_page_id, we need to save this info before the overwritten redirect page is deleted during the page move. Patch coming...

Change 299008 had a related patch set uploaded (by Ottomata):
Clone WikiPage before delete and pass the cloned copy to ArticleDeleteComplete

https://gerrit.wikimedia.org/r/299008

There are a lot of ongoing discussions about proper schemas. Holding on the work for this while we discuss.

Change 301284 had a related patch set uploaded (by Ottomata):
Add page, revision, and user change schemas for more consistent change modeling

https://gerrit.wikimedia.org/r/301284

Hey yall, https://gerrit.wikimedia.org/r/301284 should be able capture all of the information that the Analytics team needs for the edit project, still contain everything Services is using from the existing schemas, and also be more consistent with how changes to Mediawiki entities are modeled.

I know we've been spinning our wheels on this for a while. After a couple of weeks of back and forth with Dan, Marcel and Joseph, we paused on the eventbus schema changes to let them better understand what is is they need from an event stream in order to reconstruct a full edit history. Dan and and I met up in person today and spent some hours discussing this. These schemas don't yet contain everything they will eventually need (e.g. page restrictions/protections), but should make it easier to add them in the future.

This will need more discussion and thought, and perhaps no one will like it!

@Halfak, as usual your input here will be crucial. I'm pretty sure these schemas will be able to cover most of the Mediawiki events in your proposal. (RevisionsDeleted might be hard.)

@mobrovac and @Pchelolo, since these are new schemas entirely, they are not 'backwards incompatible'. But, if we go with these, probably changeprop will want to switch so we can eventually stop producing the other events. Let's talk.

Hm.. I don't really understand the idea here. How are the new schemas better then extending the old ones? In my POV the new schemas are way less restrictive and we can validate way less stuff with them. For example for a revision_change event the revision_visibility field is optional, however if the action is 'restrict revision', that field should be required. You could easily find more examples when the new schema is validating less.

As I understand the idea here is to generalise the schemas and don't repeat yourself, right? Ideal solution would be to inherit one schema from another in that case: you have a generic mediawiki/event schema with meta property, then you inherit a mediawiki/revision_change from it, and inherit specific action schemas to add properties. Unfortunately this is only possible with v5 draft while we're still using v4. (see https://github.com/daveclayton/json-schema-validator/wiki/v5:-merge)

Its more than just DRYing them up. Repetition wasn't a consideration as Dan and I discussed this today. But, you raise a good point. A user_blocks_change event of user_change schema doesn't validate the existence of the visibility object.

I think the idea is to make change modeling consistent. We're trying to represent changes to Mediawiki entities as events. Do we want to have a new schema for every property that can change about a page? What is a user block? Is it a change to a user entity, or is it a change to the ipblocks table? By reducing the number of entities that can have changes, we simplify and make consistent the way changes are represented.

With these schemas, you can tell if a particular property has changed for a given event by examining the prior_state object. If that object has a field, then you can assume that its value has changed during this event, and that the new value is present in the corresponding top level field. It'd be great if there was a way to do a conditionally required field. That is, if prior_state.blocks is present, then top level blocks is required. But we don't have that. :/ I guess schema merging would be a way to do this too. Hm.

But ja, the validation would be less strict. We don't really lose backwards compatibility though, which is the reason we want validation in the first place. It is true that it is possible to produce a user_blocks_change without a blocks object. But if you want to include information about user blocks, you have to do it in the format specified in the user_change schema. It's not possible for someone to only partially emit a user_change event with a partial blocks object, since that subobject has required fields.

In this case we have tight control of the producer of these events (Mediawiki), so can be sure that we emit proper events to topics. The validation of these events is so we can be sure downstream consumers won't be affected by schema changes in those topics.

With these schemas, you can tell if a particular property has changed for a given event by examining the prior_state object. If that object has a field, then you can assume that its value has changed during this event, and that the new value is present in the corresponding top level field. It'd be great if there was a way to do a conditionally required field. That is, if prior_state.blocks is present, then top level blocks is required. But we don't have that. :/ I guess schema merging would be a way to do this too. Hm.

Since the merge semantics is not a part of JSON-schema v4 draft yet we could easily achieve the same result by copy-pasting pieces of the schema. So, what I propose is to have a strict set of required fields for every xxx_change, but actually define a separate schema for every type of a xxx_change event and copy-paste common fields there. So, the user_blocks_change event will have the same common fields with user_name_change event, but additionally it will have the required new_blocks property.

In that case you will have the same level of consistency for the model, but you will not loose on the 'validationability' of the schemas - they will still be very strict. And also once we're ready to switch the validator to v5 draft we will easily rewrite the schemas with $merge keyword.

In this case we have tight control of the producer of these events (Mediawiki), so can be sure that we emit proper events to topics. The validation of these events is so we can be sure downstream consumers won't be affected by schema changes in those topics.

I don't agree with this argument at all. You can easily take it to extreme and state that since we are controlling the producer we can skip validation completely.. The less strict and 'validatable' the schema is the closer you get to that extreme.

I don't agree with this argument at all. You can easily take it to extreme and state that since we are controlling the producer we can skip validation completely.. The less strict and 'validatable' the schema is the closer you get to that extreme.

Aren't you guys doing that by producing to kafka from change-prop? :D

But anyway, I like your idea! Dan and I passed on it while talking, but felt like just the change schemas were easier and cleaner to deal with. Lemme play with your idea and see what the schemas look like...

Ok, I like it. I'm mostly concerned about enforcing consistency and backwards compatibility, and I think the patch I just pushed makes it clear on how new future MW events should be structured. What do you think?

Change 301481 had a related patch set uploaded (by Ottomata):
[WIP] change schema titles to match their namespaced paths

https://gerrit.wikimedia.org/r/301481

In order to support nested entity based schemas without losing context, I've made a change to the way eventlogging IDs schemas. schema names now contain the full contextual path information, so 'mediawiki/page/create' is a full schema name.

Change 301487 had a related patch set uploaded (by Ottomata):
[WIP] ID schemas with path context, add /v1/schemas endpoint to eventlogging-service

https://gerrit.wikimedia.org/r/301487

Change 298548 merged by jenkins-bot:
Pass $archivedRevisionCount to the ArticleDeleteComplete hook

https://gerrit.wikimedia.org/r/298548

Change 299008 merged by jenkins-bot:
Clone WikiPage before delete and pass the cloned copy to ArticleDeleteComplete

https://gerrit.wikimedia.org/r/299008

Change 301284 merged by Ottomata:
Add page, revision, and user entity based schemas for more consistent change modeling

https://gerrit.wikimedia.org/r/301284

Change 301481 merged by Ottomata:
Change schema titles to match their namespaced paths

https://gerrit.wikimedia.org/r/301481

Change 301487 merged by Ottomata:
ID schemas with path context, add /v1/schemas endpoint to eventlogging-service

https://gerrit.wikimedia.org/r/301487