
Add non-exact title search to Special:Undelete and corresponding API
Closed, Resolved, Public

Description

Update
This is an instruction page for "undelete archiving" functionality, deployed at http://undeltest.wmflabs.org/.

What does it do?

It indexes deleted pages via Elasticsearch (the CirrusSearch extension). This complements the indexing already available for existing pages, so you can now search Special:Undelete for partial and inexact matches of a deleted page's title.
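
To make "partial and inexact" concrete, here is a minimal Python sketch of word-level fuzzy title matching by edit distance. It only illustrates the idea: the real matching is done by Elasticsearch through CirrusSearch, and the function names, the `max_edits` threshold and the sample titles are assumptions made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def fuzzy_title_search(query, titles, max_edits=1):
    """Return titles in which every query word is close to some title word."""
    query_words = query.lower().split()
    hits = []
    for title in titles:
        title_words = title.lower().split()
        if all(
            any(edit_distance(q, t) <= max_edits for t in title_words)
            for q in query_words
        ):
            hits.append(title)
    return hits


# "chease" is one substitution away from "cheese", and only part of the title.
print(fuzzy_title_search("chease", ["Mac and Cheese", "Main Page"]))
# ['Mac and Cheese']
```

A query word only has to be near one word of the title, which is why a single misspelled word can still find a three-word title.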

How can I test it?

Note: You may need to fill in a captcha when editing; use the word "mellon" for it.

  1. Go to http://undeltest.wmflabs.org/.
  2. Create a new page, for example "Mac and Cheese". Be creative and invent your own name, though: if everybody uses the same title, the feedback will not be diverse.
  3. Log in as Admin with the password described here: MediaWiki-Vagrant docs at number 7.
  4. Delete the page you created in (2).
  5. Go to http://undeltest.wmflabs.org/wiki/Special:Undelete and search for "'''chease'''" (note the partial and inexact match). Again, be creative with your own title but not ''too creative'': the search term should still be close to the title you are looking for, or it will not be found.
  6. Observe that the page deleted in (2) is in the list.
  7. Give us feedback!

It will look and function something like this:

undelete.png (519×847 px, 74 KB)

What unholy magic is this?

The patches are at:

https://gerrit.wikimedia.org/r/#/c/281078/ (core part)
https://gerrit.wikimedia.org/r/#/c/281077/ (CirrusSearch part)

You are welcome to review/comment.


Original description

As an administrator, I want to search the archive table for deleted pages whose title I don't exactly remember, or are similar in nature -- e.g. "Dr. John Smith", "John Anthony Smith" and "John Smith".

Discussion on en.wp's admin noticeboard.

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey
This proposal received 37 support votes, and was ranked #27 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Search#Provide_a_means_of_searching_for_deleted_pages

Event Timeline


GSoC '16 and Outreachy-12 have started. The scope of the project is not very detailed. Can it be pushed for the current rounds of GSoC/Outreachy and is anyone willing to mentor this project? Ideally the project should not take more than 2-3 weeks for a senior developer to complete.

@Sumit : I will be interested in participating on this project as an intern for the GSoC '16. How can I ping potential mentors on this task?

@Billghost: I'd recommend you ask this (and any other technical) question on irc://irc.freenode.net#wikimedia-dev (preferred) or https://lists.wikimedia.org/mailman/listinfo/wikitech-l .

Billghost updated the task description.

Hello, I am working on implementing the fallback case mentioned in (T109561#1940512). I've gone through https://www.mediawiki.org/wiki/Manual:Database_access and don't see any related documentation on writing full-text queries with the database abstraction layer. Any pointers will be welcome.

Thanks in advance

Hello! I would like to work on this project for this upcoming round of GSoC 2016. Is anyone willing to mentor?

Billghost raised the priority of this task from Low to Medium.Mar 5 2016, 9:55 PM
Billghost raised the priority of this task from Medium to Needs Triage.
In T109561#2088266, @Sumit wrote:

GSoC '16 and Outreachy-12 have started. The scope of the project is not very detailed. Can it be pushed for the current rounds of GSoC/Outreachy and is anyone willing to mentor this project? Ideally the project should not take more than 2-3 weeks for a senior developer to complete.

@Sumit This is definitely something a senior developer could complete in 2 weeks. Possibly it's not large enough for GSoC/OPW, but I'm bad at judging scope.

I have not found any way to implement full-text search directly in SpecialUndelete.php from the documentation at https://www.mediawiki.org/wiki/Manual:Database_access, so I decided to write it as a raw query. I would like someone to look at it on pastebin first, because it does not seem right to submit it as a patch yet. This is the code: http://pastebin.com/7UU5qySN

Hello! I would like to work on this project for this upcoming round of GSoC 2016. Is anyone willing to mentor?

@Billghost It's possible that this project might not get mentors for this round of GSoC, so you are also encouraged to look through other projects in the "featured" column, or those lacking a single mentor in the "missing-mentors" column, in Possible-Tech-Projects.

Using MATCH ... AGAINST is indeed probably the direction this would go, although we may want a separate archiveindex table instead of making archive a full-text-search table (with some sort of hook for extensions to override if they want to do something Lucene-y instead of MySQL FTS). I'm unclear on why you have the GROUP BY and the COUNT(*). There's also an SQL injection in this code if $prefix contains an apostrophe (use $dbr->addQuotes( $prefix ) instead of $prefix directly).

I should note that it is possible to use MATCH in WHERE clauses with the $dbr->select() method: pass the raw SQL condition as a numerically indexed element of the third argument ($conds).
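
The injection risk described above can be reproduced outside MediaWiki. The following is a minimal Python sketch using the standard `sqlite3` module as a stand-in for the PHP database layer; it is not the pastebin code and not MediaWiki's API. The in-memory table only mimics the `ar_namespace` and `ar_title` columns of the archive table, a `LIKE` prefix match stands in for `MATCH ... AGAINST`, and the function names and sample rows are assumptions for illustration.

```python
import sqlite3


def find_deleted_titles_unsafe(conn, prefix):
    """Builds SQL by string concatenation: breaks (or worse) on an apostrophe."""
    sql = "SELECT ar_title FROM archive WHERE ar_title LIKE '" + prefix + "%'"
    return [row[0] for row in conn.execute(sql)]


def find_deleted_titles(conn, prefix):
    """Same query, but the user input is passed as a bound parameter."""
    # Escape LIKE wildcards so the input is matched literally.
    escaped = (
        prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT ar_title FROM archive "
        "WHERE ar_title LIKE ? ESCAPE '\\' ORDER BY ar_title",
        (escaped + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (ar_namespace INTEGER, ar_title TEXT)")
conn.executemany(
    "INSERT INTO archive VALUES (?, ?)",
    [(0, "O'Brien_Smith"), (0, "Mac_and_Cheese"), (0, "O'Brien")],
)

# The bound-parameter version treats the apostrophe as ordinary data.
print(find_deleted_titles(conn, "O'Br"))

# The concatenated version lets the input rewrite the WHERE clause,
# so this "search" returns every row in the table.
print(len(find_deleted_titles_unsafe(conn, "' OR 1=1 --")))
```

In MediaWiki itself the equivalent protection is the quoting call mentioned above, `$dbr->addQuotes()`, applied to the value before it is placed in the condition string.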

For future reference, when pastebinning code, it's best to use unified diff format (e.g. the output of git show, git diff, git format-patch --stdout HEAD^ or diff -u originalfile.php newfile.php; see the man pages of these commands for details).

The specific query is of course just one small part of this bug

(And just to clarify so there is no confusion: I do not intend to mentor this project, or more generally be a mentor for gsoc)

The Wikimedia-Hackathon-2016 starts tomorrow and this task is featured at T119703. We want to use T130776: Wikimedia Hackathon 2016 Opening Session to promote these projects and help recruiting volunteers to work for them.

If this task is ripe for hackathon work, please follow these instructions. If it is not ready, remove it from T119703 in order to avoid volunteers' frustration. Thank you!

Change 281077 had a related patch set uploaded (by Smalyshev):
[WIP] Add deleted archive titles search

https://gerrit.wikimedia.org/r/281077

Change 281078 had a related patch set uploaded (by Smalyshev):
[WIP] Add deleted archive titles search

https://gerrit.wikimedia.org/r/281078

Change 281262 had a related patch set uploaded (by EBernhardson):
WIP index archived titles

https://gerrit.wikimedia.org/r/281262

Change 281262 had a related patch set uploaded (by Smalyshev):
WIP index archived titles

https://gerrit.wikimedia.org/r/281262

What is the status of this task after the Hackathon?

I have tried to summarize the progress on this task at https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/WIP_Wikimedia_Hackathon_2016_post#The_connection_with_the_Community_Wishlist. Is there any beautiful screenshot in Commons that we can reuse? Any place to test what was demoed in Jerusalem?

In T109561#2214001, @Qgil wrote:

What is the status of this task after the Hackathon?

Looking at the patch, this looks like it's awaiting some code review and discussion about the implementation. Pretty routine stuff, but since this isn't within our team's goals that'll happen whenever we have a little downtime.

I have tried to summarize the progress on this task at https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/WIP_Wikimedia_Hackathon_2016_post#The_connection_with_the_Community_Wishlist. Is there any beautiful screenshot in Commons that we can reuse? Any place to test what was demoed in Jerusalem?

That I can't answer. Perhaps @Smalyshev might have a screenshot from his dev instance?

No screenshots unfortunately. We had a demo server but I think I shut it down. I'll check and restore it if it's down.

Using great notes that @Smalyshev wrote, I've overhauled the description and added a screenshot. Here's that one, and another:

undelete.png (519×847 px, 74 KB)

undelete2.png (519×847 px, 75 KB)

It seems to work really well, at finding both matches within the title, and mis-spelled words.

@MER-C and anyone else who has experienced frustration with the missing feature, please could you give this a try, and add your feedback here?

I've posted on the Administrators' noticeboard.

Four things I've noticed from the screenshot:

  1. The description of the text field ("show pages starting with") is now incorrect.
  2. This feature and the old prefix index search should co-exist, like [[Special:Prefixindex]] and [[Special:Search]] do for live pages.
  3. It would be useful to have an API, but this is something that we can live without for now.
  4. Filtering by namespace is required (see below).

To get results in any other namespace except ns0, I need to search e.g. Talk:X where X is the search term. This is not intuitive:

  • Search term "Talk:Page"
    • Talk:Main Page -- FOUND
    • Page talk -- NOT FOUND
  • Search term "Talk Page"
    • Talk:Main Page -- NOT FOUND
    • Page talk -- FOUND
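
The requested behaviour, an explicit namespace filter instead of a prefix typed into the search term, could look roughly like the following Python sketch. It is purely illustrative and says nothing about how the patches work: the `NAMESPACES` map, both function names and the sample data are hypothetical, and the matching is a plain whole-word comparison rather than the fuzzy search being tested.

```python
# Hypothetical subset of namespace names mapped to MediaWiki namespace IDs.
NAMESPACES = {"talk": 1, "user": 2, "project": 4}


def split_namespace(term):
    """Split 'Talk:Page' into (1, 'Page'); terms without a known prefix get ns 0."""
    prefix, sep, rest = term.partition(":")
    key = prefix.strip().lower()
    if sep and key in NAMESPACES:
        return NAMESPACES[key], rest.strip()
    return 0, term.strip()


def search_deleted(term, archive, namespaces):
    """Return (namespace, title) pairs in the chosen namespaces containing every word."""
    words = term.lower().split()
    return [
        (ns, title)
        for ns, title in archive
        if ns in namespaces
        and all(word in title.lower().split() for word in words)
    ]


archive = [(1, "Main Page"), (0, "Page talk")]

# Prefix-in-term behaviour as reported above: the namespace depends on the colon.
print(split_namespace("Talk:Page"))   # (1, 'Page')
print(split_namespace("Talk Page"))   # (0, 'Talk Page')

# Explicit filter: one search term, any combination of namespaces.
print(search_deleted("page", archive, {0, 1}))
# [(1, 'Main Page'), (0, 'Page talk')]
```

Passing the namespaces as a set keeps the search term free of prefixes and allows several namespaces to be searched at once.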

Filtering by namespace and the ability to search more than one namespace are both required in production -- spammers and the like sometimes post their crap in some combination of mainspace, userspace, project space and draft space. Restricting myself to ns0 and testing on a real world scenario (https://en.wikipedia.org/wiki/Wikipedia:Sockpuppet_investigations/Alex9777777):

  • Search term "alex bugatti"
    • Alex Bugatti ( blogger) -- FOUND
    • Bugatti, Alex -- FOUND
    • AlexBugatti -- NOT FOUND
  • Search term "bugatti" -- as above, plus
    • Bugatti (Blogger) -- FOUND
  • Search term "Alex Pechkurov"
    • Alex Pechkurov -- FOUND
    • Alex Pechkurov ( blogger ) -- FOUND
    • Alex Pechkurow -- FOUND
    • Pechkurov Alex -- FOUND
    • Аlex Pechkurov -- FOUND (note weird A)
    • Аlex Pechkurov) -- FOUND (note weird A)
    • Alexey Pechurov -- NOT FOUND
    • Aleks Pechkurov -- NOT FOUND
  • Search term "Pechkurov" -- as above, plus
    • Alexey Pechurov -- FOUND
    • Aleks Pechkurov -- FOUND
    • Pechkurov -- FOUND
    • Pechkurov A.G -- FOUND
    • Печкуроў - Pechkurov -- NOT FOUND

13/15 pages found -- behaving as expected, but "AlexBugatti" should have been found by the first search and "Печкуроў - Pechkurov" by the last.

  • Search term "<script>alert('Boo!');</script>" -- PASS
  • Search term "' OR 1=1 --" -- PASS

Floquenbeam on AN said:

I can see how this could occasionally be pretty useful. I just tried it out for a couple of minutes, just one article in article space. Seemed to handle a reasonable number of typos; 1 (occasionally 2) typos per word, even when each word had a typo in a four word title. Seemed to handle only being given a very small portion of the article title well. I note that it handles typos like "herw" instead of "here" easily, but can't handle homonyms like "hear" instead of "here". Not complaining, as I have no idea how you'd go about doing that, but you wanted feedback so here's some feedback. But overall, yay.

Regarding progress on this, could this task use help from an Outreachy intern (Dec 6 to March 6)? Please note that applications are open until Oct 17.

Please let us know as early as possible.
Ideally, the task should take an experienced developer about 2-3 weeks to complete in order to qualify as an intern project.
If the scope is wider, it could be adjusted to fit the internship's needs :)

What's the status of this task? It looks like it was almost there at one point :)

@Smalyshev and @EBernhardson would you be interested in continuing to work on this?

I'm wondering what might help to get this done. And, our outreach programs GSOC/Outreachy are coming up as well!

We're still at the same point unfortunately - "almost there". I wonder if we really need to put it on the schedule to get it done, because otherwise it just keeps being postponed. @Deskana, what do you think?

We're still at the same point unfortunately - "almost there". I wonder if we really need to put it on the schedule to get it done, because otherwise it just keeps being postponed. @Deskana, what do you think?

This task isn't really within our current objectives or goals. However, the benefit to advanced users is clear, and since we're almost there, I think it makes sense to prioritise it and do that little bit more to get it into production.

Deskana raised the priority of this task from Low to Medium.Feb 10 2017, 10:11 PM
Deskana added a project: Discovery-Search.
Deskana moved this task from needs triage to Current work on the Discovery-Search board.

Notes from a brief discussion of this by @EBernhardson and @Smalyshev: Erik thinks that index management is the primary outstanding issue here. There's a single index for all wikis, so it's not exactly clear when it should be created. We may want to turn it into one index per wiki. That may cause some timeout issues, but it is the standard operating procedure, so it may be the best idea. Or maybe a special script?

Change 281078 merged by jenkins-bot:
[mediawiki/core@master] Add deleted archive titles search

https://gerrit.wikimedia.org/r/281078

Change 281077 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add deleted archive titles indexing and search

https://gerrit.wikimedia.org/r/281077

Change 347782 had a related patch set uploaded (by Smalyshev):
[operations/mediawiki-config@master] Enable deleted archive indexing & searching

https://gerrit.wikimedia.org/r/347782

Change 347782 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable deleted archive indexing & searching

https://gerrit.wikimedia.org/r/347782

Mentioned in SAL (#wikimedia-operations) [2017-04-11T23:56:46Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:347782|Enable deleted archive indexing & searching]] T109561 PART I (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2017-04-11T23:58:08Z] <thcipriani@tin> Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:347782|Enable deleted archive indexing & searching]] T109561 PART II (duration: 00m 45s)

debt subscribed.

Code is in production but not yet enabled, fixing related bugs first.

Change 281262 abandoned by EBernhardson:
WIP index archived titles

Reason:
this functionality has been merged in a separate patch

https://gerrit.wikimedia.org/r/281262