U+00AD SOFT HYPHEN shouldn't be allowed in wiki article titles
Open, Low, Public

Description

I spotted the word techno­determinism:

00000000  74 65 63 68 6e 6f c2 ad  64 65 74 65 72 6d 69 6e  |techno..determin|
00000010  69 73 6d 0a                                       |ism.|
00000014

in Wiktionary. Both the URL of the article and the h1 title on the page contain the soft hyphen, but it is not visible in either.

This shouldn't be allowed.
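To illustrate the problem, here is a minimal Python sketch that reproduces the byte sequence from the hex dump above and locates the invisible character. The helper name `find_soft_hyphens` is hypothetical and is not part of MediaWiki.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, encoded as c2 ad in UTF-8


def find_soft_hyphens(title: str) -> list[int]:
    """Return the character positions of every soft hyphen in the title."""
    return [i for i, ch in enumerate(title) if ch == SOFT_HYPHEN]


# The word from the report, written with an explicit escape so the
# soft hyphen is visible in the source code.
title = "techno\u00addeterminism"

# Hex bytes of the UTF-8 encoding; "c2 ad" appears after "techno".
print(title.encode("utf-8").hex(" "))

# Character positions of the soft hyphens.
print(find_soft_hyphens(title))  # [6]
```

The two strings look identical when rendered, which is why a check like this (or a hex dump) is needed to tell them apart.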

Event Timeline

Yurivict raised the priority of this task from to Needs Triage.
Yurivict updated the task description. (Show Details)
Yurivict subscribed.
Aklapper triaged this task as Low priority. Dec 19 2015, 6:21 PM
Aklapper added a project: MediaWiki-General.
Aklapper set Security to None.

I thought this was fixed in T5696...
(For future reference, please associate a project to tasks. Thanks!)

This may have been fixed for new articles, but I spotted this one, which was created on 2015-06-29.

Soft hyphens in titles are bad. I remove them regularly. Blocking or at least creating a warning would be useful.

On the other hand, soft hyphens are useful for titles with very long words. In T66528 I suggest allowing soft hyphens to be inserted into the display title.

Change 393381 had a related patch set uploaded (by Fomafix; owner: Fomafix):
[mediawiki/core@master] [WIP] Strip soft hyphens (U+00AD) from title

https://gerrit.wikimedia.org/r/393381

On enwiki there are currently the following titles containing soft hyphens. They are all redirects:

  1. https://en.wikipedia.org/wiki/Baltimore­Washington_Parkway?redirect=no
  2. https://en.wikipedia.org/wiki/Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act?redirect=no
  3. https://en.wikipedia.org/wiki/Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act_of_1980?redirect=no
  4. https://en.wikipedia.org/wiki/Immuni­sation?redirect=no
  5. https://en.wikipedia.org/wiki/India­-Myanmar_relations?redirect=no
  6. https://en.wikipedia.org/wiki/Kendall,_Tay­lor_&_Com­pany?redirect=no
  7. https://en.wikipedia.org/wiki/Kendall,_Tay­lor_and_Com­pany?redirect=no
  8. https://en.wikipedia.org/wiki/Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao?redirect=no
  9. https://en.wikipedia.org/wiki/Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao­melito­katakechy­meno­kichl­epi­kossypho­phatto­perister­alektryon­opte­kephallio­kigklo­peleio­lagoio­siraio­baphe­tragano­pterygon?redirect=no
  10. https://en.wikipedia.org/wiki/Whip­lash_Shaken_Infant_Syndrome?redirect=no
  11. https://en.wikipedia.org/wiki/­?redirect=no
  12. https://en.wikipedia.org/wiki/Œu­v­re?redirect=no

After deploying https://gerrit.wikimedia.org/r/393381 these titles become invalid and get renamed by maintenance/cleanupTitles.php. The redirects can already be deleted before deploying: they are superfluous because an article or a redirect with the same title without soft hyphens exists.
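As a rough illustration of the check described here, the following Python sketch strips soft hyphens from a title and tests whether a page with the cleaned title already exists. This is not the code from the Gerrit change or from cleanupTitles.php; the function names and the `existing_titles` set are assumptions made for the example.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(title: str) -> str:
    """Return the title with every soft hyphen removed."""
    return title.replace(SOFT_HYPHEN, "")


def is_superfluous_redirect(title: str, is_redirect: bool,
                            existing_titles: set[str]) -> bool:
    """True if the page is a redirect whose cleaned title already exists."""
    cleaned = strip_soft_hyphens(title)
    return is_redirect and cleaned != title and cleaned in existing_titles


# Hypothetical stand-in for a lookup against the wiki's page table.
existing_titles = {"Immunisation", "Œuvre"}

print(is_superfluous_redirect("Immuni\u00adsation", True, existing_titles))
print(is_superfluous_redirect("Baltimore\u00adWashington_Parkway", True,
                              existing_titles))
```

With this sample data the first call reports a superfluous redirect, while the second does not, because no page named `BaltimoreWashington_Parkway` is in the set.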

There are actually 30 such titles on enwiki, when you count other namespaces. Some are usernames, which makes this a bit awkward, but luckily it seems they are all permanently banned, so while we'll probably need to do something special about them, we won't annoy anyone when they are renamed.

I am currently running this query across all wikis to see what we should expect.

select
  page_namespace, page_title, page_is_redirect, replace(a.page_title,'­','') as page_title_new,
  (select count(*) from page b where b.page_namespace=a.page_namespace and b.page_title=page_title_new) as conflicts
from page a where page_title like '%­%'
page_namespace page_title page_is_redirect page_title_new conflicts
2 Impro­­v 0 Improv 1
2 Improv­ 0 Improv 1
2 Neutrality­ 0 Neutrality 1
3 Happy­Troll 0 HappyTroll 0
2 Erwin_Walsh­ 0 Erwin_Walsh 1
2 Erwin_Walsh­­ 0 Erwin_Walsh 1
3 Erwin_Walsh­­ 0 Erwin_Walsh 1
3 Marmotville­ 0 Marmotville 0
2 Uniting_Nations­ 0 Uniting_Nations 0
3 Uniting_Nations­ 0 Uniting_Nations 0
2 Love_Virus­ 0 Love_Virus 1
3 Love_Virus­ 0 Love_Virus 1
3 ­Friendly_AIDS 0 Friendly_AIDS 0
2 Nymph/­lol 0 Nymph/lol 1
4 Articles_for_deletion/Família_fotológica 0 Articles_for_deletion/FamÃlia_fotológica 0
4 Articles_for_deletion/Lip­smackin­thirst­quenchin­acetastin­motivatin­good­buzzin­cool­talkin­high­walkin­fast­livin­ever­givin­cool­fizzin 0 Articles_for_deletion/Lipsmackinthirstquenchinacetastinmotivatingoodbuzzincooltalkinhighwalkinfastlivinevergivincoolfizzin 0
0 ­ 1 0
0 India­-Myanmar_relations 1 India-Myanmar_relations 1
0 Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao 1 Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparao 0
0 Baltimore­Washington_Parkway 1 BaltimoreWashington_Parkway 0
0 Kendall,_Tay­lor_and_Com­pany 1 Kendall,_Taylor_and_Company 1
0 Kendall,_Tay­lor_&_Com­pany 1 Kendall,_Taylor_&_Company 1
0 Lopado­temacho­selacho­galeo­kranio­leipsano­drim­hypo­trimmato­silphio­parao­melito­katakechy­meno­kichl­epi­kossypho­phatto­perister­alektryon­opte­kephallio­kigklo­peleio­lagoio­siraio­baphe­tragano­pterygon 1 Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon 1
0 Œu­v­re 1 Œuvre 1
0 Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act_of_1980 1 Comprehensive_Environmental_Response,_Compensation,_and_Liability_Act_of_1980 1
0 Comprehensive_Environmental_Response,_Compen­sation,_and_Liability_Act 1 Comprehensive_Environmental_Response,_Compensation,_and_Liability_Act 1
3 HIPPOPOTO­MONSTRO­SESQUIPED­AL­IAN~enwiki 0 HIPPOPOTOMONSTROSESQUIPEDALIAN~enwiki 0
3 ­~enwiki 0 ~enwiki 0
0 Immuni­sation 1 Immunisation 1
0 Whip­lash_Shaken_Infant_Syndrome 1 Whiplash_Shaken_Infant_Syndrome 1

Results for all Wikimedia wikis. The page_title_new column has the title with the soft hyphen removed, and the conflicts column indicates whether a page with the "fixed" title already exists on the wiki.

There are 2322 such pages across all our wikis, including 913 on kuwiktionary and the rest spread across 153 other wikis.

798 of these have no conflicts and we could just rename to the version without a hyphen.

1275 are redirects with conflicts. Most of these are probably redirects to the soft-hyphen-less title, but we can't check that with just a SQL query.

249 are non-redirects that have a title conflict. Users will have to deal with these manually.
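The three groups above can be sketched as a small classification over the query's result rows. This is only an illustration in Python: the `Row` type, the `classify` function and the bucket names are hypothetical, and the sample rows are a handful taken from the enwiki results earlier in this task.

```python
from collections import Counter
from typing import NamedTuple


class Row(NamedTuple):
    """One result row of the query: namespace, title, redirect flag, conflicts."""
    namespace: int
    title: str
    is_redirect: bool
    conflicts: int


def classify(row: Row) -> str:
    """Sort a row into one of the three groups described above."""
    if row.conflicts == 0:
        return "rename"                  # cleaned title is free, just rename
    if row.is_redirect:
        return "redirect_with_conflict"  # probably a superfluous redirect
    return "manual"                      # non-redirect with a title conflict


# Sample rows from the enwiki results, with soft hyphens written as escapes.
rows = [
    Row(3, "Happy\u00adTroll", False, 0),
    Row(0, "Baltimore\u00adWashington_Parkway", True, 0),
    Row(0, "Immuni\u00adsation", True, 1),
    Row(2, "Improv\u00ad", False, 1),
]

counts = Counter(classify(r) for r in rows)
print(counts)

# Sanity check on the totals reported for all wikis.
assert 798 + 1275 + 249 == 2322
```

Note that the "rename" group includes redirects without a conflict, which is how the three reported counts add up to the total of 2322.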

Change 393381 merged by jenkins-bot:
[mediawiki/core@master] Strip soft hyphens (U+00AD) from title

https://gerrit.wikimedia.org/r/393381

This is done for MediaWiki. I filed T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages about running cleanupTitles.php on Wikimedia wikis.

Unfortunately we reverted the change due to problems with WMF deployment :( Turns out the maintenance script is not great. T195546

Change 493162 had a related patch set uploaded (by Fomafix; owner: Fomafix):
[mediawiki/core@master] Strip characters from title

https://gerrit.wikimedia.org/r/493162

Change 582468 had a related patch set uploaded (by Fomafix; owner: Fomafix):
[mediawiki/services/parsoid@master] Strip additional special Unicode characters from title

https://gerrit.wikimedia.org/r/582468

@Fomafix: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!