
RFC: Serve Main Page of Wikimedia wikis from a consistent URL
Open, Medium, Public

Description

  • Affected components: Wikimedia site configuration.
  • Engineer for implementation: @Krinkle, @Ladsgroup.
  • Code steward: Core Platform Team.

Motivation

Current issues:

  • Accessing wiki projects by domain results in a redirect. (Subpar performance)
  • Address bars, urls and search results for our projects prominently expose the inconsistent naming conventions of each wiki. (Subpar user experience)
  • SEO. "Avoid Landing Page Redirects", Google PageSpeed, https://developers.google.com/speed/docs/insights/AvoidRedirects.
  • Difficulties with tooling. Performance tests are difficult to write in a way that targets a normal view of a main page without a redirect, due to the url not being deterministic or consistent. (Current workarounds: Using a ?whatever query string, which will serve the Main Page as the default title without redirect).
  • Monitoring such as "Is the Main Page for all projects up and responding with content?" is not trivial, as simplistic tools either do not follow redirects or count a 301 as success, even if the actual page, at an unpredictable url, is returning an error. On some wikis, the Main_Page redirect is sometimes deleted, leading to false alarms.
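
As a rough illustration of the monitoring problem, the sketch below contrasts a naive check that accepts any 2xx or 3xx status with one that follows redirects to the final response. The `fetch` callable, the `FAKE_SITE` table and the example paths are hypothetical stand-ins so that the example runs without network access; they are not taken from any existing monitoring tool.

```python
from typing import Callable, Optional, Tuple

# (status code, Location header or None)
Response = Tuple[int, Optional[str]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def naive_check(fetch: Callable[[str], Response], url: str) -> bool:
    """Treat any 2xx or 3xx answer as healthy, without following redirects."""
    status, _ = fetch(url)
    return 200 <= status < 400


def following_check(fetch: Callable[[str], Response], url: str, max_hops: int = 5) -> bool:
    """Follow redirects and report healthy only if the final response is 2xx."""
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            url = location
            continue
        return 200 <= status < 300
    return False  # gave up: too many redirects


# Hypothetical wiki: the domain root redirects to a main page that is failing.
FAKE_SITE = {
    "/": (301, "/wiki/Hauptseite"),
    "/wiki/Hauptseite": (503, None),
}


def fake_fetch(url: str) -> Response:
    return FAKE_SITE.get(url, (404, None))


print(naive_check(fake_fetch, "/"))      # True: the 301 hides the error
print(following_check(fake_fetch, "/"))  # False: the final response is a 503
```

With the main page served directly from the domain root, the simpler check would be sufficient, since there is no redirect to hide a failure behind.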

Requirement

Serve the main page of WMF wikis from a consistent URL, one that does not vary by wiki configuration, site language, or local interface message overrides.


Exploration

Stakeholders
  • Traffic team. (assess potential routing impact)
  • Reading Web team. (about SEO, and reader user experience)
  • Performance team. (believed to improve performance)
  • Core Platform Team. (core behaviour being utilised that previously has only been used by low-traffic wikis and third-parties)
  • Wikimedia communities via Tech News and Community Engagement team. (identify potential impact on technical workflows we may not be aware of, so that we may help accommodate those)

Status quo

The URL of a WMF wiki's main page varies by wiki configuration (site language, or hooks) and by interface message overrides local to the wiki. For example:

The following are HTTP 301 redirects to https://en.wikipedia.org/wiki/Main_Page:

The following are HTTP 301 redirects to https://fixcopyright.wikimedia.org/:

Examples of affected links:

  • Portals, such as https://www.wikipedia.org and https://www.wikimedia.org.
  • Language links in the sidebar of the main pages themselves.
  • Interwiki links, such as [[mw:]], or [[wikitech:]].
  • Browsing directly by entering the hostname of a wiki project.
  • Browsing by changing homepage address of one project into another (usually leads to a 404 Not Found, as "Wikipedia:Hauptseite" would not exist on nl.wikipedia.org).

Performance data

From Navigation Timing, over February 2019:

stat1007/hive
-- sampled views to enwiki/Main_Page
SELECT COUNT(*),SUM(event.redirecting) FROM event.NavigationTiming WHERE year=2019 AND month=2 AND wiki="enwiki" AND event.revId=870437359 AND event.action="view" AND event.isOversample=false;
-- sampled views to enwiki/Main_Page that involved a redirect
SELECT COUNT(*),SUM(event.redirecting) FROM event.NavigationTiming WHERE year=2019 AND month=2 AND wiki="enwiki" AND event.revId=870437359 AND event.action="view" AND event.isOversample=false AND event.redirecting != 0;
Sampled views | Sampled views (redirected) | Time spent redirecting
31,734        | 9,703                      | 840.039 s

This is from a 1:1000 sampling. This means that in February 2019, the Main Page had an estimated 31.7 million views from Grade A web browsers that completed their page load. Of these, about 9.7 million page views (30.6%) experienced a redirect. They cumulatively spent about 840,000 seconds, or roughly 233 hours (almost 10 days), waiting for a redirect (about 0.09 s each, on average).
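
The scaling from sampled to estimated totals can be reproduced from the three sampled figures above. This is only a back-of-the-envelope check, and the variable names are illustrative:

```python
SAMPLING_FACTOR = 1000             # 1:1000 sampling
sampled_views = 31_734             # sampled Main Page views
sampled_redirected = 9_703         # sampled views that involved a redirect
sampled_redirect_seconds = 840.039 # summed redirect time in the sample

estimated_views = sampled_views * SAMPLING_FACTOR
estimated_redirected = sampled_redirected * SAMPLING_FACTOR
redirect_share = sampled_redirected / sampled_views
mean_redirect_seconds = sampled_redirect_seconds / sampled_redirected
total_redirect_hours = sampled_redirect_seconds * SAMPLING_FACTOR / 3600

print(f"estimated views:      {estimated_views:,}")
print(f"estimated redirected: {estimated_redirected:,}")
print(f"share redirected:     {redirect_share:.1%}")
print(f"mean redirect time:   {mean_redirect_seconds:.3f} s")
print(f"total redirect time:  {total_redirect_hours:.1f} hours")
```

The cumulative figure comes out at a little over 233 hours, i.e. just under ten days of waiting across all estimated redirected views.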

Proposal

I'd like us to consider changing the canonical URL of the main page of Wikimedia wikis to be the domain root. This means https://www.wikidata.org/ would serve what we currently see at https://www.wikidata.org/wiki/Wikidata:Main_Page, for example.

MediaWiki provides a hook that allows the canonical url for a given title to be customised. This has been in use at translatewiki.net since 2015 (written about on Niklas' blog, source code), and was also used at WMF for the Fix Copyright campaign in 2018.

Once configured, all canonical access to the main page is automatically reflected accordingly by MediaWiki.

  • The link to the main page in the sidebar and on the logo will point to this.
  • When browsing from the talk page, what links here, history page, contributions, search results, it points to the canonical url.
  • When creating an internal link to it in wikitext like [[Main_Page]] this results in the correct HTML for an anchor link to the canonical url (e.g. <a href="/" title="Main Page">Main Page</a>).
  • When editing the main page, the purges sent to the CDN layer will be for the canonical url, as expected.
  • When manually browsing to /wiki/Main_Page, MediaWiki will serve the page as usual without redirect, with <link rel=canonical> set to the canonical url, the same way we do for other non-canonical urls and article redirects (per T120085#5345448). An earlier version of this proposal instead had MediaWiki's router normalise the request to the canonical url with an HTTP 301 redirect.
  • Configuration variables in JavaScript like wgIsMainPage and server-side checks like Title::isMainPage() all work as expected.
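
The behaviour in the list above can be modelled in a few lines. This is a simplified Python sketch, not MediaWiki code: `local_url`, `canonical_link`, `ARTICLE_PATH` and `MAIN_PAGE` are hypothetical names, and real MediaWiki URL generation handles many more cases (query-string URLs via index.php, encoding rules, language variants).

```python
from urllib.parse import quote

ARTICLE_PATH = "/wiki/"   # assumed article path prefix
MAIN_PAGE = "Main_Page"   # assumed main page title for this example


def local_url(title: str, query: str = "", main_page_is_domain_root: bool = True) -> str:
    """Return a local URL for a title, mapping the plain main page view to the domain root."""
    title = title.replace(" ", "_")
    if main_page_is_domain_root and title == MAIN_PAGE and query == "":
        return "/"
    url = ARTICLE_PATH + quote(title, safe=":/")
    return f"{url}?{query}" if query else url


def canonical_link(title: str) -> str:
    """The <link> element advertised on a view of the page, also on non-canonical URLs."""
    return f'<link rel="canonical" href="{local_url(title)}">'


print(local_url("Main Page"))                    # /
print(local_url("Main_Page", "action=history"))  # /wiki/Main_Page?action=history
print(local_url("Talk:Main_Page"))               # /wiki/Talk:Main_Page
print(canonical_link("Main_Page"))               # <link rel="canonical" href="/">
```

Only the plain view of the main page maps to the root; any URL carrying a query string keeps its usual form.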

Event Timeline

In T120085#5517249, @Izno wrote:

This one will probably require a user notice before WMF rollout [..]

This is still an open RFC. Consultation with the community will be part of this RFC, including asking for input and feedback through Tech News before anything is approved, implemented or rolled out.

In T120085#5517249, @Izno wrote:

It looks like the above patch just adds the config option, so that it can be an option for 1.34 users. Does the old way to do this need deprecation notices?

There is no old way.

The old way was basically to write this manually through custom PHP code. This (experimental) configuration variable provides that same code as part of core now.

This is still an open RFC. [snip]

Totally missed this was in the RFCs bucket. (Probably just used to seeing RFC in the name of the task as with recent RFCs.)

Nikerabbit renamed this task from Serve Main Page of WMF wikis from a consistent URL to RFC: Serve Main Page of WMF wikis from a consistent URL. Sep 24 2019, 6:41 AM
awight renamed this task from RFC: Serve Main Page of WMF wikis from a consistent URL to RFC: Serve Main Page of Wikimedia wikis from a consistent URL. Sep 27 2019, 6:52 AM

I like the end result here, and I don't think it's problematic from the Traffic perspective in the long view, but I think the initial rollout isn't so trivial: [redirect loops]

We talked about this with @Tgr in the hackathon and one easy way to bypass the issue of the redirect loop is to serve the main page through both endpoints […]

Thanks, excellent point. I've adjusted the proposal to not redirect the old URL, but to keep it as-is, the same way we do with article redirects and other non-canonical URL representations. E.g. serve normally as 200 OK, but with <link rel=canonical> set to the canonical url, and with JS-rewrite of the address bar to the canonical variant as well.

From task description

Stakeholders:

  • Traffic team. (assess potential routing impact)
  • Reading Web team. (about SEO, and reader user experience)
  • Performance team. (believed to improve performance)
  • Core Platform Team. (core behaviour being utilised that previously has only been used by low-traffic wikis and third-parties)
  • Wikimedia communities via Tech News and Community Engagement team. (identify potential impact on technical workflows we may not be aware of, so that we may help accommodate those)

@BBlack has commented from Traffic. They raised no blocking concerns, and their feedback has resulted in a change to the proposal to not redirect the old URL (T120085#5345448, T120085#5539830).

Myself on behalf of Performance have already provided data to support the change and have no concerns either.

I've reached out to Community Engagement by e-mail to ask for their feedback and outreach.

I've tagged Reading-Web and CPT on the task here for their feedback from product and technical perspective.

How would you phrase this for inclusion in Tech News?

I've tagged Reading-Web and CPT on the task here for their feedback from product and technical perspective.

If I understand correctly we'll be choosing en.m.wikipedia.org as the canonical link for the main page. Given this is the most likely URL a user will enter, and currently it redirects to /wiki/Main_Page, it seems like this would reduce the amount of indirection for visitors to the main page, which seems a good thing in terms of experience. In terms of SEO, I'm not sure how we could measure any impact here technically and whether it's worth it. Did you have any specific thoughts/concerns?

I've tagged Reading-Web and CPT on the task here for their feedback from product and technical perspective.

If I understand correctly we'll be choosing en.m.wikipedia.org as the canonical link for the main page.

@Jdlrobson clarifying question: en.m.wikipedia.org as the canonical link for the main page *on mobile*, correct? I'm pretty sure that's what you meant, but I don't want to assume.

Yup. en.wikipedia.org for desktop (which redirects to mobile en.m.wikipedia.org).

Change 540678 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Set $wgMainPageIsDomainRoot true for fixcopyrightwiki

https://gerrit.wikimedia.org/r/540678

Change 540679 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Get rid of main page hack for fixcopyrightwiki

https://gerrit.wikimedia.org/r/540679

Krinkle added subscribers: Esanders, ssastry.

@ssastry, @Esanders Hi - could you review this RFC for potential impact on Parsoid and VisualEditor?

The current proposal would make the canonical url for [[Main Page]] on most wikis result in <a href="/"> instead of <a href="/wiki/Main_Page">. I imagine this might impact Parsoid and/or VisualEditor if there are assumptions made about being able to reverse-engineer urls based on wgArticlePath (instead of the API deciding what urls are). Note that compliance is entirely optional, in that the old URL will continue to work, and it will continue to be valid to create URLs based on wgArticlePath. What changes is that canonical URLs created elsewhere (e.g. by the API) may be different for the Main Page.

So the question is whether it would be a problem if API responses start advertising this url alongside page titles. For example, from prefix search.

From Parsoid's perspective:

  1. is $wgMainPageIsDomainRoot available in SiteInfo? Parsoid/JS and the non-integrated mode of Parsoid/PHP would need this.
  2. currently [[Main Page]] yields <a href="./Main Page">. It sounds like that's fine for initial deployment, but if we eventually want this to yield <a href="../"> or <a href="/"> it complicates the task of recreating the title of a link from the A tag. We'd have to audit all the places that do that and ensure they all handle this case correctly, and only after doing so we could deploy a change that uses $wgMainPageIsDomainRoot from site info to emit <a href="../"> in the appropriate circumstances.
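
To make the second concern concrete: any code that recovers a page title from a link's href would need a special case for the domain root. A hypothetical sketch follows; the function name, the accepted href forms and the `main_page` parameter are assumptions for illustration, not Parsoid's actual implementation.

```python
from urllib.parse import unquote


def title_from_href(href: str, main_page: str = "Main_Page") -> str:
    """Recover a page title from a relative href of the kind described above."""
    # Special case: the domain root stands for the main page.
    if href in ("/", "../"):
        return main_page
    # Usual case: "./Title", possibly percent-encoded or containing spaces.
    if href.startswith("./"):
        return unquote(href[2:]).replace(" ", "_")
    raise ValueError(f"unrecognised href form: {href!r}")


print(title_from_href("./Main Page"))  # Main_Page
print(title_from_href("../"))          # Main_Page
print(title_from_href("./Caf%C3%A9"))  # Café
```

Every call site that currently assumes the "./Title" form would need the equivalent of the first branch, which is the audit described in the second point.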

Something like this for Tech News? (Plus links and clearer handling of URLs.)

The URL of the main page of the Wikimedia wikis could be changed. This is because the way it is done now leads to several problems. For example https://www.wikidata.org/wiki/Wikidata:Main_Page would be https://www.wikidata.org instead. You can tell the developers if this would cause problems for your wiki.

This has now been added to Tech News.

In T120085#5545615, @Johan wrote:

[...] For example https://www.wikidata.org/wiki/Wikidata:Main_Page would be https://www.wikidata.org instead. You can tell the developers if this would cause problems for your wiki.

@Johan: Technically, given the discussion above, wouldn't it be more accurate to say

would be https://www.wikidata.org/ instead.

with a trailing slash?

Change 540971 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Export $wgMainPageIsDomainRoot in siteinfo API

https://gerrit.wikimedia.org/r/540971

In T120085#5548420, @Dcljr wrote:
In T120085#5545615, @Johan wrote:

[...] For example https://www.wikidata.org/wiki/Wikidata:Main_Page would be https://www.wikidata.org instead. You can tell the developers if this would cause problems for your wiki.

@Johan: Technically, given the discussion above, wouldn't it be more accurate to say

would be https://www.wikidata.org/ instead.

with a trailing slash?

Not really. https://www.wikidata.org/ and https://www.wikidata.org are two different ways of writing the same URL. I think most web-browsers will normalize to https://www.wikidata.org without the trailing /. Firefox and Chrome do on www.wikipedia.org

Not really. https://www.wikidata.org/ and https://www.wikidata.org are two different ways of writing the same URL.

Yes, but what is the software actually doing?

In T120085#5548824, @Dcljr wrote:

Not really. https://www.wikidata.org/ and https://www.wikidata.org are two different ways of writing the same URL.

Yes, but what is the software actually doing?

I'm not sure what you mean. I don't think it's possible to distinguish between these two URLs in MediaWiki. They are both different ways to say the same thing: the path part of the URL is empty. I'm not sure about HTTP/2 off the top of my head (I expect it's the same), but in HTTP/1.1 it's impossible to distinguish between the two from MediaWiki, as both result in GET / HTTP/1.1. (I.e. when you visit a website, the URL is split at the / that follows the hostname, and the part before that / is transmitted separately from the part after it and before the #.)

So it's basically up to the browser what to display. Firefox and Chrome seem to choose to remove the trailing /, as can be seen at https://www.wikipedia.org/ or https://translatewiki.net/
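
A quick way to see this is to compute what a client would put on the HTTP/1.1 request line for each spelling. The helper name `request_target` is made up for this sketch; `urlsplit` is from Python's standard library.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the origin-form request target an HTTP/1.1 client sends for this URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path is sent as "/"
    # The fragment (after "#") is never sent to the server.
    return f"{path}?{parts.query}" if parts.query else path


for url in ("https://www.wikidata.org", "https://www.wikidata.org/"):
    print(f"{url!r:30} -> GET {request_target(url)} HTTP/1.1")
```

Both spellings produce the same request line, so the server has no way of telling which one the user typed.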

In T120085#5548824, @Dcljr wrote:

Yes, but what is the software actually doing?

I'm not sure what you mean.

Change 520139, merged on Sep 23, adds this to MediaWiki.php:

if ( $this->config->get( 'MainPageIsDomainRoot' ) && $request->getRequestURL() === '/' ) {
  return false;
}

and this to Title.php:

if ( $wgMainPageIsDomainRoot && $this->isMainPage() && $query === '' ) {
  return '/';
}

I am not a developer, but this looks to me like the software is using the single slash to indicate the root document.

Yes, but when you visit the site it will get removed (in the interface). To put it another way, the / is used behind the scenes, but anything the user sees will not use the /.

How will this work for projects with a different main page for each language, eg Commons? The main page depends on the user's interface language. Normally, if you're a French-language user and you navigate to https://commons.wikimedia.org/ , you get redirected to https://commons.wikimedia.org/wiki/Accueil . Will https://commons.wikimedia.org/ still show the correct content to each user?

That's a good point. I think most of this will work fine, however we may have cache pollution issues if someone with their language set to 'fr' writes [[Accueil]] on a page (If the page doesn't have an {{int: on it or otherwise is marked as varying by user language.)

This should definitely be tested with $wgForceUIMsgAsContentMsg = ['mainpage']; set

Very valid point. I personally would be okay with not turning on the config on wikis that set $wgForceUIMsgAsContentMsg = ['mainpage']; (like commons, wikidata, etc.)

Change 540678 merged by jenkins-bot:
[operations/mediawiki-config@master] Set $wgMainPageIsDomainRoot true for fixcopyrightwiki

https://gerrit.wikimedia.org/r/540678

Change 540679 merged by jenkins-bot:
[operations/mediawiki-config@master] Get rid of main page hack for fixcopyrightwiki

https://gerrit.wikimedia.org/r/540679

Mentioned in SAL (#wikimedia-operations) [2019-10-07T11:42:45Z] <lucaswerkmeister-wmde@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:540678|Set $wgMainPageIsDomainRoot true for fixcopyrightwiki (T120085)]] (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2019-10-07T11:44:18Z] <lucaswerkmeister-wmde@deploy1001> Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:540679|Get rid of main page hack for fixcopyrightwiki (T120085)]] (duration: 00m 52s)

Just want to emphasise that this config variable, in its current state, redirects /wiki/Main_Page to / and will cause redirect loops if we just turn it on. We need to make the config not redirect to the canonical place before moving forward.

MediaWiki does not HTTP redirect (at least not in translatewiki.net). Wikimedia has rewrites outside MediaWiki for this, right?

Yes, I think they are Apache redirects (see T120085#5345448). Maybe we can make the redirects internal (turn them into rewrites). @BBlack knows better and explained some details in T120085#5345448, but maybe he can explain more.
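
The loop being discussed can be shown with a toy model. Both rule tables below are assumptions based on the comments above (a web-server rule for the domain root, and the redirect that the config variable is said to issue for the old title); the `follow` helper is hypothetical.

```python
from typing import Dict, List, Optional


def follow(rules: Dict[str, str], start: str, max_hops: int = 10) -> Optional[List[str]]:
    """Follow redirect rules from `start`; return the chain of URLs, or None on a loop."""
    chain = [start]
    while chain[-1] in rules:
        nxt = rules[chain[-1]]
        if nxt in chain or len(chain) > max_hops:
            return None
        chain.append(nxt)
    return chain


# Assumed status quo: a web-server rule sends the domain root to the main page title.
webserver_rules = {"/": "/wiki/Main_Page"}
# Assumed behaviour as described above: the old title is redirected to the root.
mediawiki_rules = {"/wiki/Main_Page": "/"}

print(follow(webserver_rules, "/"))                         # ['/', '/wiki/Main_Page']
print(follow({**webserver_rules, **mediawiki_rules}, "/"))  # None: redirect loop
```

Dropping either rule breaks the loop, which is why serving the old URL as a normal 200 response, or turning the web-server redirect into an internal rewrite, would both avoid it.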

Very valid point. I personally would be okay with not turning on the config on wikis that set $wgForceUIMsgAsContentMsg = ['mainpage']; (like commons, wikidata, etc.)

To make it even more complicated, Wikidata redirects (or at least wants to redirect) all main pages to https://www.wikidata.org/wiki/Wikidata:Main_Page using wiki redirects (see https://www.wikidata.org/w/index.php?title=Wikidata:Hauptseite&action=edit for example) and uses in-page i18n, so this change would be safe to implement on WD (but not on Commons, MediaWiki.org etc.).

We're in peak fundraising season now, and I'm worried this might affect links to https://donate.wikimedia.org.

@DStrine Can someone from Fundraising Tech investigate this to see if it would cause any problems on donate or payments?

@Pcoombe I don't think this will go live before January, but if it helps, let's just exclude any and all changes from donatewiki!

I'd still very much like feedback from FR-Tech as the unique set up of donatewiki could expose additional compatibility concerns we need to consider, but I'd be fine with hearing those (and incorporating them) after January, possibly after it has gone live on other wikis already. We can keep iterating. It's also equally likely that in January we'll find there are no concerns unique to donatewiki, in which case we'll flip the switch there later at that time.

So if worried about prioritisation, feel free to push this back within FR-Tech :)

Change 540971 merged by jenkins-bot:
[mediawiki/core@master] Export $wgMainPageIsDomainRoot in siteinfo API

https://gerrit.wikimedia.org/r/540971

So the question is whether it would be a problem if API responses start advertising this url alongside page titles. For example, from prefix search.

It would require fixes to a fair few areas of code. It would be easier if we just kept it as "Main Page" for the purposes of Parsoid, VE and API results.

Code steward of the core feature TBD. It's a pretty minor feature, but worth double-checking that it's a feature we're okay with keeping long-term and that there's a fallback to address issues in case I'm unavailable (assuming Perf won't own the core feature).

I'll check in with CPT on this before moving the RFC forward.

Krinkle moved this task from P2: Resource to P3: Explore on the TechCom-RFC board.

Roadmap alignment and any stewardship needs from CPT confirmed by Cindy.

Removing my team, I don't think there's anything for us here?

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!