[RFC] Balanced templates
Open, Stalled, Medium, Public

Description

(These were originally called "hygienic templates", which got confused with hygienic template arguments. The latter are now called "heredoc" arguments, and "hygiene" is no more.)

As described in my Wikimania 2015 talk (starting at slide 27), there are a number of reasons to mark certain templates as "balanced". Foremost among them: to allow high-performance incremental update of page contents after templates are modified, and to allow safe editing of template uses using HTML-based tools such as Visual Editor or jsapi. More discussion of motivation is at T130567 (and covered in RFC meeting E159).

"Balance" means (roughly) that the output of the template is a complete DocumentFragment: every open tag is closed. Furthermore, there are some restrictions on context to ensure there are no open tags which the template will implicitly close, nor nodes which the HTML adoption agency algorithm will reorder. (More precise details below.)

Template balance is enforced: tags are closed or removed as necessary to ensure that the output satisfies the necessary constraints, regardless of the values of the template arguments or how child templates are expanded.

Properly balanced template inclusion allows efficient update of articles by doing substring substitution for template bodies, without having to expand all templates to wikitext and reparse from scratch. It also guarantees that the template (and surrounding content) will be editable in Visual Editor; mistakes in template arguments won't "leak out" and prevent editing of surrounding content.

Wikitext Syntax
After some bikeshedding, we decided that balance should be an "opt-in" property of templates, indicated by adding a {{#balance:TYPE}} marker to the content. This syntax leverages the existing "parser function" syntax, and TYPE names the kind of balance requested.

We propose three forms of balance, of which the first and perhaps the second are likely to be implemented initially. Other balancing modes would provide safety in different HTML-parsing contexts, and may be added in the future if there is need.

  1. {{#balance:block}} (informally) would close any open <p>/<a>/<h*>/<table> tags in the article preceding the template insertion site. In the template content all tags left open at the end will be closed, but there is no other restriction. This is similar to how block-level tags work in HTML 5. This is useful for navboxes and other "block" content.
    • Formally: in context preceding template, close p, a, table, h[1-6], style, script, xmp, iframe, noembed, noframes, plaintext, noscript, textarea, select, template, dd, dt, and pre. (Alternatively, close all but div and section.) After template, close all open tags.
  2. {{#balance:inline}} would only allow inline (i.e. phrasing) content and silently delete block-level tags seen in the content. But because of this, it can be used inside a block-level context without closing active <p>/<a>/<h*> in the article (as {{#balance:block}} would). This is useful for simple plain text templates, e.g. age calculation.
    • Formally: In context preceding template, close style, script, xmp, iframe, noembed, noframes, plaintext, noscript, textarea, table, ruby, select, and template. These are the tags which change tokenizer or parser modes. (ruby affects subsequent parsing of rb/rtc/rp/rt.) Wrap the template with <span>...</span>, in order to trigger AFE reconstruction. Inside the template, strip address, article, aside, blockquote, center, details, dialog, dir, div, dl, fieldset, figcaption, figure, footer, header, hgroup, main, menu, nav, ol, p, section, summary, ul, h[1-6], pre, listing, form, li, dd, dt, plaintext, button, a, nobr, hr, isindex, xmp, optgroup, and option. These are the elements which can trigger a close tag to be emitted in body parsing mode.
    • To see the need for <span> wrapping, consider <div><b><i>foo</b>{{template}}</div> where the template is <meta>bar<b>bat</b>. The output with <span> wrapping is: <div><b><i>foo</i></b><i><span><meta>bar<b>bat</b></span></i></div> whereas without span wrapping we'd get <div><b><i>foo</i></b><meta><i>bar<b>bat</b></i></div> -- note that the <span> causes the <i> to precede the template content, instead of migrating inside it.
  3. {{#balance:table}} would allow insertion inside <table> and allow <td>/<th> tags in the content. The exact semantics need to be nailed down; it is possible that the inline mode might be extended to allow safe insertion inside <td>/<th> elements, which might remove some of the need for a special table mode. Templates which wish to insert rows or sequences of cells might still need a special mode.
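
The content restriction of the inline mode can be sketched in Python as follows. This is only an illustration, not the proposed implementation: the names INLINE_STRIP, TAG and strip_for_inline are hypothetical, the tag list is copied from the formal definition above, and the regular expression is a naive approximation that does not handle ">" inside attribute values. Closing tags left open at the end of the template, and the <span> wrapping, are not shown.

```python
import re

# Tag names copied from the formal definition of inline balance above.
INLINE_STRIP = frozenset("""
    address article aside blockquote center details dialog dir div dl
    fieldset figcaption figure footer header hgroup main menu nav ol p
    section summary ul h1 h2 h3 h4 h5 h6 pre listing form li dd dt
    plaintext button a nobr hr isindex xmp optgroup option
""".split())

# Naive tag matcher (assumption: no '>' inside attribute values).
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_for_inline(template_html):
    """Delete open and close tags named in INLINE_STRIP, keeping their text."""
    def drop(match):
        if match.group(1).lower() in INLINE_STRIP:
            return ""            # disallowed tag: silently removed
        return match.group(0)    # phrasing content: kept as-is
    return TAG.sub(drop, template_html)
```

For instance, strip_for_inline("<p>born <b>1990</b></p>") keeps the <b> element and its text but drops the surrounding <p> tags.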

We expect {{#balance:block}} to be most useful for the large-ish templates whose efficient replacement would make the most impact on performance, and so we propose {{#balance:}} as shorthand for {{#balance:block}}. (The current wikitext grammar does not allow {{#balance}}, since the trailing colon is required in parser function names, but the current patch set accommodates this without too much pain.)
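
As a rough illustration of the marker syntax, the following Python sketch extracts the balance mode from template source, treating {{#balance:}} as shorthand for block. The helper name balance_mode, the regular expression, and the choice to raise an error for an unknown mode are assumptions made for this example; the actual patch hooks into the parser function machinery instead.

```python
import re

# Hypothetical matcher for the {{#balance:TYPE}} marker (TYPE may be empty).
BALANCE_MARKER = re.compile(r"\{\{\s*#balance:\s*([a-z]*)\s*\}\}")
KNOWN_MODES = {"block", "inline", "table"}


def balance_mode(template_source):
    """Return 'block', 'inline' or 'table', or None for an unmarked template."""
    match = BALANCE_MARKER.search(template_source)
    if match is None:
        return None  # unmarked templates stay unbalanced
    mode = match.group(1) or "block"  # {{#balance:}} is shorthand for block
    if mode not in KNOWN_MODES:
        # Assumption: the RFC does not say how unknown modes are handled.
        raise ValueError(f"unknown balance mode: {mode}")
    return mode
```

With this sketch, a template containing {{#balance:inline}} yields "inline", one containing {{#balance:}} yields "block", and a template with no marker yields None.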

Violations of content restrictions (e.g., a <p> tag in a {{#balance:inline}} template) would be errors, but how these errors would be conveyed is an orthogonal issue. Currently bad tags are stripped silently. Some other options for error reporting include ugly bold text visible to readers (like {{cite}}), wikilint-like reports, or inclusion in [[Category:Balance Errors]]. Note that errors might not appear immediately: they may only occur when some other included template is edited to newly produce disallowed content, or only when certain values are passed as template arguments.

Implementation
Implementation is slightly different in the PHP parser and in Parsoid. Incremental parsing/update would necessarily not be done in the PHP parser, but it does need to enforce equivalent content model constraints for consistency.

In both implementations, we begin by recording the balance mode desired by each transclusion and then adding a synthetic <mw:balance-TYPE> tag around the transcluded content.

PHP parser implementation strategy:

  • In the Sanitizer validate the synthetic <mw:balance-TYPE> tag to prevent forgery in wikitext, but otherwise pass the tag through.
  • Just before handing the output to tidy/depurate, perform a "cheap" parse by splitting on < characters, as the Sanitizer does, and naïvely tracking open/close tags seen on a stack (again, as the Sanitizer already does). When the <mw:balance-TYPE> open/close tag is seen, traverse the open tag stack and emit close tags as needed. Even though this pass is just an approximation of true HTML5 parsing, and doesn't accurately track AFE state or implicitly generated tags (like <tbody>), this has been validated to be sufficient. For example, even though we don't track the implicit <tbody> tag on our naïve stack, it can only be present if there was an outer <table> tag, and emitting </table> is sufficient to close the implicit <tbody>.
  • So far it has not been necessary to access "precise" HTML5 parse information in order to implement balancing. If this is necessary in the future, a pure-PHP implementation of the HTML5 Tree Builder pass has been implemented.
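
The "cheap" parse described above can be sketched as follows. This is a Python stand-in for illustration only, not the PHP patch: it handles block mode alone, assumes markers are not nested, assumes no ">" appears inside attribute values, and drops stray close tags inside the template (an assumption of this sketch). The names balance_block, BLOCK_CLOSE and VOID are hypothetical.

```python
import re

TOKEN = re.compile(r"(<[^>]*>)")                    # split text from tags
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9:-]*)")  # closing slash, tag name
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}
# Context tags closed before a block-balanced template (from the RFC text).
BLOCK_CLOSE = {"p", "a", "table", "h1", "h2", "h3", "h4", "h5", "h6",
               "style", "script", "xmp", "iframe", "noembed", "noframes",
               "plaintext", "noscript", "textarea", "select", "template",
               "dd", "dt", "pre"}
MARKER = "mw:balance-block"


def balance_block(html):
    """Naively track open tags and emit close tags around the marker."""
    out, stack, base, inside = [], [], 0, False
    for token in TOKEN.split(html):
        m = TAG.match(token)
        if m is None:
            out.append(token)  # text, comments, anything not tag-like
            continue
        closing, name = m.group(1) == "/", m.group(2).lower()
        if name == MARKER and not closing:
            # Close the listed context tags, innermost first.
            for open_name in reversed(stack):
                if open_name in BLOCK_CLOSE:
                    out.append(f"</{open_name}>")
            stack = [n for n in stack if n not in BLOCK_CLOSE]
            base, inside = len(stack), True
        elif name == MARKER:
            # End of template: close everything opened inside it.
            while len(stack) > base:
                out.append(f"</{stack.pop()}>")
            base, inside = 0, False
        elif closing:
            if name in stack[base:]:
                while stack.pop() != name:  # pop back to the matching open tag
                    pass
            elif inside:
                continue  # stray close tag inside the template: drop it
        elif name not in VOID:
            stack.append(name)
        out.append(token)
    return "".join(out)
```

Applied to <p><a href="hello"><mw:balance-block><a href="world">foo<p></mw:balance-block>bar, this sketch closes the outer <a> and <p> before the marker and the inner <p> and <a> before the closing marker.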

In Parsoid:

  • In the tree builder we have access to a fully accurate open-element stack, so we can emit precisely the correct close tags.
  • If/when PHP switches over to a DOM-based tidy, it might be able to use this same implementation strategy (balancing inside tidy) but it's not a requirement.

Testing
A fuzz tester has been written, based on domino, which generates random sequences of tags and text for template and context, and then evaluates whether the desired semantics hold; that is, whether the following two expressions are equal:

  • tidy(tidy(balance(context)).replace(':hole:', tidy(stripOutsideMarker(balance(template)))))
    • Context and template balanced and tidied in isolation, then template inserted via string replacement
  • tidy(tidy(balance(context.replace(':hole:', stripOutsideMarker(template)))))
    • Template inserted into context, then balanced and tidied.

In this context tidy is just an HTML5 parse and serialize. The context is expected to contain <mw:balance-TYPE>:hole:</mw:balance-TYPE> somewhere inside it. The template is also wrapped with <mw:balance-TYPE> tags. The stripOutsideMarker function removes everything outside the <mw:balance-TYPE> tag. Note that we use tidy twice in the second case, because some tidy transformations are sensitive to the number of times we've tidied -- for example, table fostering can leave nodes in positions where they will be further altered by a subsequent tidy.
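
The property being fuzzed can be written down as a small Python sketch. The real tester is built on domino; here balance and tidy are passed in as plain functions so the comparison itself is visible, and the names substitution_is_safe and strip_outside_marker are hypothetical. The text above does not say whether stripOutsideMarker keeps the marker tags themselves; this sketch assumes it returns only the content between them.

```python
import re

# Matches one <mw:balance-TYPE>...</mw:balance-TYPE> span (no nesting assumed).
MARKER_SPAN = re.compile(r"<mw:balance-(\w+)>(.*?)</mw:balance-\1>", re.DOTALL)


def strip_outside_marker(html):
    """Return the content between the balance marker tags (assumption: the
    marker tags themselves are dropped along with everything outside them)."""
    match = MARKER_SPAN.search(html)
    if match is None:
        raise ValueError("no balance marker found")
    return match.group(2)


def substitution_is_safe(context, template, balance, tidy):
    """Compare 'balance separately, then substitute' with 'substitute, then balance'."""
    # Context and template balanced and tidied in isolation, then spliced.
    separately = tidy(
        tidy(balance(context)).replace(
            ":hole:", tidy(strip_outside_marker(balance(template)))
        )
    )
    # Template inserted into the context first, then balanced and tidied twice.
    together = tidy(
        tidy(balance(context.replace(":hole:", strip_outside_marker(template))))
    )
    return separately == together
```

A fuzzing loop would call substitution_is_safe with randomly generated context and template strings, a real balancer, and an HTML5 parse-and-serialize function as tidy, and report any pair for which it returns False.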

This tool has validated the set of tags named in the formal definitions of the balance modes, as well as verifying that the "sloppy parse" done in the PHP implementation yields the same results as a precise parse would.

CAVEAT: This tester does not run the output through "legacy tidy". It is possible that the p-wrapping, empty element removal, and other nonstandard evilness performed by legacy tidy might affect the correctness of the balancing. I will hook up legacy tidy to the fuzz tester to look into this; hopefully the transition from legacy tidy to depurate will also make this consideration moot.

Examples
Here are some examples of the balance transformation:

  1. <p><a href="hello"><mw:balance-block><a href="world">foo<p></mw:balance-block>bar
    • The balancer will transform this to: <p><a href="hello"></a></p><mw:balance-block><a href="world">foo<p></p></a></mw:balance-block>bar
    • An HTML5 parse (or tidy) will transform this to: <p><a href="hello"></a></p><a href="world">foo<p></p></a>bar
    • The block balancing ensured that we didn't have an <a> tag inside an <a> tag.
    • The block balancing ensured that the inner <p> didn't implicitly close an outer <p>.
  2. <p><code><center><mw:balance-inline><span></mw:balance-inline><h1>foo
    • The balancer will transform this to: <p><code><center><span><mw:balance-inline><span></span></mw:balance-inline></span><h1>foo
    • An HTML5 parse (or tidy) will transform this to: <p><code></code></p><center><code><span><span></span></span><h1>foo</h1></code></center>
    • Note that HTML5 implicitly closes the <p> when it encounters <center>. This is why <center> is stripped inside (inline balanced) template contents.
    • Note that the HTML5 "reconstruction of active formatting element list" algorithm adds a new synthetic <code> element before the <span>. The balance algorithm adds a <span> *outside* of the template content, to trigger AFE reconstruction and ensure that AFEs of the context don't leak inside the template.

Deployment
Unmarked templates are "unbalanced" and will render exactly the same as before; they will just be slower (require more CPU time) than balanced templates.

It is expected that we will profile the "costliest"/"most frequently used/changed" templates on Wikimedia projects and attempt to add balance markers first to those templates where the greatest potential performance gain may be achieved. @tstarling noticed that adding a balance marker to [[:en:Template:Infobox]] (https://en.wikipedia.org/wiki/Template:Infobox) could affect over two million pages and have a large immediate effect on performance. We would want to carefully verify first that balance would not affect the appearance of any of those pages, using visual diff or other tools.

Related: T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool, T114072: <section> tags for MediaWiki sections.

Mailing list discussion: https://lists.wikimedia.org/pipermail/wikitech-l/2015-October/083449.html

CURRENT STATUS (2019-07-19): the parsing team's current roadmap postpones implementation of this feature until after the Parsoid/core parser integration is done. However, since the core parser uses a DOM-based tidy now (Remex), it could feasibly be done during tidy in the same way in both the legacy parser and Parsoid.

Event Timeline

Change 279670 had a related patch set uploaded (by Cscott):
WIP: Add {{#balance}} to opt-in to balanced templates

https://gerrit.wikimedia.org/r/279670

RobLa-WMF mentioned this in Unknown Object (Event). Apr 13 2016, 6:54 PM
RobLa-WMF mentioned this in Unknown Object (Event). Apr 13 2016, 7:34 PM

Updated the RFC to match the current proposed semantics and implementation.

Excuse me if I missed something in the proposal, but I'd like to raise the question of template parameters. Currently, template parameters are wikitext, and can thus contain (unbalanced) HTML tags. How should parameters be treated in balanced templates? Should each parameter be pre-parsed on its own? Or sanitized? Or do we allow plain text parameters only? Or limited wiki syntax? Structured data?...

Allowing un-balanced wikitext parameters to be used in a balanced template can break it, or at least lead to undesired results.

Change 303431 had a related patch set uploaded (by Cscott):
WIP: Extend 'format' spec to include format strings.

https://gerrit.wikimedia.org/r/303431

ssastry changed the task status from Open to Stalled. Jun 7 2017, 7:21 PM

Sorry, we are pretty overcommitted and this is currently stalled till we finish up some ongoing projects.

Someone asked for a logo.

bitmap.png (660×866 px, 25 KB)

Balanced templates. Gettit?

Or the minimalist version:

{{===}}
Krinkle subscribed.

Moving to backlog as current status is unclear.

If the RFC has a clear desired outcome or problem statement, and resourcing commitment from a team that is interested in wider feedback, input or approval, then move it to the Inbox to let TechCom know :)

Ya, this is stalled because we don't want to do this in both parsers. But, yes we'll flag this once we are ready to pick this up again.

Note that the core parser has a DOM-based tidy now (Remex) and so this is more feasible to implement in the legacy parser than previously. Our current roadmap still postpones this until after Parsoid replaces the legacy parser, but it would be possible (for instance) to start recognizing the {{#balance}} parser function and emitting linter warnings from remex during tidy, as a first step.

In the recent Tech Conf there was a session about onwiki tooling, which includes templates: T234661.

The topic of balanced templates came up in these discussions a few times as something that may help address the concerns some people have about the performance impact of making templates global.

Is this true? Will moving templates in the direction of being more balanced allow more stable and better performing cross-wiki transclusion?

... Also, how is this related to Scribunto modules? Is there a plan to mark them as balanced? It looks like a consensus is forming that before templates are fully global, it's a good idea to make modules global first. Can modules go global without implementing this Balanced templates RFC, or should the balancing be done first?

I see three things that are enabled by balanced templates:

  • Improving performance of re-parses when templates change. This is related to global templates only in so far as global templates could potentially mean that individual templates are used on more pages.
  • Parsing templates in a context different from the context of the local page. Balanced templates are a precondition to that, but quite a bit more work would be needed. This is where I brought up balanced templates in our conversation, but if I recall correctly, you really want the opposite - evaluation in the local context.
  • Visual editing of the template, as well as rendering of pages for editing, without having to evaluate the templates it contains. This would be very nice to have for e.g. offline editing, but I see no connection to global templates.

It seems to me like balanced templates and global templates touch on some related topics, but don't directly impact each other.

But I might be missing something, I'd be interested to hear @cscott's take.

All of that. But, to clarify your point #2 which hints at this, the important piece here is the decoupling of parsing of templates from the page that contains them. Balanced templates are a necessary but not sufficient condition to enable that. But, the decoupling means you can memoize a template's parsed output (all the way to HTML) globally across wikis. That of course requires us to be able to track usage of certain kinds of functionality that prevents that kind of memoization (time-dependent functionality, any database state like revids, page ids, random numbers, etc.). But I believe that kind of state tracking support already exists in MediaWiki.

At the level of the template, there's a flag on PPFrame for things like Cite's <references> to indicate that the output of the template itself depends on something external. But it doesn't look like that's used for things like time-dependent functionality, rev ID, and so on.

At the level of the full page parse, MediaWiki tracks time-dependent functions, access to rev IDs, and so on. But it assumes that those things are constant within the process of any one parse so it doesn't track it at the template level.

There's probably some stuff that's not tracked at all but maybe should be. For example, Scribunto calls math.randomseed( 1 ) for every top-level #invoke so math.random should usually be the same every time, but there are probably ways to get nondeterministic behavior (Reseed with os.clock()? Nested #invokes?).

Also of note is that we'll probably have to reimplement a lot of that tracking in Parsoid/PHP when we get to the point of replacing Parser.php. ;)

My take is that the basic Scribunto processing model (and core code in general) needs to move away from manipulating wikitext strings and instead be returning and combining DOM fragments. Balance falls out of that. This is a change I've already been making to core APIs as part of the tidy process; most methods which add fragments to the page output now tidy the fragments so they are always balanced.

So I think that aspect is mostly orthogonal to global templates (although there are performance considerations, described above). Unless we decide to collect a number of these fundamental changes to Scribunto and call it "scribunto 2.0" with a localizable module name/parameter name/enumeration mechanism and a new library for manipulating wikitext and DOM fragments. If you have to make breaking changes, sometimes it's better for porting to make a number of them at once.

Re-triaging an old RFC that seems to still be active. Please add the process, starting by adding the template to the task description.

Ya, this is stalled because we don't want to do this in both parsers. But, yes we'll flag this once we are ready to pick this up again.

@ssastry: Two years later, is there a vague time frame to share when that would be?

I suppose we could create an ultra-ultra-epic task above T236809 which could block this one, but I'm not sure it's valuable to try to model that via Phabricator?