Subresource Integrity Addressable Caching

A Collection of Interesting Ideas,

This version:
https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html
Editor:
(Facebook Inc.)
Participate:
File an issue (open issues)

Abstract

This note describes some security and privacy risks associated with content addressable caching in Web user agents, particularly those that may arise from the use of Subresource Integrity metadata to identify cache-eligible resources.

Nothing stops user agents from building such a cache. This note encourages implementations following proper precautions, while also indicating why a correct and normative specification for such a cache is likely unnecessary and perhaps impossible.

1. Introduction

A small number of popular JavaScript frameworks and libraries account for a substantial portion of the network, battery, and time budgets of users of the contemporary Web. Many applications include these libraries even if they only use a small portion of the functionality they provide. Large improvements seem possible if Web User Agents ("UAs") could fetch and compile these libraries a single time. The benefit would be especially large for UAs that do not currently share cached content across origins. (typically to protect user privacy)

If a content addressable cache could be populated prior to user browsing activity, the UA could identify opportunities when a device is idle, charging, and has unmetered network available, to download, parse, and compile script files in wide use in order to increase performance later, especially under slow network conditions.

A historical difficulty of accomplishing this has been that resources may respond differently in different contexts, update at any time, and the same public resource may be hosted in many locations without any way to determine (without actually fetching it) whether it is actually the same content.

The Subresource Integrity [SRI] ("SRI") specification now allows for applications to tag their script dependencies with cryptographic hashes that strongly identify the exact content they want. This naturally raises the possibility of using the SRI hash to key into a content addressable cache in the UA.

Version 1 of SRI explicitly avoided specifying a means of doing this. This document explores why content addressable caching is difficult on the Web platform, the sometimes subtle security and privacy issues it raises, and how UAs might sidestep these difficulties in a non-normative fashion.

Sharing cached resources across origins based on other signals of content identity (such as the "immutable" value for the Cache-Control header) may have similar privacy and security concerns and find this document useful.

2. Timing Leaks

The first difficulty of implementing cross-origin, content addressable caching on the Web platform is that it may leak information about the user’s past browsing. For example, site "A" could load a script with a given hash, then later when the user visits site "B", it could attempt to observe the time it takes to load that same resource and infer whether the user had previously been at "A".

Such timing leaks may exist today in UAs that do not segregate their cache based on the requesting origin or registerable domain, but as a content addressasble cache is shared by design, this issue requires explicit consideration in this context.

3. Deterministic History Leaks

In the presence of a user-specific content addressable cache, "B" might also do the following: <script src='/probe.js?unique_id=1IAMTRACKINGY0U' integrity='sha-256:hash-of-resource-at-a' />

If "B" serves a resource with this markup and observes no request to the probe resource is made, it can infer with near certainty that the user has previously visited "A" because the cache is primed.

4. Origin Laundering

In general, it would seem that the use of a cryptographic hash to identify content to be loaded should prevent cache poisoning. However, there are some circumstances where the properties of a script (or other content which might be similarly identified in the future) depend on the origin from which it is loaded.

To reduce attack surface, Content-Security-Policy [CSP3] ("CSP") allows a resource to specify to the UA from which origins it should load content like scripts. If https://example.com specifies

Content-Security-Policy: script-src 'scripts.example.com' for its resources in order to only allow known scripts from a source it controls, it would be bad if an attacker could inject the following code into a document: <script src='https://scripts.example.com/angular.min.js' integrity='sha-256:deadbeef1234badcafe...' />

And by so doing get the UA to load an old version of the Angular framework from a content addressable cache if the server at https://scripts.example.com origin wouldn’t actually serve that resource.

Origin Laundering of this sort might also allow bypassing restrictions on the loading of Workers or ServiceWorkers, which must be at the same-origin as the resource loading them.

Origin Laundering could create immediate Universal Cross-Site Scripting (UXSS) if it was applied to <plugin> , <object> , or similar subresource types that create a plugin document, as PDFs and other plugin document types often use variant access control policies which allow these document types to assume the origin in their own URL, rather than the origin of the document to which they are a subresource.

A similar UXSS risk would exist if top-level documents (e.g. HTML in an <iframe> ) were eligible for loading from a content aware cache in a manner subject to Origin Laundering.

5. Suggestions for a Solution

Because of these issues, it is difficult to specify SRI-based content addressable caching behavior that depends entirely on the interactions between the user agent and resources.

The difficulty of normatively specifying a neutral algorithm does not mean that we must forgo the potential improvements of such a cache. So long as they do not violate an application’s semantics (including security), UAs have wide latitude to make performance improvements. The performance benefits of a cache need not be uniform, deterministic or even entirely predictable by applications.

A simple solution for timing and history leaks is to not base the population of a content addressable cache on individual user behavior. If loading a script is fast because it is always fast in a given UA, there is no privacy leak.

One possible answer to building SRI addressable caching is for user agents to enable it for pieces of content based on large-scale content heuristics, and never individual user behavior. A number of strategies are possible: crawling and analysis of content on the Web at large and aggregate user browsing behavior based on UA telemetry are two obvious ones. Either approach could enable a user agent to approximate which resources referenced with SRI hashes would provide the most user benefit as a member of a content addressable cache.

A user agent might also simply "pick winners" manually to avoid the security and privacy traps, but it is likely a more sophisticated strategy will produce better results. The exact nature of such a strategy is inherently non-normative and which strategy will achieve the best results may be unstable over time and for different user populations.

If the content addressable cache is pre-populated identically for large user populations, the only information that might be leaked via side-channels is the last update time of the SRI cache, and possibly the size of the storage allotted to this type of cache (if that is allowed to vary statically or dynamically based on the available storage on the device) by examining whether resources at the time and space edges of the cache are primed.

Managing origin laundering requires the cache to be annotated with a list of URLs observed to actually host a given resource, and double-keying the content addressable cache with both the resource hash and its location. Either a crawling or user telemetry based cache population strategy provides the necessary data for this double-keying.

In many circumstances, it may be possible to safely load from the content addressable cache using only the hash key, for example:

5.1. Second-Order Effects

Decisions on what to cache are not content-neutral, or if they begin so, they cannot remain so. Any algorithm will create a performance bias towards things included in the cache, which may have other, possibly detrimental, effects in the long term. It may entrench incumbent, popular frameworks at the expense of innovation, or it might discourage early adoption of security fixes if newly-patched libraries are slower because they have not been pre-cached yet. Conversely, a policy of only caching the latest few versions of a given library might encourage more rapid patching and obsolesence of vulnerable versions.

A privileged place in the cache might encourage providers of popular libraries and frameworks to bundle unrelated features to gain a competitive advantage, or just to evict less popular competing frameworks from a size-limited cache. Whether such effects would materialize, and their magnitude, is impossible to accurately predict, but implementers may wish to take these concerns into consideration when deciding whether and how to implement such a caching algorithm.

References

Informative References

[CSP3]
Mike West. Content Security Policy Level 3. 21 June 2016. WD. URL: https://w3c.github.io/webappsec-csp/
[SRI]
Devdatta Akhawe; et al. Subresource Integrity. 23 June 2016. REC. URL: https://w3c.github.io/webappsec-subresource-integrity/