Design Doc: Resource Naming
Joshua Marantz, October 2010
Prior to launch in November 2010, Instaweb/Apache's rewritten resources were encoded as:
URL_PREFIX/FILTER.HASH.ENCODING.EXT
URL_PREFIX is typically something like http://yourhost.com/instaweb or, if sharding domains, http://shard%d.yourhost.com/instaweb.
FILTER is a two-letter code: ic=image compression, cc=combine css, jm=javascript minification, ce=cache extension, etc.
HASH is an md5 or sha1 hash of the resource content (excluding headers).
ENCODING is a filter-specific representation of a formula for reconstructing the output resource from the URL. Depending on settings in the server, we may leave this empty, to reduce URL size when we have reliable local storage on a single server, or a network-accessible database reachable by all servers. Some of our filters currently use an encoding based on base64(gzip(serialize(protobuf))), but we are migrating most filters to a simpler URL-escaping strategy.
EXT is the computed filename extension, e.g. "js" or "css". It may not be the same as the extension of the origin resource, particularly if the origin extension was wrong (e.g. the file was really a png but had a .gif extension).
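To make the four fields concrete, here is a minimal sketch of splitting an old-style URL. The struct and function names are hypothetical and are not part of Instaweb. It assumes, as noted later in this document, that ENCODING never contains '.' or '/', and that FILTER is exactly two characters.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical holder for the fields of URL_PREFIX/FILTER.HASH.ENCODING.EXT.
struct OldResourceName {
  std::string url_prefix;  // e.g. "http://yourhost.com/instaweb"
  std::string filter;      // two-letter code, e.g. "ic"
  std::string hash;        // content hash
  std::string encoding;    // may be empty
  std::string ext;         // e.g. "css"
};

// Splits an old-style resource URL. Returns nullopt if the shape is wrong.
std::optional<OldResourceName> ParseOldResourceUrl(std::string_view url) {
  constexpr std::size_t npos = std::string_view::npos;
  const std::size_t slash = url.rfind('/');
  if (slash == npos) return std::nullopt;
  const std::string_view leaf = url.substr(slash + 1);

  // The leaf must contain at least three dots: FILTER.HASH.ENCODING.EXT
  const std::size_t dot1 = leaf.find('.');
  if (dot1 == npos) return std::nullopt;
  const std::size_t dot2 = leaf.find('.', dot1 + 1);
  if (dot2 == npos) return std::nullopt;
  const std::size_t dot3 = leaf.rfind('.');
  if (dot3 == dot2) return std::nullopt;  // only two dots present

  OldResourceName out{
      std::string(url.substr(0, slash)),
      std::string(leaf.substr(0, dot1)),
      std::string(leaf.substr(dot1 + 1, dot2 - dot1 - 1)),
      std::string(leaf.substr(dot2 + 1, dot3 - dot2 - 1)),
      std::string(leaf.substr(dot3 + 1))};
  if (out.filter.size() != 2 || out.hash.empty() || out.ext.empty()) {
    return std::nullopt;
  }
  return out;
}
```

An empty ENCODING shows up as two adjacent dots in the leaf, which this sketch accepts.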
The old scheme had a number of drawbacks that are becoming increasingly painful:
- Security -- moving resources from one domain to another exposes our users to XSS hacks. In particular, a complex web of stupidity from browsers to SOP in Java applets creates a risk of adding security holes by putting resources together that were formerly on separate domains.
- Correctness -- various bugs where resources with relative paths are loaded from different URLs than their author intended.
- Ease of use -- the current encoding makes resources show up in Firebug in a format that makes it inconvenient to discover what they are. This makes it harder to debug Instaweb and will also make it harder for site owners to understand what it is doing.
- URL bloat -- when encoding origin URLs we must avoid the characters '.' and '/', which means there's a lot of escaping. Further, we use ',' as an escape character, which is also the escape character for the filename encoding. So this strategy produces very long filenames.
Each of these issues could be addressed individually, but we propose here a holistic strategy that addresses all of them.
In the new strategy, we would never move resources between domains (unless those domains were declared equivalent via a directive or command-line option). And we'd minimize moving resources across path hierarchies even within a domain -- we'd do that only when combining multiple resources. We'd instead generate rewritten resources by changing only the leaf names.
Here's an example. A web page:
http://www.mysite.com/index.html
loads
css1.css
styles/css2.css
http://www.mysite.com/css3.css
http://mysitestatic.com/styles/css4.css
By default we'd combine css1, css2, and css3, but leave css4 separate. We'd have to update relative references in css2.css. The new resource would be:
http://www.mysite.com/ENCODING("css1.css","styles/css2.css","css3.css").cc.HASH.CHECKSUM.css
Here, ENCODING can use the relative paths from http://www.mysite.com. The encoding will be left in a human-readable form, balanced against a desire to minimize the URL size. With the paths factored out, there is no longer substantial opportunity for gzip to reduce URL size. Also, our encoding scheme already compresses common suffixes like .css and .jpeg.
The human-readable encoding comes first (leftmost) so that both Firebug and Chrome Developer Tools have useful resource views. The current resource view shows many lines like "ic.poiajsdpf89a9s709879aijs9d08u9..." where the rest of the line gets cut off in the Firebug display. Putting the encoded resource name first has these advantages:
- We see what the contents are in a form that will make sense to the web site owner.
- It allows us to leave '.' unescaped inside the URLs by matching the other fields from the right. This will shrink the URLs and leave them more legible, without having to hijack yet another separator that would then need to be escaped.
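The right-matching idea can be sketched as follows. This is an illustration only: the struct, the function name, and the sample leaf are hypothetical, and the leaf layout ENCODING.FILTER.HASH.CHECKSUM.EXT is taken from the example URL above. The four trailing fields are peeled off from the right, so whatever remains on the left is the encoded name and may itself contain '.'.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical holder for ENCODING.FILTER.HASH.CHECKSUM.EXT.
struct ProposedResourceName {
  std::string encoding;  // may contain '.', e.g. "site.min.css"
  std::string filter;
  std::string hash;
  std::string checksum;
  std::string ext;
};

// Peels EXT, CHECKSUM, HASH and FILTER off the right-hand side of the leaf.
// Returns nullopt if fewer than four non-empty fields can be removed or if
// nothing is left over for the encoded name.
std::optional<ProposedResourceName> ParseFromRight(std::string_view leaf) {
  std::array<std::string_view, 4> fields;  // ext, checksum, hash, filter
  for (std::string_view& field : fields) {
    const std::size_t dot = leaf.rfind('.');
    if (dot == std::string_view::npos || dot + 1 == leaf.size()) {
      return std::nullopt;  // missing separator or empty field
    }
    field = leaf.substr(dot + 1);
    leaf = leaf.substr(0, dot);
  }
  if (leaf.empty()) return std::nullopt;
  return ProposedResourceName{std::string(leaf), std::string(fields[3]),
                              std::string(fields[2]), std::string(fields[1]),
                              std::string(fields[0])};
}
```

For a leaf such as site.min.css.cc.0123abcd.4f2a.css, the dots inside site.min.css never need escaping because the parser stops after the fourth field from the right.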
In this example, css2.css would need to have its URL references corrected. For example, if it referenced a background image as url('background.png'), that would have to be updated to url('styles/background.png'), as the referring css would now be a level up in the hierarchy.
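A rough sketch of that correction step is shown below. The function names are hypothetical, and the regular expression is a simplification: it only handles url(...) tokens and ignores @import strings, comments, and escaped characters, which a real implementation would more likely handle with a CSS parser. References that are already absolute, root-relative, fragment-only, or data: URIs are left alone.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns true for references that must not be prefixed: empty, root-relative
// or protocol-relative ("/..."), fragment-only, absolute, or data: URIs.
bool LeaveUntouched(const std::string& ref) {
  return ref.empty() || ref[0] == '/' || ref[0] == '#' ||
         ref.find("://") != std::string::npos ||
         ref.compare(0, 5, "data:") == 0;
}

// Prefixes every relative url(...) reference in `css` with `old_dir`, the
// directory the stylesheet used to live in (e.g. "styles/").
std::string RebaseCssUrls(const std::string& css, const std::string& old_dir) {
  // Group 1: optional quote character. Group 2: the reference itself.
  static const std::regex kUrlPattern(
      R"re(url\(\s*(['"]?)([^'")]*)\1\s*\))re");
  std::string out;
  std::size_t last = 0;
  for (std::sregex_iterator it(css.begin(), css.end(), kUrlPattern), end;
       it != end; ++it) {
    const std::smatch& m = *it;
    const auto start = static_cast<std::size_t>(m.position(0));
    out.append(css, last, start - last);  // copy text before the match
    const std::string quote = m[1].str();
    const std::string ref = m[2].str();
    out += "url(" + quote + (LeaveUntouched(ref) ? ref : old_dir + ref) +
           quote + ")";
    last = start + static_cast<std::size_t>(m.length(0));
  }
  out.append(css, last, std::string::npos);  // copy the remainder
  return out;
}
```

Calling RebaseCssUrls on the contents of styles/css2.css with old_dir set to "styles/" turns url('background.png') into url('styles/background.png').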
We will write resources into the highest level path.
In the example above, we could not combine resources from http://www.mysite.com and http://mysitestatic.com. However it's possible that these are two names for the same physical server. Or that http://www.mysite.com is served on-site, whereas http://mysitestatic.com is served by a CDN. We'd like to have all the cacheable content come from a CDN, perhaps without burdening the site developers with messing with the HTML to achieve that.
Instaweb should therefore be configured to allow equivalence sets. E.g. InstawebApache could allow a configuration parameter:
InstawebDomain http://mysitestatic.com http://www.mysite.com...
With this option, we'd map URLs from http://www.mysite.com to http://mysitestatic.com. We could support arbitrary numbers of equivalence classes for domains. For example:
InstawebDomain a b c
InstawebDomain d e
This would cause us to map resources from domains b and c to domain a. We'd also map domain 'e' to domain 'd'.
We would not rewrite resources from any other domain, unless directed to do so by another option InstawebRewriteAllDomains, which we would set when we were using a proxy for debugging and exploration. This would effectively implement the "whitelist" policy for resource rewriting that we've discussed.
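The mapping and whitelist behavior described above could be modeled roughly as follows. The class and method names are hypothetical, the sketch takes only the whitespace-separated arguments of each InstawebDomain directive, and it does not model InstawebRewriteAllDomains or the "%d" sharding syntax.

```cpp
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical model of the domain equivalence classes declared by
// InstawebDomain directives.
class DomainMap {
 public:
  // `args` holds the arguments of one directive, e.g. "a b c". The first
  // domain is the target; the remaining domains are mapped onto it.
  void AddEquivalenceClass(const std::string& args) {
    std::istringstream in(args);
    std::string target;
    if (!(in >> target)) return;  // ignore an empty directive
    canonical_[target] = target;
    std::string domain;
    while (in >> domain) canonical_[domain] = target;
  }

  // Returns the domain that resources from `domain` should be rewritten to,
  // or nullopt if `domain` was never listed (and so must not be rewritten).
  std::optional<std::string> MapDomain(const std::string& domain) const {
    const auto it = canonical_.find(domain);
    if (it == canonical_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::map<std::string, std::string> canonical_;
};
```

With the two directives from the example, MapDomain("c") yields "a", MapDomain("e") yields "d", and any unlisted domain yields nullopt.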
This feature would help us implement the pagespeed rule "serve static content from cookieless domains", under control of the site owner.
We would also allow the "%d" syntax in the first argument to InstawebDomain.
- Security: in this scheme we would not move any resource unless explicitly instructed to do so via an "InstawebDomain" directive. This puts security in control of the site administrator. In fact, the administrator could make the site more secure without editing HTML code by moving resources off of the authenticated domain, via the InstawebDomain directive.
- Correctness: we will not be changing the location of any resources, from a URL perspective. The only exception would be to move CSS files in a subdirectory up to be next to their siblings. We could determine with static analysis whether it is safe to move javascript files up.
- Ease of use -- the (mostly) human-readable encoding will be the leftmost part of the leaf of each resource. The location of the URL will be unchanged (modulo domain equivalence classes).
- URL bloat -- by rewriting resources without changing their domain or location, we can omit the common prefix and give references only to the leaves, or the relative paths to the leaves. By leaving '.' unescaped and right-matching the rest of the URL pattern, we can avoid having to escape it.
The current resource pattern knowledge is mostly encompassed in these methods:
void ResourceManager::SetUrlPrefixPattern(const StringPiece& pattern);
GoogleString ResourceManager::GenerateUrl(const StringPiece& name) const;
const char* ResourceManager::SplitUrl(const char* url, int* shard) const;
plus the glue to parse the url prefix from command-line options and apache directives and send it through the class hierarchy.
We'd replace the url_prefix directives and command-line options with ones to add new domain equivalence classes.
Moving Javascript across domains or paths is hard, because it requires some solution to correct relative references in the .js code. Moving CSS is easier; it requires absolutifying URLs, but they are generally easy to find. However, CSS can contain embedded javascript, which means proxying it from an untrusted domain opens a potential security hole if moved onto a domain with auth cookies. Moving images is functionally straightforward, as they typically do not have references to other URLs. However, they may contain hidden JARs or other security bombs. Assuming, for the moment, that this attack can be thwarted in some way, say, by non-deterministic image transformations, it may be possible to proxy and optimize images from other domains. However, transforming the images might violate the terms of service of the image owner.
So for the moment, we will not pursue moving images, javascript, or css across domains, except via explicit InstawebDomain directive.
The naming scheme as of the launch was acceptable from a security and functional perspective. But it was not friendly to site owners or to us. Resource names looked quite mangled, and it was hard to read where they came from. We introduced a new naming syntax which persists to today (Oct 2013), manifesting in mod_pagespeed and ngx_pagespeed, and we have no current plans to change it. The syntax is:
ENCODED_NAME.pagespeed.FILTER_ID[OPTIONS].HASH.COMPUTE_EXTENSION
The optional "[OPTIONS]" section was introduced to support embedding image-optimization settings in URLs to facilitate models where multiple HTML domains can be configured with different image settings, but can map resources to a common image-serving domain.
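As an illustration of how this syntax can be taken apart, the sketch below locates the last ".pagespeed." marker and then matches HASH and the extension from the right. The struct and function names are hypothetical. Because the format of the optional OPTIONS section is not specified here, the sketch leaves FILTER_ID and OPTIONS together in one field rather than guessing how to separate them.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical holder for ENCODED_NAME.pagespeed.FILTER_ID[OPTIONS].HASH.EXT.
struct PagespeedName {
  std::string encoded_name;
  std::string filter_and_options;  // FILTER_ID plus any OPTIONS, unsplit
  std::string hash;
  std::string ext;
};

// Splits a rewritten leaf name. Returns nullopt if the marker or any field
// is missing.
std::optional<PagespeedName> ParsePagespeedLeaf(std::string_view leaf) {
  constexpr std::string_view kMarker = ".pagespeed.";
  constexpr std::size_t npos = std::string_view::npos;
  const std::size_t marker = leaf.rfind(kMarker);
  if (marker == npos || marker == 0) return std::nullopt;
  const std::string_view tail = leaf.substr(marker + kMarker.size());

  // Match the extension and hash from the right of the tail.
  const std::size_t ext_dot = tail.rfind('.');
  if (ext_dot == npos || ext_dot == 0 || ext_dot + 1 == tail.size()) {
    return std::nullopt;
  }
  const std::size_t hash_dot = tail.rfind('.', ext_dot - 1);
  if (hash_dot == npos || hash_dot == 0 || hash_dot + 1 == ext_dot) {
    return std::nullopt;
  }
  return PagespeedName{
      std::string(leaf.substr(0, marker)),
      std::string(tail.substr(0, hash_dot)),
      std::string(tail.substr(hash_dot + 1, ext_dot - hash_dot - 1)),
      std::string(tail.substr(ext_dot + 1))};
}
```

For a hypothetical leaf such as css1.css.pagespeed.cc.0123abcd.css, this yields the encoded name css1.css, the filter field cc, the hash 0123abcd, and the extension css.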