This is the fourth part of a five-part article.
Figure 1: Cache Usage on the Web
This diagram illustrates a key point in our discussion: caches are found on the Web in many places and are constantly trying to hold your site content whenever possible. While it is easy to remain ignorant, allowing them to dictate caching behavior, from the standpoint of site performance, it is vital to purposefully engage them, dictating which objects should or should not be cached, and for how long.
Freshness and Validation
In order to make the best use of any cache, including effectively using a browser cache, we need to provide some indication of when a resource is no longer valid and should therefore be reacquired. More specifically, we need the ability to indicate caching rules for Web page objects, ranging from setting appropriate expiration times to indicating when a particular object should not be cached at all. Fortunately, we have all of these tools at our disposal in the form of HTTP cache control rules.
The key to cache awareness lies in understanding the two concepts that govern how caches behave: freshness and validation. Freshness refers to whether or not a cached object is up-to-date, or in more technical terms, whether or not a cached resource is in the same state as that same resource on the origin server. If the browser or other Web cache lacks sufficient information to confirm that a cached object is fresh, it will always err on the side of caution and treat it as possibly out-of-date or stale. Validation is the process by which a cache checks with the origin server to see whether one of those potentially stale cached objects is fresh or not. If the server confirms that the cached object is still fresh, the browser will use the local resource; if not, a fresh copy must be served.
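The freshness decision described above can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular browser works: the function name is_fresh and its parameters are hypothetical, and real caches apply additional rules.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(stored_at: datetime, max_age_seconds: Optional[int], now: datetime) -> bool:
    """Return True if a cached copy may be used without asking the server."""
    if max_age_seconds is None:
        # No freshness information: err on the side of caution, treat as stale.
        return False
    return now - stored_at < timedelta(seconds=max_age_seconds)


stored = datetime(2004, 2, 13, 20, 46, 4, tzinfo=timezone.utc)
one_hour_later = stored + timedelta(hours=1)

print(is_fresh(stored, 86400, one_hour_later))  # 24-hour lifetime: still fresh
print(is_fresh(stored, 1800, one_hour_later))   # 30-minute lifetime: stale
print(is_fresh(stored, None, one_hour_later))   # no lifetime given: stale
```

A stale result does not mean the cached copy is discarded; it means the cache has to validate it with the origin server before using it.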
A Basic Example of Caching
The concepts of freshness and validation are best illustrated with an example (in this case using a browser cache, but the core principles hold true for public caches as well):
Step 1: A remote site contains a page called page1.html. This page references image1.gif, image2.gif, and image3.gif and has a link to page2.html. When we access this page for the first time, the HTML and the associated GIF images are downloaded one-by-one and stored in the local browser cache.
Step 2: The user follows the link to page2.html, which has never been visited before and which references image1.gif, image3.gif, and image4.gif. In this case, the browser downloads the markup for the new page, but the question is: should it re-download image1.gif and image3.gif even though it already has them cached? The obvious answer would be no, but how can we be sure that the images have not changed since we downloaded page1.html? Without cache control information, the truth is that we can't. Therefore, the browser needs to revalidate each image by sending a request to the server to check whether it has been modified. If the image has not changed, the server sends a quick 304 Not Modified response that instructs the browser to go ahead and use the cached copy. But if it has been modified, a fresh copy of the image has to be downloaded. This common request-and-response cycle is shown here:
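The server's side of that revalidation exchange can be sketched roughly as follows. The function name revalidate and the sample timestamps are hypothetical, and the sketch assumes the browser sends the time of its cached copy in an If-Modified-Since request header, which is one common way of revalidating.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional, Tuple


def revalidate(if_modified_since: Optional[str], last_modified: datetime) -> Tuple[int, str]:
    """Decide how a server might answer a conditional request for one object."""
    if if_modified_since is None:
        return 200, "OK"  # plain request: send the full object
    try:
        cached_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200, "OK"  # unreadable date: fall back to a full response
    if cached_time.tzinfo is None:
        cached_time = cached_time.replace(tzinfo=timezone.utc)
    if last_modified <= cached_time:
        return 304, "Not Modified"  # the browser may reuse its cached copy
    return 200, "OK"  # object changed: send a fresh copy


# Hypothetical timestamps for image1.gif
header = format_datetime(datetime(2004, 2, 13, 20, 46, 4, tzinfo=timezone.utc), usegmt=True)
unchanged = datetime(2004, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
changed = datetime(2004, 2, 14, 9, 0, 0, tzinfo=timezone.utc)

print(revalidate(header, unchanged))
print(revalidate(header, changed))
```

Even the 304 branch costs a full round trip to the server, which is the overhead that explicit expiration times are meant to avoid.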
Crafting Cache Control Policies
Minimizing round trips over the Web to revalidate cached items can make a huge difference in browser page load times. Perhaps the most dramatic illustration of this occurs when a user returns to a site for the second time, after an initial browser session. Without cache control rules, all page objects will have to be revalidated, each costing valuable fractions of a second (not to mention consuming bandwidth and server cycles). On the other hand, utilizing proper cache control allows each of these previously viewed objects to be served directly out of the browser's cache without going back to the server. The effect of adding cache control rules to page objects is often visible at page load time, even with a high bandwidth connection, and users may note that your sites appear to paint faster and that "flashing" is reduced between subsequent page loads. Besides improved user perception, the Web server is relieved of responding to cache revalidation requests, and thus is able to better serve new traffic.
However, in order to enjoy the benefits of caching, a developer needs to take time to write out a set of carefully crafted cache control policies that categorize a site's objects according to their intended lifetimes. Here is an example of a complete set of cache control policies for a simple e-commerce Web site:
Taking Charge of Caching
There are three methods of setting cache control rules for the Web:
Specify cache control headers via a <meta> tag
Set HTTP headers programmatically
Set HTTP headers through Web server settings
Each of these approaches has both pros and cons that we briefly summarize.
<meta> Tags for Basic Caching
The simplest way to implement cache control is to use the <meta> tag. For example, we could set the Expires header to sometime in the future:
<meta HTTP-EQUIV="Expires" content="Sun, 31 Oct 2004 23:59:00 GMT" />
In this case, a browser parsing this HTML will assume that this page does not expire until October 2004 and will add it to its cache. Because the page will be stamped with this Expires header, the browser won't re-request the page until after this date or until the user modifies the browser's caching preferences or clears the cache manually.
Of course, while it is often advantageous to cache page data, as we mentioned above there are instances when you would not want to cache data at all. In this case, you might set the Expires value to sometime in the past (the date shown here is only illustrative):
<meta HTTP-EQUIV="Expires" content="Mon, 01 Jan 1990 00:00:00 GMT" />
You might be concerned about clock variations on the user's system and therefore set the date far in the past, but in reality, this is rarely an issue since it is the server's Date response header that matters for cache control.
The use of Expires with a past date in a <meta> tag should work for both HTTP 1.0- and 1.1-compliant browsers. There are two more <meta> tags that are often used to make sure that a page is not cached. The Pragma tag is used to talk to HTTP 1.0 browser caches, while the Cache-Control tag is used for HTTP 1.1 clients. It never hurts to include both of these if you want to make sure that a page is never cached, regardless of browser type or version:
<meta HTTP-EQUIV="Pragma" content="no-cache" />
<meta HTTP-EQUIV="Cache-Control" content="no-cache" />
As easy as <meta> tags might appear, they suffer from one major problem: they cannot be read by intermediary proxy caches, which generally do not parse HTML data, but instead rely directly on HTTP headers to control caching policy. Because of this lost potential value, and given the fact that browsers will readily use HTTP headers, <meta>-based cache control really should not be a developer's primary approach to cache control.
Programming Cache Control
Most server-side programming environments, such as PHP, ASP, and ColdFusion, allow you to add or modify the HTTP headers that accompany a particular response. To do this in ASP, for example, you would use properties of the built-in Response object by including code such as this at the top of your page:
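<%
' Expires takes a value in minutes: 1440 minutes = 24 hours
Response.Expires = 1440
' max-age takes a value in seconds: 86400 seconds = 24 hours
Response.CacheControl = "max-age=86400,private"
%>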
Here you are asking ASP to create both an Expires header (for HTTP 1.0-compliant caches) and a Cache-Control header (for HTTP 1.1 caches). You are also specifying a freshness lifetime for this cached object of twenty-four hours (note that the Expires property requires a value in minutes while Cache-Control uses seconds). As a result, the following headers would be added to the HTTP response (assuming "now" is 8:46 PM on Friday, February 13th, 2004 Greenwich Mean Time):
Expires: Sat, 14 Feb 2004 20:46:04 GMT
Cache-control: max-age=86400,private
This and similar mechanisms in the other major server-side programming environments are a far more efficient way of communicating with caches than relying on <meta> tags. So, when you have a choice between implementing cache control policies using the <meta> element and doing so using a server-side programming environment like ASP, always choose the latter.
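The relationship between the two header values in the ASP example (minutes for Expires, seconds for max-age) can be checked with a short Python sketch. The helper name cache_headers is hypothetical, and the sketch only reproduces the arithmetic, not what ASP does internally.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(now: datetime, lifetime_minutes: int, visibility: str = "private") -> dict:
    """Build Expires and Cache-Control values from one freshness lifetime."""
    expires_at = now + timedelta(minutes=lifetime_minutes)
    return {
        # Expires is an absolute HTTP date, computed from the current time.
        "Expires": format_datetime(expires_at, usegmt=True),
        # max-age is a relative lifetime, expressed in seconds.
        "Cache-Control": f"max-age={lifetime_minutes * 60},{visibility}",
    }


# "Now" as assumed in the text: 8:46:04 PM GMT on Friday, February 13th, 2004
now = datetime(2004, 2, 13, 20, 46, 4, tzinfo=timezone.utc)
headers = cache_headers(now, 1440)

print(headers["Expires"])        # Sat, 14 Feb 2004 20:46:04 GMT
print(headers["Cache-Control"])  # max-age=86400,private
```

Deriving both values from a single lifetime keeps the HTTP 1.0 and HTTP 1.1 headers from drifting apart.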
However, there is a different issue about which the server-side programming environment can do nothing. Imagine that the ASP file for which we created the above code links to several images that, according to your cache control policies, have freshness lifetimes of a full year. How would you implement the HTTP headers to tell caches that they can store those images for that long? You could try to use a server-side script to return the images programmatically, but this is both complex and wasteful. A better approach for static externals like CSS, JavaScript, and binary objects is to set cache control information on the server itself.
Web Server Settings for Cache Control
Both Microsoft IIS and Apache provide a variety of facilities for cache control. Unfortunately, each has a different approach to caching, and delegation of cache control policies is not cleanly in the hands of those who are most familiar with a site's resources - developers!
Apache Cache Control
When it comes to implementing cache control policies easily, users of the Apache Web server are somewhat better off than those running IIS, provided that the Apache module mod_expires is installed.
With mod_expires, a server administrator can set expiration lifetimes for the different objects on a site in the main server configuration file (usually httpd.conf). As is often the case for Apache modules, both the Virtual Host and Directory containers can be used to specify different directives for different sites, or even for different directories within a given site. This is much more convenient than having to use IIS's graphical user interface or metabase scripting objects.
Even handier is mod_expires's ExpiresByType directive, which allows you to set the expiration lifetime for all files of a given MIME type with a single line of code. This directive allows you to easily set cache policies for all the scripts, style sheets, or images in your site. Of course, you often need to craft more fine-grained policies based upon object type or directory. In this case, the settings specified in the primary configuration file can be overridden (at the server administrator's discretion) by directives in a .htaccess file for a given directory and its children. In this way, developers can write and maintain their own cache control directives without requiring administrative access to the server, even in a shared hosting environment.
If your Apache server was not built with mod_expires, the best way to enable it is to build it as a shared object (using apxs is generally easiest) and to then include the following line in your httpd.conf file:
LoadModule expires_module modules/mod_expires.so
As mentioned above, you can then put your configuration directives right into httpd.conf. However, many administrators will want to locate these in an external configuration file to keep things neat. We will follow this practice in our example by using Apache's Include directive in httpd.conf (the IfModule container is an optional, but traditional, safety measure):
<IfModule mod_expires.c>
Include conf/expires.conf
</IfModule>
We can now locate the directives that control how mod_expires behaves in a module-specific configuration file called expires.conf. Here is a sample set of such directives:
ExpiresActive On
ExpiresDefault "access 1 month"
ExpiresByType image/png "access 3 months"
The ExpiresActive directive simply enables mod_expires, while the ExpiresDefault directive sets up a default expiration lifetime that will be used to create Expires and Cache-Control headers for any files that don't have a more specific rule applying to them. Note the syntax used for specifying the expiration lifetime; the time unit can be anything from seconds to years and the base time can be specified as modification, as well as access.
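To make the lifetime syntax concrete, here is a minimal Python sketch that turns such a string into a base and a number of seconds. It is not mod_expires's own parser: the function name parse_lifetime is hypothetical, the optional "plus" keyword is included on the assumption that it may appear between the base and the amounts, and the 30-day month and 365-day year are approximations chosen for this sketch.

```python
# Approximate unit lengths in seconds (month and year lengths are assumptions).
UNIT_SECONDS = {
    "year": 365 * 86400,
    "month": 30 * 86400,
    "week": 7 * 86400,
    "day": 86400,
    "hour": 3600,
    "minute": 60,
    "second": 1,
}


def parse_lifetime(spec: str):
    """Split e.g. 'access 1 month' into ('access', 2592000)."""
    tokens = spec.split()
    if not tokens or tokens[0] not in ("access", "now", "modification"):
        raise ValueError(f"unknown base time in {spec!r}")
    base, rest = tokens[0], tokens[1:]
    if rest and rest[0] == "plus":
        rest = rest[1:]  # the keyword carries no meaning of its own
    if not rest or len(rest) % 2 != 0:
        raise ValueError(f"expected '<number> <unit>' pairs in {spec!r}")
    total = 0
    for amount, unit in zip(rest[0::2], rest[1::2]):
        unit = unit.rstrip("s")  # accept both 'month' and 'months'
        if unit not in UNIT_SECONDS:
            raise ValueError(f"unknown time unit in {spec!r}")
        total += int(amount) * UNIT_SECONDS[unit]
    return base, total


print(parse_lifetime("access 1 month"))
print(parse_lifetime("access 3 months"))
print(parse_lifetime("modification plus 1 week 2 days"))
```

The base matters as much as the amount: a lifetime counted from "access" restarts for every visitor, while one counted from "modification" ends at the same moment for everyone.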
Next is the very useful ExpiresByType directive mentioned earlier, here applied to all .png image files on the server. The sample file ends with a Directory container:
<Directory "/usr/local/apache/htdocs/static">
AllowOverride Indexes
ExpiresDefault "access 6 months"
</Directory>
Finally we have a Directory container that overrides all other rules for anything in the directory /static. This directory has its own ExpiresDefault directive, as well as an AllowOverride directive that allows the settings for itself and its children to be overridden by means of .htaccess files. The .htaccess file, in turn, might look like this:
ExpiresByType text/html "access 1 week"
Note that this overrides any directives that would otherwise have applied to files in /static and its children that have the MIME type text/html. By using a combination of configuration file directives, and then overriding those directives by using .htaccess, almost any set of cache control policies, no matter how complex, can be easily implemented by both administrators and, if properly delegated, developers.
IIS Cache Control
If you are setting cache control rules on Microsoft's Internet Information Services (IIS), you need access to the IIS Metabase, which is typically accessed via the Internet Service Manager (ISM), the Microsoft management console application that controls IIS administrative settings.
To set an expiration time in IIS, simply open the ISM, bring up the property sheet for the file or directory whose expiration time you want to configure, and click on the "HTTP Headers" tab. Next, put a check in the box labeled "Enable Content Expiration" and, using the radio buttons, choose one of the options provided. You can choose to expire the content immediately, set a relative expiration time (in minutes, hours, or days), or set an absolute expiration time. Note that both Expires and Cache-Control headers will be inserted into the affected responses. The basic idea is shown here.
CacheRight from Port80 Software.
Modeled after mod_expires, CacheRight uses a single, text-based rules file, located in each Web site's document root, that allows both administrators and developers to set expiration directives for an entire site. In addition, CacheRight goes beyond mod_expires by adding an ExpiresByPath directive to complement the ExpiresByType directive. Armed with this functionality, it becomes trivially easy to set both a general cache control policy for files of a given type and to then override that rule with a more specific one for a subset of files of that type, as in this example:
ExpiresByType image/* : 6 months after access public
ExpiresByPath /navimgs/*, /logos/* : 1 year after modification public
Here, all images will have a freshness lifetime of six months, except for those located in the navimgs and logos directories. Like mod_expires, CacheRight lets you set the expiration times relative to the modification time of the file(s), as well as to the user's first access. This flexibility can be very useful when publication or update schedules are not set in stone, which is, of course, all too common in Web development.
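The "general rule by type, more specific rule by path" idea can be illustrated with a small Python sketch. This is not CacheRight's actual matching logic: the names TYPE_RULES, PATH_RULES, and lifetime_for are hypothetical, the file paths are made up, and the sketch simply assumes that a matching path rule takes precedence over a type rule.

```python
from fnmatch import fnmatchcase

# General policy by MIME type: (pattern, (lifetime, base time))
TYPE_RULES = [
    ("image/*", ("6 months", "access")),
]

# More specific policy by path: (patterns, (lifetime, base time))
PATH_RULES = [
    (("/navimgs/*", "/logos/*"), ("1 year", "modification")),
]


def lifetime_for(path: str, mime_type: str):
    """Return (lifetime, base) for an object, or None if no rule applies."""
    # Assumption: a matching path rule overrides any type rule.
    for patterns, rule in PATH_RULES:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return rule
    for pattern, rule in TYPE_RULES:
        if fnmatchcase(mime_type, pattern):
            return rule
    return None


print(lifetime_for("/logos/company.gif", "image/gif"))
print(lifetime_for("/photos/product.jpg", "image/jpeg"))
print(lifetime_for("/index.html", "text/html"))
```

Checking the narrower rule set first is what lets one broad rule cover most of a site while a handful of exceptions stay short and readable.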
Regardless of which server you use, it is well worth the time to figure out how to manage cache control at the server level. As with programmatic cache control, the directives will be respected by well-behaved intermediary caches and not just by browser caches, as in the case of the <meta> tag approach. Furthermore, unlike with programmatic or <meta>-based cache control, it is very easy using server-based cache control to set caching policies for all the really heavy objects in a site, such as images or Flash files. It is in the caching of these objects where performance gains are most obvious.
The Benefits of Caching
While we have only scratched the surface of the complex topic of caching, hopefully we have shed some light on this under-appreciated facet of Web site performance. In particular, we hope that you better understand why it is vital to have a set of cache control policies for your site and that you have gained some ideas about how to go about effectively implementing those policies. The results of properly applied cache rules will be obvious - dramatically faster page loads for your end users, especially repeat visitors. In addition, you will be making more efficient use of your bandwidth and your server resources. The fact that all these enhancements can be achieved through little more than a heightened attention to HTTP headers, and perhaps minimal software investments, makes effective, expiration-based cache-control one of the most cost-effective performance optimizations you will ever make to your site.
To be continued...
*Originally published at port80software.com.
Developing Your Site for Performance : Optimal Cache Control