For many years we have heard about the impending death of URLs that are difficult to type, remember, and preserve. So far the use of URLs has improved little, but changes are afoot in both development practices and Web server technology that should help advance URLs to the next generation.

Dirty URLs

Complex, hard-to-read URLs are often dubbed dirty URLs because they tend to be littered with punctuation and identifiers that are at best irrelevant to the ordinary user. URLs such as http://www.example.com/cgi-bin/gen.pl?id=4&view=basic are commonplace in today's dynamic Web. Unfortunately, dirty URLs have a variety of troubling aspects, including:
- Dirty URLs are difficult to type.
- Dirty URLs do not promote usability.
- Dirty URLs are a security risk.
- Dirty URLs impede abstraction and maintainability. Because dirty URLs generally expose the technology used (via the file extension) and the parameters used (via the query string), they do not promote abstraction. Instead of hiding such implementation details, dirty URLs expose the underlying "wiring" of a site. As a result, changing from one technology to another becomes a difficult and painful process, filled with the potential for broken links and numerous required redirects.

Why Use Dirty URLs?

Given the numerous problems with dirty URLs, one might wonder why they are used at all. The most obvious reason is simply convention -- using them has been, and so far still is, an accepted practice in Web development. This fact aside, dirty URLs do have a few real benefits, including:
- They are portable.
- They can discourage unwanted reuse. The negative aspects of a dirty URL can be regarded as positive when the intent is to discourage the user from typing a URL, remembering it, or saving it as a bookmark. The intimidating look and length of a dirty URL can signal to both user and search engine to stay away from a page that is bound to change. This is often simply a welcome side effect rather than a conscious access control policy -- frequently nothing is done to prevent actual use of the URL by means of session variables or referring-URL checks.

Cleaning URLs

The disadvantages of dirty URLs far outweigh their advantages in most situations. If the last 30 or 40 years of software development history are any indication of where development for the Web is headed, abstraction and data hiding will inevitably increase as Web sites and applications continue to grow in complexity. Thus, Web developers should work toward cleaner URLs by using the following techniques:
- Keep them short and sweet.
- Avoid punctuation in file names.
- Use lowercase and try to address case sensitivity issues. Given the last tip, you might instead name a file ProductSpecSheet.html. However, casing in URLs is troubling because, depending on the Web server's operating system, file and directory names may or may not be case sensitive. For example, http://www.xyz.com/Products.html and http://www.xyz.com/products.html are two different files on a UNIX system but the same file on a Windows system. Add to this the fact that www.xyz.com and WWW.XYZ.COM are always the same domain, and the potential for confusion becomes apparent. The best solution is to make all file and directory names lowercase by default and, in a case-sensitive server operating environment, to ensure that URLs will be processed correctly no matter what casing is used. This is not easy to do under Apache on Unix/Linux systems without help from a module such as mod_speling, which can transparently correct case mismatches in requested URLs.
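A minimal sketch of this, assuming a stock Apache build and a document root of /var/www/htdocs (both placeholders), would be to switch on mod_speling:

```apache
# Hypothetical httpd.conf excerpt: let mod_speling absorb case mismatches.
# On a failed lookup it compares the request against the files that actually
# exist and redirects when it finds a close match (wrong case or one
# character off).
LoadModule speling_module modules/mod_speling.so

<Directory "/var/www/htdocs">
    CheckSpelling On
</Directory>
```

With this in place, a request for /Products.html on a Unix host still reaches products.html, at the cost of one extra redirect.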
- Support multiple domain forms. If an organization has many forms to its name, such as International Business Machines and IBM, it is wise to register both forms. Some companies will register their legal form as well, so XYZ, LLC or ABC, Inc. might register xyzllc.com and abcinc.com in addition to their primary domains. While this seems like a significant investment, with one of the new breed of low-cost registrars the extra registrations cost very little. Some organizations go further and register common typos of their domain names (in the spirit of ww.amazon.com or gooogle.com) so that even mistyped addresses resolve.
- Map multiple URLs to common guessable site entry points. This is fairly easy to do, and many sites have already begun to create a variety of synonym URLs for sections. For example, to access the careers section of the site, the canonical URL might be http://www.xyz.com/careers. However, adding URLs like http://www.xyz.com/career, http://www.xyz.com/jobs, or http://www.xyz.com/hr is easy and vastly improves the chances that the user will hit the target. You could even go so far as to add hostname remapping so that http://investor.xyz.com, http://ir.xyz.com, http://investors.xyz.com, and so on all go to http://www.xyz.com/investor. The effort made to think about URLs in this fashion not only improves their usability, but should also promote long-term maintainability by encouraging the modularization of site information.
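Both the synonym paths and the hostname remapping can be sketched in Apache configuration. The xyz.com names come from the article's running example; the port, status code, and exact directives are assumptions:

```apache
# Synonym paths: several guessable entry points land on one canonical URL.
# Anchored RedirectMatch patterns are used instead of plain Redirect so
# that /careers itself is not prefix-matched and mangled by the /career rule.
RedirectMatch permanent ^/career$ http://www.xyz.com/careers
RedirectMatch permanent ^/jobs$   http://www.xyz.com/careers
RedirectMatch permanent ^/hr$     http://www.xyz.com/careers

# Hostname remapping: investor-related hostnames funnel to one section.
<VirtualHost *:80>
    ServerName  investor.xyz.com
    ServerAlias ir.xyz.com investors.xyz.com
    Redirect permanent / http://www.xyz.com/investor/
</VirtualHost>
```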
- Where possible, remove query strings by pre-generating dynamic pages.
- Rewrite query strings. In cases where pages must remain dynamic, it is still possible to clean up their query strings. Simple cleaning usually remaps the ?, &, and + symbols in a URL to more readily typeable characters. Thus, a URL like http://www.xyz.com/presssearch.asp?key=New+Robot&year=2003&view=print might become something like http://www.xyz.com/presssearch.asp/key/New-Robot/year/2003/view/print. While this makes the page "look" static, it is indeed still dynamic. The look of the URL is a little less intimidating to users and may be more search engine friendly as well (search engines have been known to halt at the ? character). In conjunction with the next tip, this might even discourage URL parameter manipulation by potential site hackers, who can't tell the difference between a dynamic page and a static one. The challenge with URL rewriting is that it takes some significant planning to do well, and the primary tools used for the purpose -- rule-based URL rewriters like Apache's mod_rewrite and ISAPI Rewrite for IIS -- have daunting rule syntax for developers unseasoned in the use of regular expressions. However, the effort to learn how to use these tools properly is well worth it.
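To make the idea concrete, here is roughly what such a rule might look like in Apache's mod_rewrite syntax. The article's example names an .asp script, which is kept purely for illustration; treat the pattern as a sketch rather than a drop-in rule:

```apache
RewriteEngine On

# /presssearch.asp/key/New-Robot/year/2003/view/print
#   -> /presssearch.asp?key=New-Robot&year=2003&view=print
# The visitor only ever sees the clean form; the query string is
# reassembled internally before the request reaches the script.
RewriteRule ^/presssearch\.asp/key/([^/]+)/year/(\d{4})/view/(print|basic)$ /presssearch.asp?key=$1&year=$2&view=$3 [L]
```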
- Remove extensions from files in URLs and source. Probably the most interesting URL improvement that can be made involves the concept of content negotiation. With negotiation in place, a visitor can request a resource without specifying its file extension -- http://www.xyz.com/products rather than products.html -- and the server works out which underlying file to serve. mod_negotiation provides this facility for Apache, filters like PageXchanger do the same for IIS, and tools like w3compiler also are being developed to improve page preparation for negotiation and transmission. One word of assurance: don't jump to the conclusion that your files won't be named page.html anymore. Remember that, on your server, the precious extensions are safe and sound. Content negotiation only means that the extensions disappear from source code, markup, and typed URLs.
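On Apache, the quickest way to experiment with this is mod_negotiation's MultiViews option; the directory path below is a placeholder:

```apache
# With MultiViews on, a request for /products makes mod_negotiation look
# for products.* (products.html, products.php, ...) and serve the best match.
<Directory "/var/www/htdocs">
    Options +MultiViews
</Directory>
```

Markup can then link to href="products" while the file keeps its .html extension on the server.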
- Automatically spell check directory and file names entered by users. The last tip is probably the least useful, but it is the easiest to implement: spell check your file and directory names. On the off chance that a user misspells a file name, makes a typo in an extension or path, or encounters a broken link, recovery is easy enough with a spelling check. Since the typo will generate a 404 on the server, a spelling module can jump in and try to match the file or directory name most likely intended. If file and directory names are relatively unique within a site, this last-ditch effort can correct numerous typos; if not, you get the 404 as expected. Creating simple "Did you mean X?"-style recovery requires little more than installing a server filter like URLSpellCheck for IIS (mod_speling plays a similar role on Apache). The performance hit is not an issue, given that the correction filter is invoked only on a 404 error, and it is better to deliver a proper page than to save a trivial amount of processing on error-page delivery. In short, there is no reason this shouldn't be done, and it is surprising that this feature is not built into all modern Web servers.

Conclusions

Most of the tips presented here are fairly straightforward, with the partial exception of URL cleaning and rewriting, and all of them can be accomplished with a reasonable amount of effort. The result of this effort should be cleaned URLs that are short, understandable, permanent, and devoid of implementation details -- URLs that significantly improve the usability, maintainability, and security of a Web site. The objections that developers and administrators might raise against next generation URLs will probably concern the performance of the server filters used to implement them or questions of search engine compatibility. As to the former, many of the required technologies are quite mature in the Apache world, and their newer IIS equivalents are usually explicitly modeled on the Apache exemplars, which bodes well. As to the search engine concerns, Google has so far shown no issue at all with cleaned URLs. At this point, the main thing standing in the way of the adoption of next generation URLs is the simple fact that so few developers know they are possible, while some who do are too comfortable with the status quo to explore them in earnest. This is a pity because, even if these improved URLs fall short of the ideal of "cool URIs" that never change, they are certainly a step in the right direction.

Further Reading

- Cool URIs don't change
- Jakob Nielsen on URLs! URLs! URLs!
- Making "clean" URLs with Apache and PHP
- Search Engine Friendly URLs with PHP and Apache: Part II

Apache Tools

For Apache, nearly all of the modules mentioned here can be found at modrewrite.com, and a good overview of content negotiation can be found in the mod_negotiation documentation.

IIS Tools

For IIS, iismodules.com and iisfaq.com list many commercially and freely available modules; URLSpellCheck and PageXchanger are available from port80software.com.




