
Creating a File Content Crawler with ColdFusion


Step‑by‑Step: Crafting a PDF File Crawler with ColdFusion

When you need to locate specific file types across a web server’s file system - such as PDFs hidden deep in a complex folder structure - hand‑rolled scripts can provide a lightweight alternative to external indexing tools. ColdFusion’s built‑in tags, <cfdirectory> and <cfloop>, let you explore directories recursively and collect matching file names with minimal effort. In this guide we’ll build a simple yet effective crawler that starts from a root folder, walks through every sub‑folder, and gathers all PDF files into a single list. You’ll also see how to keep the list unique and count the results so you can report on how many items you found.

The crawler is designed to run locally on your development machine or on a server where the file system is accessible to the CFML engine. Because it uses standard CFML tags, the same logic will work on any ColdFusion or Lucee installation, provided the account running the script has read permissions on the target directories. If you’re using a Windows system, the path syntax will look like D:\websites; on Unix‑based servers, replace it with something like /var/www/websites. The following example will use the Windows style for clarity.

First, let’s outline the main data structures the crawler will manipulate. We’ll keep a list of directories that still need to be processed, a list of PDFs that have been discovered, and a simple counter. The code will loop until there are no directories left to scan. Because ColdFusion handles lists as strings separated by a delimiter (default is a comma), we’ll use the pipe character | to avoid confusion with file paths that contain commas. By treating directories and file names as list entries, the loop logic becomes straightforward and easy to read.
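To make the list mechanics concrete, here is a minimal sketch of the pipe-delimited queue operations the crawler relies on (the paths are illustrative):

```cfml
<!--- A plain string acts as the queue; "|" is the delimiter --->
<cfset queue = "D:\websites">
<cfset queue = ListAppend(queue, "D:\websites\docs", "|")>
<!--- queue is now "D:\websites|D:\websites\docs" --->

<!--- Dequeue: read the first entry, then remove it --->
<cfset next = ListFirst(queue, "|")>          <!--- "D:\websites" --->
<cfset queue = ListDeleteAt(queue, 1, "|")>   <!--- queue is now "D:\websites\docs" --->
```

Appending new entries at the end and removing from the front gives the queue (first-in, first-out) behavior the crawl loop depends on.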

Below is the complete script, interspersed with explanatory comments. Copy it into a new CFML page, adjust the rootFolder variable to point to your own directory, and run it. After a short run you’ll see the crawler print every PDF file it finds, along with a final count.

```cfml
<!--- 1. Configuration: root folder and file extension to search --->
<cfset rootFolder = "D:\websites">
<cfset fileExtension = "pdf">

<!--- 2. State variables: lists and counter --->
<cfset dirsToCrawl = rootFolder>  <!--- initially contains only the root --->
<cfset pdfFiles = "">             <!--- collected PDFs will be appended here --->
<cfset fileCount = 0>

<!--- 3. Crawl loop: continue while there are directories left to process --->
<cfloop condition="Len(dirsToCrawl) GT 0">

    <!--- Pull the first directory off the list and prepare the next list --->
    <cfset currentDir = ListFirst(dirsToCrawl, "|")>
    <cfset dirsToCrawl = ListDeleteAt(dirsToCrawl, 1, "|")>

    <!--- Get the contents of the current directory --->
    <cfdirectory action="list" directory="#currentDir#" name="dirContents">

    <!--- Process each item in the directory --->
    <cfloop query="dirContents">

        <!--- Skip the special entries that represent the current and parent directories --->
        <cfif name EQ "." OR name EQ "..">
            <cfcontinue>
        </cfif>

        <!--- If the item is a folder, add it to the list of directories to crawl later --->
        <cfif type EQ "dir">
            <cfset dirsToCrawl = ListAppend(dirsToCrawl, currentDir & "\" & name, "|")>

        <!--- If the item is a file, check its extension --->
        <cfelseif type EQ "file">
            <cfif LCase(ListLast(name, ".")) EQ fileExtension>
                <!--- Build the full path of the file --->
                <cfset fullPath = currentDir & "\" & name>
                <!--- Avoid duplicates: only add if not already present --->
                <cfif NOT ListFind(pdfFiles, fullPath, "|")>
                    <cfset pdfFiles = ListAppend(pdfFiles, fullPath, "|")>
                    <cfset fileCount = fileCount + 1>
                </cfif>
            </cfif>
        </cfif>

    </cfloop>

</cfloop>

<!--- Output the results in an ordered list --->
<cfoutput>
    <hr>
    <h3>Found PDF Files:</h3>
    <ol>
        <cfloop list="#pdfFiles#" index="file" delimiters="|">
            <li>#file#</li>
        </cfloop>
    </ol>
    <hr>
    <p>Total PDFs found: #fileCount#</p>
</cfoutput>
```

Running the script will produce a neat ordered list of each PDF file discovered, with a total count at the bottom. Because new directories are appended to the end of the queue, the crawler performs a breadth‑first traversal: it processes sibling folders before moving deeper, which makes the scan order easy to predict even when the directory tree is very large. If you need a depth‑first approach instead, simply replace the ListAppend call for directories with ListPrepend so newly discovered folders are handled next.
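Switching traversal order is a one‑line change inside the directory branch of the loop:

```cfml
<!--- Breadth-first (as in the script above): new folders go to the back of the queue --->
<cfset dirsToCrawl = ListAppend(dirsToCrawl, currentDir & "\" & name, "|")>

<!--- Depth-first variant: new folders go to the front, so they are scanned next --->
<cfset dirsToCrawl = ListPrepend(dirsToCrawl, currentDir & "\" & name, "|")>
```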

When you’re satisfied with the list of files, you might want to copy or archive them elsewhere. The pdfFiles list can be passed to <cffile action="copy"> in a second loop, or you could feed it into a database table for further analysis. The current script stops after the crawl, but extending it to include such actions is trivial once you have the list in place.
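As a sketch of that extension, the loop below copies every discovered PDF into an archive folder. The destination path D:\archive is a hypothetical example; adjust it to your own environment:

```cfml
<!--- Copy every discovered PDF to an archive folder (illustrative destination) --->
<cfset archiveDir = "D:\archive">
<cfloop list="#pdfFiles#" index="pdfPath" delimiters="|">
    <cffile action="copy"
            source="#pdfPath#"
            destination="#archiveDir#\#GetFileFromPath(pdfPath)#">
</cfloop>
```

GetFileFromPath() strips the directory portion so each file lands directly in the archive folder; note that files with identical names from different folders will overwrite one another under this scheme.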

One more tip: if your server runs multiple instances of the script simultaneously, guard access to the dirsToCrawl variable with <cflock> or a database lock to prevent race conditions. For most single‑user development scenarios, the simple loop shown above is sufficient.
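A minimal sketch of guarding a shared queue with ColdFusion's <cflock> tag, assuming the queue has been moved into the application scope so concurrent requests can see it (newDir is a placeholder for a directory discovered during the crawl):

```cfml
<!--- Serialize writes to a shared queue with a named exclusive lock --->
<cflock name="pdfCrawlerQueue" type="exclusive" timeout="10">
    <cfset application.dirsToCrawl = ListAppend(application.dirsToCrawl, newDir, "|")>
</cflock>
```

Every reader and writer of the shared variable must use the same lock name for the protection to be effective.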

For anyone looking to stay on top of the latest ColdFusion tutorials, EasyCFM offers new tutorials each week from the site owner, Pablo Varando, and community contributors.
