
Creating a File Content Crawler with ColdFusion


Step‑by‑Step: Crafting a PDF File Crawler with ColdFusion

When you need to locate specific file types across a web server’s file system - such as PDFs hidden deep in a complex folder structure - hand‑rolled scripts can provide a lightweight alternative to external indexing tools. ColdFusion’s built‑in tags, <cfdirectory> and <cfloop>, let you explore directories recursively and collect matching file names with minimal effort. In this guide we’ll build a simple yet effective crawler that starts from a root folder, walks through every sub‑folder, and gathers all PDF files into a single list. You’ll also see how to keep the list unique and count the results so you can report on how many items you found.

The crawler is designed to run locally on your development machine or on a server where the file system is accessible to the CFML engine. Because it uses standard CFML tags, the same logic will work on any ColdFusion or Lucee installation, provided the account running the script has read permissions on the target directories. If you’re using a Windows system, the path syntax will look like D:\websites; on Unix‑based servers, replace it with something like /var/www/websites. The following example will use the Windows style for clarity.

First, let’s outline the main data structures the crawler will manipulate. We’ll keep a list of directories that still need to be processed, a list of PDFs that have been discovered, and a simple counter. The code will loop until there are no directories left to scan. Because ColdFusion handles lists as strings separated by a delimiter (default is a comma), we’ll use the pipe character | to avoid confusion with file paths that contain commas. By treating directories and file names as list entries, the loop logic becomes straightforward and easy to read.
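To make the list mechanics concrete, here is a minimal sketch of the pipe-delimited queue operations the crawler relies on (the paths are illustrative):

```cfml
<!--- A plain string acts as the queue; "|" is the delimiter --->
<cfset queue = "D:\websites">
<cfset queue = ListAppend(queue, "D:\websites\docs", "|")>
<!--- queue is now "D:\websites|D:\websites\docs" --->

<!--- Dequeue: read the first entry, then remove it --->
<cfset next = ListFirst(queue, "|")>          <!--- "D:\websites" --->
<cfset queue = ListDeleteAt(queue, 1, "|")>   <!--- queue is now "D:\websites\docs" --->
```

Appending new entries at the end and removing from the front gives the queue (first-in, first-out) behavior the crawl loop depends on.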

Below is the complete script, interspersed with explanatory comments. Copy it into a new CFML page, adjust the rootFolder variable to point to your own directory, and run it. After a short run you’ll see the crawler print every PDF file it finds, along with a final count.

```cfml
<!--- 1. Configuration: root folder and file extension to search --->
<cfset rootFolder = "D:\websites">
<cfset fileExtension = "pdf">

<!--- 2. State variables: lists and counter --->
<cfset dirsToCrawl = rootFolder>  <!--- initially contains only the root --->
<cfset pdfFiles = "">             <!--- collected PDFs will be appended here --->
<cfset fileCount = 0>

<!--- 3. Crawl loop: continue while there are directories left to process --->
<cfloop condition="Len(dirsToCrawl) GT 0">

    <!--- Pull the first directory off the list and prepare the next list --->
    <cfset currentDir = ListFirst(dirsToCrawl, "|")>
    <cfset dirsToCrawl = ListDeleteAt(dirsToCrawl, 1, "|")>

    <!--- Get the contents of the current directory --->
    <cfdirectory action="list" directory="#currentDir#" name="dirContents">

    <!--- Process each item in the directory --->
    <cfloop query="dirContents">

        <!--- Skip the special entries that represent the current and parent directories --->
        <cfif name EQ "." OR name EQ "..">
            <cfcontinue>
        </cfif>

        <!--- If the item is a folder, add it to the list of directories to crawl later --->
        <cfif type EQ "dir">
            <cfset dirsToCrawl = ListAppend(dirsToCrawl, currentDir & "\" & name, "|")>

        <!--- If the item is a file, check its extension --->
        <cfelseif type EQ "file">
            <cfif LCase(ListLast(name, ".")) EQ fileExtension>
                <!--- Build the full path of the file --->
                <cfset fullPath = currentDir & "\" & name>
                <!--- Avoid duplicates: only add if not already present --->
                <cfif NOT ListFind(pdfFiles, fullPath, "|")>
                    <cfset pdfFiles = ListAppend(pdfFiles, fullPath, "|")>
                    <cfset fileCount = fileCount + 1>
                </cfif>
            </cfif>
        </cfif>

    </cfloop>

</cfloop>

<!--- Output the results in an ordered list --->
<cfoutput>
    <hr>
    <h3>Found PDF Files:</h3>
    <ol>
        <cfloop list="#pdfFiles#" index="file" delimiters="|">
            <li>#file#</li>
        </cfloop>
    </ol>
    <hr>
    <p>Total PDFs found: #fileCount#</p>
</cfoutput>
```

Running the script will produce a neat ordered list of each PDF file discovered, with a total count at the bottom. Because new directories are appended to the end of the queue, the crawler performs a breadth‑first traversal: it processes sibling folders before moving deeper, which makes the scan order easy to predict even when the directory tree is very large. If you need a depth‑first approach instead, simply replace the ListAppend call for directories with ListPrepend so newly discovered folders are handled next.
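Switching traversal order is a one‑line change inside the directory branch of the loop:

```cfml
<!--- Breadth-first (as in the script above): new folders go to the back of the queue --->
<cfset dirsToCrawl = ListAppend(dirsToCrawl, currentDir & "\" & name, "|")>

<!--- Depth-first variant: new folders go to the front, so they are scanned next --->
<cfset dirsToCrawl = ListPrepend(dirsToCrawl, currentDir & "\" & name, "|")>
```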

When you’re satisfied with the list of files, you might want to copy or archive them elsewhere. The pdfFiles list can be passed to <cffile action="copy"> in a second loop, or you could feed it into a database table for further analysis. The current script stops after the crawl, but extending it to include such actions is trivial once you have the list in place.
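As a sketch of that extension, the loop below copies every discovered PDF into an archive folder. The destination path D:\archive is a hypothetical example; adjust it to your own environment:

```cfml
<!--- Copy every discovered PDF to an archive folder (illustrative destination) --->
<cfset archiveDir = "D:\archive">
<cfloop list="#pdfFiles#" index="pdfPath" delimiters="|">
    <cffile action="copy"
            source="#pdfPath#"
            destination="#archiveDir#\#GetFileFromPath(pdfPath)#">
</cfloop>
```

GetFileFromPath() strips the directory portion so each file lands directly in the archive folder; note that files with identical names from different folders will overwrite one another under this scheme.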

One more tip: if your server runs multiple instances of the script simultaneously, guard access to the dirsToCrawl variable with <cflock> or a database lock to prevent race conditions. For most single‑user development scenarios, the simple loop shown above is sufficient.
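A minimal sketch of guarding a shared queue with ColdFusion's <cflock> tag, assuming the queue has been moved into the application scope so concurrent requests can see it (newDir is a placeholder for a directory discovered during the crawl):

```cfml
<!--- Serialize writes to a shared queue with a named exclusive lock --->
<cflock name="pdfCrawlerQueue" type="exclusive" timeout="10">
    <cfset application.dirsToCrawl = ListAppend(application.dirsToCrawl, newDir, "|")>
</cflock>
```

Every reader and writer of the shared variable must use the same lock name for the protection to be effective.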

For anyone looking to stay on top of the latest ColdFusion tutorials, EasyCFM offers new tutorials each week from the site owner, Pablo Varando, and community contributors.
