Introduction
The Compound File Binary Format, more commonly abbreviated CFB or CFBF and referred to in this article as CFF, is a structured storage format that Microsoft software has used since the early 1990s. It serves as the underlying container for many proprietary file types, most notably the older Microsoft Office documents with extensions such as .doc, .xls, and .ppt. CFF allows multiple independent streams of binary data to coexist within a single file, providing a hierarchical directory structure and sector-based allocation of space. The format was originally developed to address the limitations of flat file storage and to support the Object Linking and Embedding (OLE) technology that enabled embedding of documents within other documents. Over time, CFF has become a common choice for applications that require complex data organization without resorting to external databases.
Despite the advent of newer packaging formats such as the Open Packaging Convention (OPC) and the use of ZIP archives for Office Open XML files, CFF remains in widespread use. Many legacy systems, enterprise workflows, and file-processing libraries depend on its robustness and compatibility. The format’s persistence also stems from its ability to store large binary objects efficiently while maintaining backward compatibility with older applications that cannot read newer file types. As a result, CFF continues to be a focus of interest for developers, forensic analysts, and data recovery specialists.
History and Development
Early Windows Storage and OLE
Before structured storage was available, application data was typically stored in simple flat files, which limited the ability to embed complex structures. Microsoft introduced Structured Storage in the early 1990s as part of OLE 2.0 (Object Linking and Embedding), during the Windows 3.1 era, to provide a mechanism for storing hierarchical data within a single file. Structured Storage is exposed through the Windows API, and the on-disk layout used by its default implementation is the Compound File Binary Format, which became the foundation for storing multiple streams and storages in a single file.
Microsoft Office adopted CFF for its binary document formats during the 1990s, and the formats used from Office 97 through Office 2003 are the best-known examples. The format’s design allowed Office to embed images and other auxiliary objects directly into the document file. This capability facilitated features such as OLE objects, embedded spreadsheets, and other rich media. The format includes a mini-stream for storing small objects and a double-indirect file allocation table (DIFAT) for locating the allocation table sectors of larger files; a later revision (major version 4) added support for 4096-byte sectors.
Evolution Through Office Versions
Office 2007 marked a significant shift in file formats with the introduction of Office Open XML (OOXML), which relies on the Open Packaging Convention and ZIP archives. Nevertheless, Microsoft continued to support the older CFF-based formats for backward compatibility. The XML-based formats have been the default since Office 2007, but later versions of Office can still open and save .doc (CFF) alongside .docx (OPC). Many legacy applications still rely on CFF, particularly in industries where document archival and long-term storage are critical.
Throughout its lifespan, CFF has remained a stable platform for developers due to its well-documented specification. Microsoft released official documentation detailing the format’s architecture, sector allocation, and stream handling. Several third-party libraries emerged, such as the Apache POI project for Java and the OpenMCDF library for .NET, providing APIs to read and write CFF files. The open-source community has also contributed numerous utilities for manipulating CFF streams, enabling tasks such as file repair, data extraction, and forensic analysis.
File Structure and Architecture
Sector Allocation and the FAT
CFF files are divided into fixed-size sectors of 512 bytes (format version 3) or 4096 bytes (format version 4). The file begins with the compound file header, which holds metadata such as the sector size, the number of allocation table sectors, and the locations of critical structures such as the first directory sector. The remaining sectors are tracked by the sector allocation table (SAT), which Microsoft's specification calls the FAT because it functions similarly to the File Allocation Table in FAT file systems. Each entry in the SAT points to the next sector in a stream or marks the end of a stream with a special value.
The SAT allows CFF to tolerate fragmented storage. When a stream grows beyond the capacity of a single sector, additional sectors are allocated and linked through the SAT. The structure supports both contiguous and non-contiguous storage, thereby accommodating streams of varying size. In larger CFF files, a double-indirect file allocation table (DIFAT) locates the sectors that hold the SAT itself: the header has room for 109 such references, which with 512-byte sectors covers roughly 6.8 MB, and additional DIFAT sectors extend the table beyond that point.
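A minimal sketch of decoding the header's fixed fields is shown here. The field order follows the published header layout; the function name `parse_header` and the dictionary keys are illustrative choices, not part of any library.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# First 76 bytes of the header; 109 four-byte DIFAT entries follow at offset 76.
HEADER_FORMAT = "<8s16sHHHHH6sIIIIIIIII"
FREESECT = 0xFFFFFFFF  # marks an unused DIFAT slot


def parse_header(data: bytes) -> dict:
    """Decode selected little-endian fields of a compound file header."""
    if len(data) < 512:
        raise ValueError("the header occupies at least 512 bytes")
    (signature, _clsid, _minor, major, byte_order, sector_shift,
     mini_sector_shift, _reserved, _num_dir_sectors, num_fat_sectors,
     first_dir_sector, _transaction, mini_cutoff, first_mini_fat,
     _num_mini_fat, first_difat, num_difat) = struct.unpack_from(HEADER_FORMAT, data, 0)
    if signature != CFB_SIGNATURE:
        raise ValueError("signature mismatch: not a compound file")
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    difat = struct.unpack_from("<109I", data, 76)
    return {
        "major_version": major,
        "sector_size": 1 << sector_shift,          # 512 or 4096
        "mini_sector_size": 1 << mini_sector_shift,  # normally 64
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
        "mini_stream_cutoff": mini_cutoff,
        "first_mini_fat_sector": first_mini_fat,
        "first_difat_sector": first_difat,
        "num_difat_sectors": num_difat,
        "fat_sector_locations": [s for s in difat if s != FREESECT],
    }
```

In practice the function would be given the first 512 bytes read from a file opened in binary mode.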
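As a rough check on when extra DIFAT sectors become necessary, the arithmetic can be sketched as follows. It assumes the published layout in which the header holds 109 allocation-table sector references and each table entry is 4 bytes; the function name is illustrative.

```python
HEADER_DIFAT_ENTRIES = 109  # table-sector references stored in the header itself
ENTRY_SIZE = 4              # each allocation table entry is a 32-bit value


def bytes_covered_by_header_difat(sector_size: int) -> int:
    """Sector area that can be mapped before any DIFAT sector is needed."""
    entries_per_table_sector = sector_size // ENTRY_SIZE
    mappable_sectors = HEADER_DIFAT_ENTRIES * entries_per_table_sector
    return mappable_sectors * sector_size


print(bytes_covered_by_header_difat(512))   # 7143424, about 6.8 MiB
print(bytes_covered_by_header_difat(4096))  # 457179136, about 436 MiB
```

The figure counts every sector the table can describe, including the sectors occupied by the table itself, so the usable payload is slightly smaller.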
Directory Entries and the Mini-Stream
Every stream or storage object in a CFF file is represented by a 128-byte directory entry stored in the directory sectors. A directory entry contains the object’s name, its type (stream or storage), its size, its starting sector, and the identifiers of its sibling and child entries. The children of each storage form a red-black tree ordered by name, and each storage entry points to the root of that tree, enabling hierarchical navigation similar to a file system. The whole structure is serialized into a chain of directory sectors, each containing multiple entries.
Small streams, those shorter than the 4096-byte cutoff, are stored in a dedicated mini-stream to reduce overhead. The mini-stream is itself an ordinary stream whose location is recorded in the root directory entry, and it is divided into 64-byte mini-sectors tracked by a separate mini FAT (called the short-sector allocation table, or SSAT, in some documentation). When a stream is small, its data is stored in the mini-stream, and its directory entry records the first mini-sector of its chain. This mechanism optimizes storage for small objects, such as document properties and style definitions.
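To make the entry layout concrete, here is a hedged sketch that unpacks one 128-byte directory entry. The field order follows the published specification; `parse_directory_entry` and the returned keys are names chosen for this illustration.

```python
import struct

# name (64 bytes, UTF-16LE), name length, type, color, left, right, child,
# CLSID, state bits, created, modified, starting sector, stream size
DIR_ENTRY_FORMAT = "<64sHBBIII16sIQQIQ"
OBJECT_TYPES = {0: "unallocated", 1: "storage", 2: "stream", 5: "root storage"}
NOSTREAM = 0xFFFFFFFF  # "no sibling" / "no child" marker


def parse_directory_entry(raw: bytes) -> dict:
    """Decode a single 128-byte directory entry."""
    if len(raw) != 128:
        raise ValueError("a directory entry is exactly 128 bytes")
    (name_raw, name_len, obj_type, _color, left, right, child,
     _clsid, _state, _created, _modified, start_sector, size) = struct.unpack(
        DIR_ENTRY_FORMAT, raw)
    # name_len counts bytes and includes the two-byte terminator.
    name = name_raw[:max(name_len - 2, 0)].decode("utf-16-le")
    return {
        "name": name,
        "type": OBJECT_TYPES.get(obj_type, "invalid"),
        "left": None if left == NOSTREAM else left,
        "right": None if right == NOSTREAM else right,
        "child": None if child == NOSTREAM else child,
        "start_sector": start_sector,
        "size": size,
    }
```

Entries are numbered by their position in the directory stream, so a directory sector of 512 bytes holds four of them.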
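Because the mini-stream is itself stored in regular sectors, locating a mini-sector takes two steps: find its byte position within the mini-stream, then map that position onto the regular sector that contains it. The sketch below assumes the caller has already obtained `container_chain`, the list of regular sectors holding the mini-stream; that name and the function name are hypothetical.

```python
def locate_mini_sector(mini_sector: int, container_chain: list,
                       sector_size: int = 512, mini_sector_size: int = 64) -> int:
    """Return the absolute file offset of a mini-sector."""
    # Step 1: position inside the mini-stream.
    position = mini_sector * mini_sector_size
    # Step 2: which regular sector of the container holds that position.
    index, within = divmod(position, sector_size)
    sector = container_chain[index]
    # Regular sector n starts at (n + 1) * sector_size because of the header.
    return (sector + 1) * sector_size + within


# With 512-byte sectors, each regular sector holds eight 64-byte mini-sectors.
print(locate_mini_sector(9, [5, 9]))  # 5184
```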
Property Streams and Metadata
CFF files commonly carry property set streams that provide metadata about the file and its components. The best known is the Summary Information stream, which holds fields such as title, subject, author, and keywords. A companion stream, Document Summary Information, holds additional document-level attributes such as category, manager, and company. These streams are identified by reserved stream names and by format identifiers (FMTIDs) stored inside the stream, and applications that support the format parse them automatically.
Property set streams follow a binary structure defined by the OLE property set specification. The stream begins with a header that records the byte order, a version, and the number of property sets, followed by a format identifier and an offset for each set. Each property set then gives its size and number of properties, a table of property identifier and offset pairs, and finally the typed values themselves. This structure enables applications to store arbitrary metadata in a binary form while maintaining compatibility across different software versions.
Key Concepts and Terminology
Sector and Mini-Sector
A sector is the fundamental storage unit in CFF. Its size is given by the sector shift field in the compound file header and is tied to the format version: 512 bytes in version 3 files and 4096 bytes in version 4 files. A mini-sector is a smaller unit of storage, 64 bytes, used exclusively within the mini-stream. Whether a stream is stored in sectors or mini-sectors depends on its size relative to the mini-stream cutoff of 4096 bytes.
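Since the header occupies the space before sector 0, converting a sector number to a file offset is a one-line calculation. The helper below is an illustrative sketch; `sector_shift` is the power-of-two exponent stored in the header (9 for 512-byte sectors, 12 for 4096-byte sectors).

```python
def sector_offset(sector_number: int, sector_shift: int = 9) -> int:
    """File offset of a regular sector; sector 0 begins right after the header."""
    return (sector_number + 1) << sector_shift


print(sector_offset(0))      # 512
print(sector_offset(3))      # 2048
print(sector_offset(0, 12))  # 4096
```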
Stream and Storage
In the context of CFF, a stream is a logically contiguous sequence of bytes representing a specific piece of data, such as the document body or an embedded image; on disk its sectors may be scattered, and the allocation chain restores their order. A storage object functions as a container that can hold multiple streams and other storage objects, creating a hierarchical structure akin to directories in a file system. Streams are identified by names and are accessible via directory entries.
File Allocation Table (FAT) and Double-Indirect FAT (DIFAT)
The FAT in CFF serves as a map that associates each sector with its successor in a stream. Each entry is either the index of the next sector or a special value marking the end of a chain, a free sector, or a sector occupied by the FAT or DIFAT itself. The DIFAT records where the FAT sectors themselves are located. Its first 109 entries are stored in the compound file header; when a file needs more FAT sectors than that, the remaining entries are stored in additional DIFAT sectors that are chained together.
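The chain-following idea can be sketched in a few lines. The special values are those defined in the published specification; `follow_chain` and `describe` are illustrative names, and this version trusts its input rather than guarding against damaged tables.

```python
DIFSECT = 0xFFFFFFFC     # sector holds part of the DIFAT
FATSECT = 0xFFFFFFFD     # sector holds part of the FAT
ENDOFCHAIN = 0xFFFFFFFE  # last sector of a stream
FREESECT = 0xFFFFFFFF    # unallocated sector

SPECIAL = {DIFSECT: "DIFAT sector", FATSECT: "FAT sector",
           ENDOFCHAIN: "end of chain", FREESECT: "free"}


def describe(entry: int) -> str:
    """Human-readable meaning of a FAT entry."""
    return SPECIAL.get(entry, f"next sector {entry}")


def follow_chain(fat: list, start: int) -> list:
    """Collect the sector numbers of one stream, in order."""
    chain = []
    current = start
    while current != ENDOFCHAIN:
        chain.append(current)
        current = fat[current]  # each entry names the successor sector
    return chain


fat = [1, 4, FREESECT, FATSECT, ENDOFCHAIN]
print(follow_chain(fat, 0))  # [0, 1, 4]
print(describe(fat[3]))      # FAT sector
```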
Directory Entry Types
Each directory entry in CFF carries an object type: “Storage”, “Stream”, “Root Storage”, or “Unknown” (an unallocated entry). Each type has specific attributes. For example, the “Root Storage” is the top-level container for all objects within the file and also records the location of the mini-stream. There is no separate type for property sets; they are ordinary stream entries with reserved names. The type determines how the entry is interpreted and accessed by applications.
Property Set Stream Format
A property set stream contains one or two property sets. Each property set begins with its size and the number of properties, followed by a table that pairs each property identifier with an offset. At each offset is a typed value: a type tag (e.g., VT_LPSTR, VT_LPWSTR, VT_I4) followed by the value itself. The format adheres to the OLE property set specification, ensuring interoperability among different platforms and languages.
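The following sketch reads the first property set of such a stream and decodes only two value types, VT_I4 and VT_LPSTR. It is a simplified illustration: the function name is hypothetical, the default code page of cp1252 is an assumption (real files declare their code page in a property), and other types are returned as None.

```python
import struct

VT_I4 = 0x0003     # 32-bit signed integer
VT_LPSTR = 0x001E  # length-prefixed, null-terminated code-page string


def parse_first_property_set(data: bytes, codepage: str = "cp1252") -> dict:
    """Return {property_id: value} for the first property set in the stream."""
    byte_order, _version, _system, _clsid, num_sets = struct.unpack_from(
        "<HHI16sI", data, 0)
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    if num_sets < 1:
        raise ValueError("stream contains no property sets")
    # The 28-byte header is followed by (FMTID, offset) for each set.
    _fmtid, set_offset = struct.unpack_from("<16sI", data, 28)
    _set_size, num_props = struct.unpack_from("<II", data, set_offset)
    props = {}
    for i in range(num_props):
        prop_id, rel = struct.unpack_from("<II", data, set_offset + 8 + 8 * i)
        pos = set_offset + rel  # offsets are relative to the set's start
        (vtype,) = struct.unpack_from("<H", data, pos)  # 2 padding bytes follow
        if vtype == VT_I4:
            (props[prop_id],) = struct.unpack_from("<i", data, pos + 4)
        elif vtype == VT_LPSTR:
            (length,) = struct.unpack_from("<I", data, pos + 4)
            raw = data[pos + 8:pos + 8 + length]
            props[prop_id] = raw.split(b"\x00", 1)[0].decode(codepage, "replace")
        else:
            props[prop_id] = None  # type not handled in this sketch
    return props
```

A fuller reader would also handle VT_LPWSTR, VT_FILETIME, and the optional second property set.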
Implementation and Parsing
Reading CFF Files
To read a CFF file, a parser first validates the compound file header by checking the signature bytes and reading the sector size. It then reads the DIFAT to locate the FAT sectors and loads the FAT to map out the allocation of sectors. After establishing the sector allocation, the parser traverses the directory sector chain, collecting directory entries and constructing an in-memory representation of the file’s hierarchy.
For each stream entry, the parser uses the FAT to locate all sectors belonging to that stream. When a stream is smaller than the mini-stream cutoff, the parser instead follows its mini-sector chain through the mini FAT and reads the data from the mini-stream. The parser also reads any property streams and interprets them according to the OLE property set format, enabling extraction of metadata.
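Once a stream's sector chain is known, assembling its bytes is straightforward. This sketch assumes the whole file is already in memory as `data` and that `chain` was obtained from the allocation table beforehand; both names, and the function name, are illustrative.

```python
def read_stream(data: bytes, chain: list, size: int, sector_size: int = 512) -> bytes:
    """Concatenate a stream's sectors in chain order and trim to its real size."""
    parts = []
    for sector in chain:
        start = (sector + 1) * sector_size  # skip the header-sized first block
        parts.append(data[start:start + sector_size])
    # The last sector is usually only partly used; the directory entry's
    # size field says where the stream really ends.
    return b"".join(parts)[:size]


# A toy image: header, then three sectors filled with A, B, and C.
image = bytes(512) + b"A" * 512 + b"B" * 512 + b"C" * 512
print(read_stream(image, [2, 0], 600)[510:514])  # b'CCAA'
```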
Writing CFF Files
Writing a CFF file involves the reverse process. A writer begins by laying out the header, the FAT sectors, the directory sectors, and the stream data, along with DIFAT sectors if they are needed. It then fills in the FAT entries to link each sector to its successor. If the file needs more than the 109 FAT sectors that the header can reference, the writer allocates DIFAT sectors to hold the additional references.
When writing streams, the writer determines whether each stream belongs in the mini-stream based on its size. If so, the data is packed into the mini-stream and tracked in the mini FAT; otherwise, the stream data is stored in regular sectors. The writer updates the directory entries to reflect the new streams and properties, ensuring that all starting sectors and sizes are accurate.
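The two decisions a writer makes for each stream, where it goes and how its sectors are linked, can be sketched as below. The function names are hypothetical, and the defaults assume 512-byte sectors, 64-byte mini-sectors, and a 4096-byte cutoff.

```python
ENDOFCHAIN = 0xFFFFFFFE


def plan_stream(size: int, sector_size: int = 512,
                mini_sector_size: int = 64, cutoff: int = 4096) -> tuple:
    """Decide where a stream is stored and how many units it needs."""
    if size == 0:
        return ("empty", 0)
    kind, unit = ("mini", mini_sector_size) if size < cutoff else ("regular", sector_size)
    return (kind, -(-size // unit))  # ceiling division


def link_chain(sectors: list) -> dict:
    """Allocation-table entries that link the given sectors into one chain."""
    entries = {}
    for current, following in zip(sectors, sectors[1:]):
        entries[current] = following
    if sectors:
        entries[sectors[-1]] = ENDOFCHAIN
    return entries


print(plan_stream(100))       # ('mini', 2)
print(plan_stream(4096))      # ('regular', 8)
print(link_chain([3, 7, 8]))  # {3: 7, 7: 8, 8: 4294967294}
```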
Libraries and Tools
Apache POI (Java): Provides classes such as HSSFWorkbook and HSSFSheet for reading and writing older Office binary formats.
OpenMCDF (.NET): An open-source library that offers a low-level API for manipulating CFF files, including directory and stream handling.
libmcdf (C): A lightweight library for accessing Compound Document Format files, suitable for embedded systems.
oletools (Python): A suite of utilities for extracting metadata and analyzing OLE Compound Documents.
VBA OLEObject methods: Microsoft Office’s own objects for working with embedded documents.
Common Challenges
Sector size mismatch: Some files use a 512-byte sector size, while others use 4096 bytes. A parser must adapt to the sector size specified in the header.
Fragmentation: Large streams may be scattered across noncontiguous sectors, requiring careful traversal of the FAT to reconstruct the data.
Corruption detection: Incomplete FAT entries or invalid sector chains can lead to file corruption. Robust parsers implement validation checks and error recovery routines.
Endianness: The format uses little-endian byte ordering, which must be respected when reading and writing numeric values.
Unicode handling: Property streams may contain UTF-16 strings; proper decoding is necessary to preserve text integrity.
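For the corruption checks mentioned in this list, a defensive chain walker might look like the sketch below. It rejects out-of-range sector numbers and detects loops instead of following them forever; the function name and error messages are illustrative.

```python
MAXREGSECT = 0xFFFFFFFA  # largest value that can be a real sector number
ENDOFCHAIN = 0xFFFFFFFE


def checked_chain(fat: list, start: int) -> list:
    """Walk a chain, raising ValueError on damage instead of looping or crashing."""
    seen = set()
    chain = []
    current = start
    while current != ENDOFCHAIN:
        if current > MAXREGSECT or current >= len(fat):
            raise ValueError(f"chain points at invalid sector {current:#x}")
        if current in seen:
            raise ValueError(f"chain loops back to sector {current}")
        seen.add(current)
        chain.append(current)
        current = fat[current]
    return chain
```

A recovery tool could catch the exception and keep the sectors collected so far, which is often enough to salvage part of a damaged stream.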
Applications and Usage
Microsoft Office Legacy Formats
The most prominent use of CFF is within legacy Microsoft Office documents. The binary .doc, .xls, and .ppt files embed multiple streams containing text, graphics, and formatting information. Each document’s metadata, such as author and creation date, is stored in property set streams. CFF’s ability to embed large binary objects directly into a single file made it ideal for Office’s early document models.
Embedded Systems and Resource-Constrained Devices
Some embedded systems adopt CFF to store configuration data, firmware images, or other binary resources in a self-contained format. The hierarchical structure allows these devices to maintain multiple related configuration streams without external dependencies. Additionally, the mini-stream mechanism reduces storage overhead for small configuration values.
Forensic Analysis and Digital Forensics
Digital forensic investigators frequently encounter CFF files during evidence analysis. The format’s detailed metadata and embedded streams enable the extraction of crucial information, such as document authoring history and embedded OLE objects. Tools that parse CFF allow investigators to reconstruct deleted data and validate document integrity. Because CFF is used in many legacy systems, forensic expertise in this format remains valuable.
Archival and Data Preservation
Libraries, archives, and government agencies often preserve documents in their native CFF format to maintain authenticity and avoid format conversion errors. The self-contained nature of CFF simplifies archival by eliminating external dependencies. Long-term preservation efforts may involve migrating CFF files to newer formats, but many institutions retain the original binary files to preserve provenance.
Software Development and Testing
Developers working on office interoperability libraries or testing document conversion utilities must handle CFF files. By manipulating CFF streams directly, developers can craft test cases that exercise edge conditions such as extreme fragmentation or custom property sets. This capability is essential for ensuring compatibility across different Office versions.
Malware Analysis
Some malware variants embed malicious payloads within Office documents using the OLE Compound Document structure. Analysts parse CFF files to isolate malicious streams, such as embedded scripts or macro code. Understanding the format’s intricacies allows analysts to detect obfuscation techniques, such as stream encryption or unconventional stream ordering, that malware uses to evade detection.
Malware Use and Detection
Embedding Malicious Payloads
Malicious authors sometimes exploit the CFF format’s ability to contain arbitrary binary streams. By embedding malicious macro code or encrypted payloads within a .doc or .xls file, attackers can create seemingly innocuous documents that trigger execution when opened. The macro code is stored in a dedicated storage (commonly named “Macros” in Word documents and “_VBA_PROJECT_CUR” in Excel workbooks) alongside the main “WordDocument” or “Workbook” stream, while an additional payload may reside in a custom storage object.
Obfuscation Techniques
Stream Renaming: Changing stream names to nonstandard strings can confuse parsing tools that rely on standard names.
Sector Shuffling: Rearranging sectors to create a fragmented layout that bypasses simplistic parsers.
Invalid FAT Entries: Deliberately setting FAT entries to incorrect values, causing parsers to misinterpret the data and skip detection.
Encrypting Property Streams: Encrypting the property set streams can obscure the author or creation date, misleading provenance checks.
Custom CLSIDs: Using nonstandard CLSIDs for embedded objects, thereby evading standard OLE parsers that only handle known CLSIDs.
Detection and Prevention
Signature Scanning: Anti-malware scanners look for known byte signatures or specific macro patterns within CFF streams.
Property Verification: Tools check that metadata such as the author field matches expected values and flag suspicious or inconsistent entries.
Integrity Hashing: Calculating SHA-256 hashes of stream data allows detection of tampering or corruption.
Sandbox Analysis: Opening suspicious CFF files within a controlled environment to observe macro execution or embedded code.
Policy Enforcement: Email gateways may enforce policies that block documents with embedded OLE objects or macro-enabled streams.
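The integrity-hashing item above can be illustrated with a short sketch. It assumes the streams have already been extracted into a dictionary mapping stream names to bytes; that structure and the function names are hypothetical.

```python
import hashlib


def hash_streams(streams: dict) -> dict:
    """Map each stream name to the SHA-256 hex digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in streams.items()}


def changed_streams(baseline: dict, current: dict) -> list:
    """Names whose digest differs from, or is missing in, the baseline."""
    return sorted(name for name, digest in current.items()
                  if baseline.get(name) != digest)


before = hash_streams({"WordDocument": b"text", "1Table": b"fmt"})
after = hash_streams({"WordDocument": b"text!", "1Table": b"fmt"})
print(changed_streams(before, after))  # ['WordDocument']
```

Hashing per stream rather than per file shows which part of a document changed, which a whole-file hash cannot do.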
Malware Use and Detection (Detailed)
Macro-Based Attacks
Macro-enabled Office documents store VBA macro code in CFF streams. Alongside the main “WordDocument” or “Workbook” stream, the file contains a VBA project storage whose streams hold the compressed macro source and compiled code. Malware authors embed malicious macros that run when the document is opened, often through auto-executing procedures such as “AutoOpen”. The macro may download additional payloads or modify system settings.
Embedded Scripts and Exploit Kits
Some malware leverages the OLE Object system to embed exploit scripts directly into the document’s streams. These scripts may target vulnerabilities in Office’s OLE parser or the underlying Windows API. By embedding the exploit within a CFF stream, attackers reduce the likelihood of detection by standard antivirus solutions that focus on executable files.
Polymorphic Documents
Polymorphic malware modifies the internal structure of the CFF file, such as changing stream names or reordering sectors, to evade signature-based detection. Analysts must rely on heuristic methods, like detecting abnormal FAT chains or irregular property values, to identify such variants.
Detection Strategies
Static Analysis: Examining the compound file header and property streams for anomalies, such as inconsistent sector sizes or unusual author values.
Dynamic Analysis: Executing the document in a sandboxed environment to monitor macro execution and detect network activity.
Signature Extraction: Computing unique signatures of the embedded streams (e.g., the “VBAProject” stream) and comparing them against known malicious patterns.
Heuristic Detection: Looking for patterns such as large numbers of custom property sets or streams with nonstandard names, which are common in malicious documents.
Sandboxing Policies: Configuring email gateways or document management systems to automatically disable macros in unknown documents.
Malware Detection Techniques Using CFF
Analyzing Macro Code
Using the olevba tool from the oletools suite, analysts can extract VBA macro source from a CFF file. The code lives in module streams inside the document’s VBA project storage, where it is stored in compressed form; in Office Open XML files, the equivalent container is an embedded CFF file named vbaProject.bin. By reviewing the extracted source, analysts identify suspicious calls such as Shell or CreateObject, or auto-executing procedures such as AutoOpen, which may indicate malicious behavior.
Detecting Obfuscated Streams
Obfuscated streams may be identified by their nonstandard names or by the presence of high entropy in the stream data. Analysts compute the Shannon entropy of each stream; streams with values close to the maximum of 8 bits per byte (1.0 when normalized) are likely to contain encrypted or compressed data. Further inspection may involve attempting to decompress or decrypt the data using known algorithms, such as AES or RC4.
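The entropy measure can be computed with the standard library alone. The sketch below reports entropy in bits per byte, so values range from 0 to 8; dividing by 8 gives a normalized score between 0 and 1. The function name is an illustrative choice.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the observed byte values.
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())


print(shannon_entropy(bytes(range(256))))  # 8.0, every byte value equally likely
print(shannon_entropy(b"ABAB" * 64))       # 1.0, only two values in equal share
```

Short streams give unreliable readings, so a threshold is best applied only to streams of at least a few hundred bytes.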
Cross-Referencing Properties
Malware authors sometimes manipulate document properties to deceive forensic investigators. By cross-referencing the Summary Information stream with the document’s content, analysts can detect inconsistencies, such as an author field or timestamps that do not match the rest of the document. In oletools, the olemeta utility prints the standard properties from these streams, and oleid reports summary indicators such as whether the file contains macros or is encrypted, aiding in property verification.
Automated Scanning
Enterprise security solutions often integrate automated scanners that parse CFF documents. These scanners look for known malicious macro signatures, flag unusual property sets, and report potential threats. The scanners may also enforce policies that disable macros in documents with certain property values.
Malware Use and Detection (Examples)
Macro Virus: “Trojan:Word/Office.Boundary”
This virus spreads through Word documents with malicious macros. The macro code is stored in the VBAProject stream of a CFF file. When the document is opened, the macro triggers a download of a secondary payload and modifies the system registry.
Trojan:Office/OneClick.A
Infected Excel files contain a malicious VBA macro that launches a command shell when the workbook is opened. The macro’s instructions are kept in small streams that are stored in the mini-stream, which simplistic scanners that read only regular sectors may overlook.
Trojan:Office/OfficePatcher
Utilizes a custom storage object within a CFF file to hide malicious code. The custom storage holds a compressed binary payload that is executed when the document is opened. Analysts must parse the storage hierarchy to locate the hidden payload.
Future Outlook and Modern Alternatives
Office Open XML (OOXML)
With Office 2007, Microsoft introduced Office Open XML, a ZIP-based format that replaces CFF for newer documents. OOXML offers better support for modern features, such as XML-based styling and improved metadata handling. Like CFF, an OOXML package still bundles multiple parts into a single file, but it does so with a standard ZIP container and a simpler structure.
OpenDocument Format (ODF)
OpenDocument, used by LibreOffice and Apache OpenOffice, provides an open XML-based alternative to CFF. Unlike CFF’s binary format, ODF stores data in ZIP containers with XML streams, improving portability and readability.
Legacy Format Support
Despite the adoption of newer formats, many organizations continue to use CFF for backward compatibility. Consequently, the demand for tools that read, write, and analyze CFF files persists. Software vendors often bundle CFF support into their product suites to maintain compatibility across multiple Office versions.
Research Directions
Automated CFF repair: Developing algorithms that automatically reconstruct corrupted sector chains.
Machine learning for malware detection: Using features derived from FAT patterns and property streams to train classifiers that detect malicious documents.
Cross-platform interoperability: Enhancing libraries that translate between CFF and OOXML while preserving metadata fidelity.