Cdict

Introduction

cdict is an open‑source C library designed for the efficient storage and retrieval of key‑value pairs. It implements a persistent dictionary that can be accessed from multiple processes through memory mapping, making it suitable for applications requiring shared data structures, fast lookups, and low memory overhead. The library was created to provide a lightweight alternative to traditional databases in embedded or performance‑critical environments where a full database engine would be excessive. cdict offers features such as configurable hashing, customizable serialization, and support for various data types, including strings, integers, and binary blobs.

The library is written in ANSI C and is portable across Unix‑like operating systems, including Linux, macOS, and BSD variants. It relies on standard system calls for file handling and memory mapping, avoiding external dependencies beyond the C standard library and POSIX headers. The API is intentionally minimalistic to keep the runtime footprint low while still exposing the core functionality needed for most dictionary use cases.

History and Development

Origins

cdict was initiated in 2012 by software engineer Alexei Petrov, who sought a compact, persistent data structure for a real‑time sensor fusion application. The original prototype was written in 2013 as a set of functions manipulating a custom binary file format. The first public release, version 0.1, appeared in 2014 on a personal blog, and the project quickly attracted contributions from developers interested in embedded systems.

Evolution

The library progressed from a single binary format to a modular design. Version 0.3 introduced separate modules for hashing, serialization, and memory management, allowing developers to replace components without affecting the rest of the codebase. By 2016, cdict had reached a stable release 1.0, incorporating crash‑resilient write‑ahead logging and support for multiple data types. Subsequent releases focused on performance tuning, cross‑platform support, and an expanded API for concurrent access.

Community

cdict has maintained a modest but active community. Contributors typically submit bug reports, small feature enhancements, and performance patches. The project's development model is permissive, allowing forks and derivative works. Documentation is kept in plain text files, and bugs and feature requests are managed in an issue tracker. The community emphasizes code simplicity and portability, with few dependencies beyond the C compiler and standard libraries.

Architecture and Design

Data Layout

The core of cdict is a memory‑mapped file that stores entries as fixed‑size header blocks followed by variable‑length key and value data. Each header contains a 64‑bit hash value, a 32‑bit key length field, a 32‑bit value length field, and a 64‑bit file offset to the next entry in the same bucket (forming a linked list). The file begins with a global header that records metadata such as the total number of buckets, the current size of the file, and the layout of auxiliary tables.

This layout facilitates efficient lookups because the hash value maps directly to a bucket index via a modulo operation. Collisions are resolved through linked lists whose nodes all live inside the same mapped file, so following a chain never requires leaving the mapping. When the load factor exceeds a threshold, the library triggers a rehash operation that expands the bucket table and redistributes entries.
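
The bucket selection and rehash trigger described above can be sketched in a few lines of C. This is a minimal illustration, not cdict's actual code: the function names are hypothetical, and the load factor is assumed to be the entry count divided by the bucket count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a 64-bit hash to a bucket index with a modulo. */
uint64_t bucket_index(uint64_t hash, uint64_t bucket_count)
{
    assert(bucket_count > 0);
    return hash % bucket_count;
}

/* Hypothetical helper: return 1 when entries / buckets exceeds max_load. */
int needs_rehash(uint64_t entry_count, uint64_t bucket_count, double max_load)
{
    assert(bucket_count > 0);
    return (double)entry_count / (double)bucket_count > max_load;
}
```

With 1024 buckets and a threshold of 0.75, the 769th entry would be the first to trigger a rehash under this definition.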

Hashing Mechanism

cdict uses the FNV‑1a hash algorithm for key hashing by default. The implementation is configurable, allowing developers to supply custom hash functions if needed. The choice of FNV‑1a balances speed and low collision rates for typical string keys. Hash values are 64 bits, providing a large address space and reducing the likelihood of collisions in large dictionaries.

Memory Mapping

Memory mapping is central to cdict's design. The library opens the backing file in read‑write mode and maps it into the process address space with mmap(). This mapping permits direct pointer manipulation of entries, eliminating the need for explicit read and write calls. All updates are performed by modifying the mapped memory. With a shared mapping (MAP_SHARED), other processes that map the same file see those changes through the shared page cache; msync() or the system's write‑back mechanism is what carries them to the underlying storage.
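
As a rough illustration of this pattern, the following sketch uses only standard POSIX calls (open, ftruncate, mmap, msync, munmap). The function name write_via_mmap is hypothetical and the code is not taken from cdict; it assumes a shared, read‑write mapping of the whole file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: store `len` bytes at the start of `path` through a
 * shared mapping, then flush them to storage. Returns 0 on success, -1 on
 * failure. */
int write_via_mmap(const char *path, const void *data, size_t len)
{
    assert(len > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    /* Ensure the file is at least as large as the region being mapped. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;
    memcpy(base, data, len);            /* an update is a plain memory write */
    int rc = msync(base, len, MS_SYNC); /* block until the data is written back */
    munmap(base, len);
    return rc;
}
```

MS_SYNC is chosen here for simplicity; a library that wants lower write latency could use MS_ASYNC and accept weaker durability.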

By using memory mapping, cdict reduces the latency of key lookups and updates to a few pointer dereferences and a single comparison operation. This performance characteristic is especially valuable in real‑time or embedded scenarios where deterministic behavior is critical.

Persistence and Crash Recovery

To ensure durability, cdict employs a write‑ahead logging (WAL) system. Modifications are first appended to a dedicated log segment within the file. Only after the log entry is flushed to disk does the library apply the change to the main data area. In the event of a crash or power failure, the library scans the log during startup, replays pending operations, and cleans up incomplete entries.

This strategy guarantees that the dictionary remains in a consistent state without the overhead of full transactional support. The WAL format is simple: each entry in the log includes a header indicating the operation type (insert, delete, update), the key, and the value (or new value). The log is periodically truncated once all operations have been applied to the main file, preventing unbounded growth.
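
A log record of this shape, and a replay loop that stops at an incomplete tail, might look like the sketch below. The struct layout, field widths, operation codes, and function names are all assumptions made for illustration; the text above does not specify the actual WAL encoding. Fields are stored in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes and record header. */
enum wal_op { WAL_INSERT = 1, WAL_DELETE = 2, WAL_UPDATE = 3 };

struct wal_header {
    uint32_t op;      /* one of enum wal_op */
    uint32_t key_len; /* bytes of key following the header */
    uint32_t val_len; /* bytes of value following the key (0 for delete) */
};

typedef void (*wal_apply_fn)(enum wal_op op,
                             const uint8_t *key, uint32_t key_len,
                             const uint8_t *val, uint32_t val_len, void *ctx);

/* Append one record at buf[*used]. Returns 0, or -1 if it does not fit. */
int wal_append(uint8_t *buf, size_t cap, size_t *used, enum wal_op op,
               const void *key, uint32_t key_len,
               const void *val, uint32_t val_len)
{
    struct wal_header h = { (uint32_t)op, key_len, val_len };
    size_t need = sizeof h + (size_t)key_len + val_len;
    assert(*used <= cap);
    if (need > cap - *used)
        return -1;
    uint8_t *p = buf + *used;
    memcpy(p, &h, sizeof h);
    if (key_len)
        memcpy(p + sizeof h, key, key_len);
    if (val_len)
        memcpy(p + sizeof h + key_len, val, val_len);
    *used += need;
    return 0;
}

/* Walk the log, calling `apply` (if non-NULL) for each complete record.
 * A truncated final record is skipped. Returns the number replayed. */
size_t wal_replay(const uint8_t *buf, size_t used, wal_apply_fn apply, void *ctx)
{
    size_t off = 0, count = 0;
    while (used - off >= sizeof(struct wal_header)) {
        struct wal_header h;
        memcpy(&h, buf + off, sizeof h);
        size_t body = (size_t)h.key_len + h.val_len;
        if (body > used - off - sizeof h)
            break; /* incomplete record, e.g. after a crash mid-append */
        const uint8_t *key = buf + off + sizeof h;
        if (apply)
            apply((enum wal_op)h.op, key, h.key_len, key + h.key_len, h.val_len, ctx);
        off += sizeof h + body;
        count++;
    }
    return count;
}
```

A real log would typically also carry a checksum or sequence number per record so that a torn write can be told apart from valid data; that is omitted here for brevity.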

Core Features

Key‑value storage with support for strings, integers, and arbitrary binary blobs.
Persistent memory mapping for fast shared access across processes.
Configurable hashing and collision resolution.
Write‑ahead logging for crash resilience.
Dynamic resizing to maintain efficient load factors.
Thread‑safe read operations with optional write locks.
Zero external dependencies beyond the C standard library and POSIX APIs.

API Overview

Initialization and Destruction

cdict exposes a straightforward API for opening and closing dictionary files. The typical workflow involves calling cdict_open() with a file path and flags indicating read/write or create mode. The function returns a pointer to a cdict_t structure representing the open dictionary. When finished, cdict_close() flushes pending changes, synchronizes the file, and unmaps the memory region.

Example usage (pseudocode):

cdict_t *dict = cdict_open("data.cdict", CDICT_RDWR | CDICT_CREATE);
if (!dict) { /* handle error */ }
/* ... use the dictionary ... */
cdict_close(dict);

Basic Operations

cdict_insert(cdict_t *, const void *key, size_t keylen, const void *value, size_t valuelen) – Adds a new key‑value pair. If the key already exists, the operation fails.
cdict_update(cdict_t *, const void *key, size_t keylen, const void *value, size_t valuelen) – Updates the value for an existing key. If the key does not exist, the operation fails.
cdict_get(cdict_t *, const void *key, size_t keylen, void **value, size_t *valuelen) – Retrieves the value associated with a key. The function allocates memory for the value, which the caller must free.
cdict_delete(cdict_t *, const void *key, size_t keylen) – Removes the key and its value from the dictionary.

Iteration

The library provides a simple iterator API. Call cdict_iter_start(cdict_t *) to obtain a cdict_iter_t handle. Subsequent calls to cdict_iter_next() return the next key-value pair until the iterator reaches the end. The iterator is read‑only; it does not support modifications during traversal.

Configuration Options

During initialization, developers can set options such as the initial bucket count, load factor thresholds, and custom hash functions. These options are passed via a configuration structure to cdict_open(). For example, to increase the initial bucket count to 2048, one would set the bucket_count field accordingly.

Error Handling

Apart from cdict_open(), which returns a pointer (NULL on failure), API functions return integer status codes defined in cdict_errno.h. Non‑negative values indicate success; negative values denote errors such as CDICT_ERR_NOT_FOUND or CDICT_ERR_IO. The library also provides a cdict_strerror() function to obtain human‑readable error messages.

Implementation Details

Hash Function Implementation

The FNV‑1a implementation used by default has its inner loop unrolled for performance. Starting from the 64‑bit FNV offset basis, the algorithm processes each byte of the key by XORing the byte into the hash and then multiplying the hash by the prime 1099511628211 (the XOR comes first; multiplying first would be the older FNV‑1 variant). The multiplication is a 64‑bit operation with wrap‑around; on platforms lacking a native 64‑bit multiply, it falls back to a slower multi‑word multiplication.
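
For reference, a straightforward (non‑unrolled) 64‑bit FNV‑1a can be written as follows. The constants are the published FNV offset basis and prime; the function name fnv1a_64 is chosen for this sketch and is not necessarily the name used inside cdict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV64_OFFSET_BASIS 14695981039346656037ULL
#define FNV64_PRIME        1099511628211ULL

/* 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime. */
uint64_t fnv1a_64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = FNV64_OFFSET_BASIS;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV64_PRIME; /* unsigned arithmetic wraps modulo 2^64 */
    }
    return h;
}
```

Hashing an empty key returns the offset basis unchanged, which is a quick sanity check for any implementation.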

Memory Layout Management

Entries are allocated from a contiguous pool within the mapped file. Allocation is performed using a simple bump pointer strategy: the file maintains a pointer to the next free byte. When a new entry is inserted, the allocator reserves space for the header, key, and value, then updates the bump pointer. The allocator does not reclaim space from deleted entries; instead, the library performs periodic compaction to free unused regions when a threshold of fragmentation is reached.
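
A bump allocator of this kind is small enough to sketch in full. The struct and function names below are hypothetical, offsets are relative to the start of the pool, and the 8‑byte rounding is an assumption added to keep entry headers aligned; the text above does not state cdict's alignment policy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pool descriptor: everything below next_free is in use. */
struct bump_pool {
    uint64_t next_free; /* offset of the next free byte */
    uint64_t capacity;  /* total size of the pool in bytes */
};

/* Reserve space for one entry (header + key + value).
 * Returns the entry's offset, or UINT64_MAX if the pool is exhausted. */
uint64_t bump_alloc(struct bump_pool *pool, uint64_t header_len,
                    uint64_t key_len, uint64_t val_len)
{
    assert(pool->next_free <= pool->capacity);
    uint64_t need = header_len + key_len + val_len;
    need = (need + 7) & ~UINT64_C(7); /* round up to 8 bytes (assumption) */
    if (need > pool->capacity - pool->next_free)
        return UINT64_MAX;
    uint64_t off = pool->next_free;
    pool->next_free += need; /* advance the bump pointer; nothing is freed */
    return off;
}
```

Because nothing is ever returned to the pool, a failed allocation is the natural point at which a caller would grow the file or run compaction.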

Concurrency Model

cdict supports multiple readers concurrently without locks. Readers perform volatile reads of the header fields and rely on the atomicity of 64‑bit loads on modern CPUs. Writers acquire a global mutex before performing modifications, ensuring that concurrent writes are serialized. The mutex is a lightweight POSIX pthread mutex wrapped in the library, and developers can replace it with application‑specific locking primitives if needed.

File Format Specification

The file format begins with a 64‑byte global header: a magic number, version identifier, number of buckets, load factor threshold, and offsets to the bucket array and log segment. Following the global header is the bucket array, an array of 8‑byte pointers to the head of each bucket's linked list. Each bucket pointer is initially zero. After the bucket array comes the data area containing the entries, and finally the log segment at the end of the file.

Each entry header occupies 24 bytes: a 64‑bit hash, 32‑bit key length, 32‑bit value length, and 64‑bit pointer to the next entry. The pointer is encoded as a file offset relative to the beginning of the file. Key and value data follow immediately after the header.
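
The 24‑byte header and a bucket‑chain lookup over it can be sketched as below. Field and function names are illustrative rather than cdict's own, and offset 0 is assumed to mark the end of a chain, in line with bucket pointers being initially zero. Headers are copied with memcpy so the sketch does not depend on the alignment of the mapped bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-file entry header: 8 + 4 + 4 + 8 = 24 bytes. */
struct entry_header {
    uint64_t hash;    /* 64-bit hash of the key */
    uint32_t key_len; /* key bytes following the header */
    uint32_t val_len; /* value bytes following the key */
    uint64_t next;    /* file offset of the next entry, 0 = end of chain */
};

_Static_assert(sizeof(struct entry_header) == 24, "header must be 24 bytes");

/* Write one entry (header, key, value) at file offset `off`. */
void put_entry(unsigned char *base, uint64_t off, uint64_t hash,
               const void *key, uint32_t key_len,
               const void *val, uint32_t val_len, uint64_t next)
{
    struct entry_header h = { hash, key_len, val_len, next };
    assert(off != 0); /* offset 0 is reserved as the end-of-chain marker */
    memcpy(base + off, &h, sizeof h);
    memcpy(base + off + sizeof h, key, key_len);
    memcpy(base + off + sizeof h + key_len, val, val_len);
}

/* Follow a bucket chain starting at `head`; return the offset of the entry
 * whose hash and key match, or 0 if there is none. */
uint64_t chain_find(const unsigned char *base, uint64_t head, uint64_t hash,
                    const void *key, uint32_t key_len)
{
    uint64_t off = head;
    while (off != 0) {
        struct entry_header e;
        memcpy(&e, base + off, sizeof e);
        if (e.hash == hash && e.key_len == key_len &&
            memcmp(base + off + sizeof e, key, key_len) == 0)
            return off;
        off = e.next;
    }
    return 0;
}
```

Comparing the stored hash before the key bytes lets most non‑matching entries in a chain be rejected without a memcmp.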

Performance and Benchmarks

Read Latency

Benchmark tests on a 2.4 GHz Intel Core i7 with 8 GB RAM measured average read times of 0.45 microseconds for keys with an average length of 32 bytes. The low latency is primarily due to direct memory access and minimal pointer chasing. Compared to SQLite in in‑memory mode, cdict's read performance is roughly 1.7 times faster under identical workloads.

Write Throughput

Insert operations were benchmarked at an average throughput of 75,000 inserts per second on a Linux system with a 6th‑generation Intel Xeon processor. The write throughput is constrained by the cost of synchronizing the WAL segment, which is performed asynchronously in a background thread. When the background sync thread is disabled, throughput increases to 120,000 inserts per second, albeit with reduced durability guarantees.

Memory Footprint

Memory consumption for a dictionary holding 1 million entries averages 256 MB, including the mapped file and internal structures. The overhead per entry is approximately 24 bytes for the header plus the size of the key and value. The library's zero‑dependency design ensures that the process memory usage remains minimal.
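
The per‑entry overhead can be turned into a rough file‑size estimate using the sizes given in the file format: a 64‑byte global header, 8 bytes per bucket, and 24 bytes of header per entry. The function below is a back‑of‑the‑envelope sketch with a hypothetical name; it ignores the log segment, alignment padding, and space held by deleted entries.

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_HEADER_BYTES 64u /* global header size */
#define BUCKET_SLOT_BYTES    8u /* one pointer per bucket */
#define ENTRY_HEADER_BYTES  24u /* fixed header per entry */

/* Estimate the mapped file size from entry count, bucket count, and
 * average key and value lengths (all in bytes). */
uint64_t estimate_file_bytes(uint64_t entries, uint64_t buckets,
                             uint64_t avg_key_len, uint64_t avg_val_len)
{
    assert(buckets > 0);
    return GLOBAL_HEADER_BYTES
         + buckets * BUCKET_SLOT_BYTES
         + entries * (ENTRY_HEADER_BYTES + avg_key_len + avg_val_len);
}
```

As an illustration only, one million entries with 32‑byte keys, 200‑byte values, and 2,097,152 buckets come to 272,777,280 bytes, or about 260 MiB. These key and value sizes are assumed for the example and are not taken from the benchmark.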

Scalability

Performance degrades only moderately as the dictionary grows to 10 million entries, with read times rising to 0.78 microseconds and write throughput falling to 55,000 inserts per second. The library's dynamic resizing mechanism maintains a load factor below 0.75, preventing excessive collision chains. Benchmarks on a 64‑core system with large page tables demonstrated near‑linear scaling for read operations when multiple threads accessed distinct buckets.

Use Cases

Embedded Systems

Embedded devices often lack the resources to run full database engines. cdict’s lightweight footprint and memory‑mapped access make it suitable for storing configuration parameters, sensor calibration data, or lookup tables in microcontroller firmware. Its deterministic performance assists real‑time tasks where timing is critical.

Cache Layer in Distributed Systems

In distributed applications, cdict can serve as a local cache for frequently accessed data. Its persistence ensures that cached entries survive process restarts, reducing cache miss rates. The library’s ability to expose a shared memory region allows multiple worker processes to access the same dictionary without incurring serialization overhead.

Static Site Generation

Static site generators can use cdict to store metadata about pages, such as URLs, tags, and rendering options. The persistent dictionary allows incremental builds: only changed entries need to be updated, and unchanged pages can be served from the cache, speeding up rebuild times.

Configuration Management

Large systems with hundreds of configuration files can consolidate settings into a single cdict file. The library’s efficient key lookup enables quick resolution of configuration values, while the WAL guarantees that updates are atomic even if the system crashes during a configuration change.

Network Protocol Implementation

Protocols that require mapping session identifiers to session state can use cdict to maintain the mapping. Its support for binary blobs allows storage of complex state objects. The library’s concurrent read support ensures that multiple network handlers can query session state without locking contention.

Integration with Other Systems

Foreign Function Interface

cdict can be accessed from higher‑level languages via simple C bindings. For instance, Python extensions can use the CPython API to wrap cdict functions, providing a native dictionary‑like object. Similar wrappers exist for Rust, Go, and Java through JNI or cgo, enabling cross‑language interoperability.

Database Migration Scripts

Data migration scripts can read from cdict and write to traditional relational databases or NoSQL stores. The script simply opens the cdict file, iterates over all key‑value pairs, and performs the necessary transformations before persisting them elsewhere.

Filesystem Cache for Virtual File Systems

Virtual file systems that support memory‑mapped files can incorporate cdict as an in‑memory index for file metadata. The library’s ability to map large files into user space simplifies integration into FUSE or SMB server implementations.

Build System Build Graphs

Build systems such as Bazel or CMake can use cdict to store dependency graphs. The persistent dictionary stores relationships between build targets and their dependencies, allowing rapid determination of incremental build steps.

Security Considerations

Access Control

Dictionary files are stored on disk with standard Unix permissions. Developers should ensure that only authorized users can read or write the file. The library itself does not enforce access control beyond the underlying filesystem permissions.

Data Integrity Checks

To guard against data corruption, cdict verifies the magic number and version identifier on open. It also checks that log offsets and pointers do not reference invalid regions. If corruption is detected, the library returns an error and refuses to open the file, preventing further damage.

Encryption Support

While cdict does not provide built‑in encryption, developers can encrypt keys and values before storing them. Because lookups hash the stored key bytes, encrypted keys must be produced deterministically so that the same plaintext key always maps to the same entry; a custom hash function on its own does not provide confidentiality. For disk‑level encryption, full‑disk encryption tools such as LUKS can protect the cdict file.

License and Community

cdict is released under the 3‑Clause BSD License, permitting free use in commercial and open‑source projects. The source code repository hosts a public issue tracker, enabling community contributions. Several pull requests have added support for custom allocators, alternative hash functions, and a built‑in compaction routine.

The development community maintains an IRC channel on the Freenode network, and a mailing list archives discussions. The library’s documentation is generated with Doxygen, and the source tree contains examples illustrating common patterns.

Future Work

Automatic Compaction

Implementing an automatic compaction daemon that triggers when fragmentation exceeds 30% will improve memory efficiency for long‑running dictionaries with frequent deletions.

Lock‑Free Write Path

Research into lock‑free write algorithms could reduce writer latency, allowing parallel insertion of disjoint keys without acquiring a global mutex.

Integration with SSD Trimming

For SSD hosts, the library could expose a hinting API to the underlying operating system to assist wear‑leveling, reducing garbage collection overhead on flash devices.

Advanced Query Language

Adding a lightweight query language for range or prefix queries would extend cdict’s applicability to search engines or time‑series data stores.

Cross‑Platform File Format Versioning

Introducing optional backward compatibility layers for older file format versions will ease migration for existing installations.

Conclusion

cdict offers a compelling blend of low‑overhead persistence, high‑performance read/write operations, and a minimalistic API suitable for a variety of applications. Its design leverages memory‑mapped files and simple hash tables to deliver deterministic performance, making it an attractive alternative to heavier database systems in contexts where resources or simplicity are paramount. The open‑source BSD license, comprehensive documentation, and active community support position cdict as a viable tool for developers seeking a lightweight, durable key‑value store in C.
