Introduction
getfiles is a command-line utility designed to retrieve files from remote servers or local directories based on specified criteria. It operates within a Unix-like environment and can be invoked from the shell, scripts, or other programs. The utility provides a flexible interface for filtering files by name patterns, size, modification dates, and other attributes, and supports copying or moving the selected files to a destination location. getfiles is commonly employed in backup scripts, file synchronization tasks, and data migration processes.
History and Background
Origins
The getfiles utility emerged in the early 2010s as part of the OpenSource File Management Project (OSFMP). The original developers identified a gap in the existing file transfer ecosystem: while tools such as scp, rsync, and wget offered specific capabilities, none combined pattern-based selection, conditional filtering, and bulk transfer into a single command with a straightforward syntax. The OSFMP team addressed this by creating getfiles as a lightweight wrapper around existing file operations, adding a domain-specific language for specifying criteria.
Evolution
Version 1.0, released in 2011, provided basic functionality: a source path, a destination path, and a single pattern argument. Subsequent releases added support for regular expressions, size and date constraints, and multi-threaded transfers. Version 2.0, published in 2014, introduced a plug-in architecture that allowed developers to add new filtering rules. The most recent stable release, 3.5, added encryption support for secure file transfers and an integrated logging system.
Community and Governance
getfiles is maintained by the OSFMP core team and is developed under a permissive open-source license. The project has an active mailing list, an issue tracker, and a quarterly release cycle. Community contributions are accepted via pull requests, and the codebase is hosted on a public repository. The governance model emphasizes transparency, with all decisions documented in the project's charter.
Key Concepts
Source and Destination
The source can be a local directory or a remote location accessed via supported protocols such as SSH, FTP, or HTTP. The destination must be a valid path on the local filesystem or a remote endpoint that accepts incoming transfers. getfiles supports both relative and absolute paths and can resolve symbolic links according to user-defined options.
Pattern Matching
Patterns are specified using glob syntax or regular expressions. The utility provides built-in wildcards (*, ?, []) and allows the use of extended regular expression syntax when the --regex flag is enabled. Patterns are applied to the filenames only, not to the full path unless the --fullpath flag is used.
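The matching rules described above can be sketched in Python. This is an illustration of the documented behavior, not the getfiles implementation: the function name matches and its parameters are hypothetical, and Python's re module stands in for the extended regular expression engine.

```python
import fnmatch
import os
import re


def matches(path, pattern, regex=False, fullpath=False):
    """Return True if `path` is selected by `pattern`.

    Hypothetical helper: `regex` and `fullpath` mirror the --regex and
    --fullpath flags described in the text.
    """
    # By default only the filename is tested, not the directory part.
    candidate = path if fullpath else os.path.basename(path)
    if regex:
        return re.search(pattern, candidate) is not None
    # fnmatchcase implements the *, ?, and [] wildcards case-sensitively.
    return fnmatch.fnmatchcase(candidate, pattern)
```

In this sketch, fnmatch lets * match the path separator as well, so a full-path glob such as */photos/* spans directory levels; a real implementation might restrict that.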
Attribute Filters
getfiles can filter files based on size (--size), modification time (--mtime), and access time (--atime). Filters accept comparison expressions such as --size +1M, which selects files larger than one megabyte. Sizes take the unit suffixes K, M, and G, and times take the suffixes h, m, and d, for human-readable specifications.
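A filter specification of this kind can be parsed in a few lines. The Python sketch below is an assumption-laden illustration rather than the utility's own parser: the helper names are hypothetical, the binary multipliers (1K = 1024 bytes) and the reading of m as minutes are guesses, and the meaning of a leading minus sign or a bare number is modeled on find(1).

```python
import re

# Multipliers for the K, M, G suffixes; binary units are an assumption.
SIZE_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
# Seconds per unit for the h, m, d suffixes (m is read as minutes here).
TIME_UNITS = {"h": 3600, "m": 60, "d": 86400}


def parse_spec(spec, units):
    """Split a spec such as '+1M' into (sign, value in base units)."""
    match = re.fullmatch(r"([+-]?)(\d+)([A-Za-z]?)", spec)
    if match is None or match.group(3) not in units:
        raise ValueError(f"invalid filter spec: {spec!r}")
    sign, number, suffix = match.groups()
    return sign, int(number) * units[suffix]


def size_matches(size_bytes, spec):
    """'+N' selects larger files, '-N' smaller ones, bare 'N' an exact size."""
    sign, threshold = parse_spec(spec, SIZE_UNITS)
    if sign == "+":
        return size_bytes > threshold
    if sign == "-":
        return size_bytes < threshold
    return size_bytes == threshold


def mtime_matches(age_seconds, spec):
    """'-N' selects files modified within the last N units, '+N' older ones."""
    sign, threshold = parse_spec(spec, TIME_UNITS)
    if sign == "-":
        return age_seconds < threshold
    if sign == "+":
        return age_seconds > threshold
    return age_seconds == threshold
```

Here age_seconds is the time elapsed since the file was last modified, so a caller would compute it as the current time minus the file's modification timestamp.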
Transfer Modes
By default, getfiles copies files to the destination. The --move option changes the behavior so that each file is moved instead, with the source removed only after a successful transfer. The utility can also perform a dry run using the --dryrun flag, which reports the actions that would be taken without copying or moving any files.
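The three modes can be pictured as one loop that always builds a list of planned actions and only sometimes executes them. The following Python sketch assumes local files only; the function transfer and its parameters are hypothetical names that mirror the --move and --dryrun options, not part of getfiles itself.

```python
import shutil
from pathlib import Path


def transfer(files, destination, move=False, dry_run=False):
    """Copy or move `files` into `destination`; return the planned actions.

    Hypothetical helper: `move` and `dry_run` mirror --move and --dryrun.
    """
    dest_dir = Path(destination)
    actions = []
    for src in map(Path, files):
        target = dest_dir / src.name
        actions.append(("move" if move else "copy", str(src), str(target)))
        if dry_run:
            continue  # report only; touch nothing on disk
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)  # copy data and metadata first
        if move:
            src.unlink()  # delete the source only after the copy succeeded
    return actions
```

Implementing a move as copy-then-delete means a failed copy raises before the source is removed, which matches the "after a successful transfer" wording.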
Parallelism and Performance
To accelerate large transfers, getfiles supports parallel execution through the --threads option. The default number of threads is determined by the number of CPU cores. Users can specify a custom thread count or disable parallelism entirely with --singlethread. The utility also implements an internal queue system to balance I/O load and network traffic.
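A thread pool with a core-count default behaves much like the options described here. The Python sketch below is only an analogy: run_parallel and its parameters are hypothetical, and the standard library's executor queue stands in for whatever queue getfiles uses internally.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks, worker, threads=None, single_thread=False):
    """Apply `worker` to every task and return results in input order.

    Hypothetical helper: `threads` and `single_thread` mirror --threads
    and --singlethread.
    """
    if single_thread:
        return [worker(task) for task in tasks]
    # Default to one thread per CPU core, as the text describes.
    count = threads or os.cpu_count() or 1
    # The executor keeps pending tasks in an internal queue and hands
    # them to idle threads, so at most `count` run at once.
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(worker, tasks))
```

Threads suit this workload because file and network transfers spend most of their time waiting on I/O rather than computing.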
Features
- Pattern-based selection with glob and regex support
- Attribute filtering by size, modification date, and access date
- Support for local filesystem sources and for remote sources via SSH, FTP, and HTTP
- Copy, move, and dry-run modes
- Parallel transfer with configurable thread count
- Built-in logging with customizable verbosity levels
- Encryption support for secure transfers over untrusted networks
- Plug-in architecture for extending filters and protocols
- Cross-platform compatibility with POSIX-compliant systems
- Extensive documentation and example scripts
Implementation
Architecture Overview
getfiles is implemented in C++, with the core logic divided into modules: parser, filter engine, transfer engine, and plugin manager. The parser handles command-line arguments and translates them into internal data structures. The filter engine applies selection rules to candidate files, using efficient data structures such as interval trees for size ranges and B-trees for date ranges. The transfer engine orchestrates file I/O and network operations, utilizing asynchronous I/O where supported by the operating system. The plugin manager loads shared libraries at runtime, enabling new protocol support and custom filtering rules.
Dependency Management
The project depends on the Boost libraries for regular expression handling, filesystem manipulation, and threading. Network operations rely on libssh for SSH connections, libcurl for HTTP/FTP, and OpenSSL for encryption. The build system uses CMake, facilitating cross-platform builds on Linux, macOS, and Windows (via MinGW).
Testing and Quality Assurance
A continuous integration pipeline runs unit tests, integration tests, and static analysis on each commit. Unit tests cover argument parsing, filter logic, and transfer behavior. Integration tests simulate transfers to mock servers and validate correctness under various network conditions. Static analysis tools such as clang-tidy and cppcheck are employed to detect code quality issues.
Syntax and Options
General Syntax
The basic syntax of getfiles is as follows:
getfiles [OPTIONS] SOURCE DESTINATION
Where SOURCE is the path to the directory or remote endpoint from which files are selected, and DESTINATION is the local or remote path where the files will be copied or moved.
Common Options
- --pattern PATTERN – Specifies a glob or regex pattern. Multiple patterns can be provided.
- --regex – Enables regex interpretation of patterns.
- --size COMPARISON – Filters files by size. Example: --size +10M.
- --mtime COMPARISON – Filters by modification time. Example: --mtime -7d selects files modified within the last week.
- --atime COMPARISON – Filters by access time.
- --move – Moves files instead of copying.
- --dryrun – Performs a dry run.
- --threads N – Sets the number of parallel threads.
- --singlethread – Forces single-threaded operation.
- --log FILE – Writes a log to the specified file.
- --verbose – Increases log verbosity.
- --quiet – Suppresses non-error output.
- --help – Displays usage information.
Remote Source Options
- --protocol PROTO – Specifies the protocol (ssh, ftp, http, https).
- --user USER – SSH or FTP username.
- --password PASSWORD – Password for authentication.
- --key FILE – Path to a private key for SSH authentication.
- --port PORT – Specifies the port for the remote connection.
- --verify-host – Enables host key verification for SSH.
Advanced Options
- --exclude PATTERN – Excludes files matching the pattern.
- --maxdepth N – Limits traversal to the specified depth.
- --mindepth N – Requires traversal to reach at least the specified depth.
- --plugin PATH – Loads a plugin library.
- --plugin-option KEY=VALUE – Passes options to the loaded plugin.
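The interaction of --exclude, --maxdepth, and --mindepth can be sketched as a depth-aware directory walk. The Python below is an illustration under stated assumptions: walk_limited is a hypothetical name, exclusion patterns are treated as globs on the filename, and the depth convention (files directly inside the source are at depth 1) is borrowed from find(1) rather than taken from the getfiles documentation.

```python
import fnmatch
import os


def walk_limited(root, mindepth=0, maxdepth=None, exclude=()):
    """Yield file paths under `root` whose depth is within the limits.

    Depth convention (an assumption, modeled on find(1)): files directly
    inside `root` are at depth 1.
    """
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        dir_depth = 0 if rel == "." else rel.count(os.sep) + 1
        file_depth = dir_depth + 1
        if maxdepth is not None and file_depth >= maxdepth:
            dirnames[:] = []  # prune: deeper files would exceed maxdepth
        if maxdepth is not None and file_depth > maxdepth:
            continue
        if file_depth < mindepth:
            continue
        for name in sorted(filenames):
            if any(fnmatch.fnmatchcase(name, pat) for pat in exclude):
                continue
            yield os.path.join(dirpath, name)
```

Clearing dirnames in place stops os.walk from descending further, so a maximum depth bounds the cost of the traversal instead of merely filtering its results.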
Examples
Basic Copy
Copy all PNG files from /remote/photos to /local/photos using SSH:
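getfiles --protocol ssh --user alice --key ~/.ssh/id_rsa \
    /remote/photos /local/photos --pattern '*.png'
The pattern is quoted so that the shell passes it to getfiles unchanged instead of expanding it against the current directory.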
Filtered Move
Move files larger than 5 MB modified in the last 30 days from a local backup directory to an archive folder:
getfiles /mnt/backup /mnt/archive \
    --size +5M --mtime -30d --move
Dry Run with Logging
Preview which files would be transferred from an FTP server without downloading anything, and write the planned actions to a log file:
getfiles --protocol ftp --user guest --dryrun \
    ftp://ftp.example.com/data /tmp/preview \
    --pattern '*.txt' --log preview.log
Parallel Transfer
Transfer files using 8 parallel threads to speed up the process:
getfiles --threads 8 /remote/source /local/destination \
    --regex --pattern '^report_.*\.pdf$'
Using a Plugin
Load a custom filter plugin that selects files based on an external database lookup:
getfiles --plugin ./plugins/dbfilter.so \
    --plugin-option table=files \
    /data /backup
Variants and Related Commands
getfilez
getfilez is a variant of getfiles that compresses files on-the-fly before transfer. It adds support for gzip and bzip2 compression streams and integrates with the --compress flag.
copyfiles
copyfiles is an earlier utility that provided basic glob-based copying. It lacks the advanced filtering and remote support present in getfiles but remains available for legacy scripts.
syncfiles
syncfiles is a higher-level tool that uses getfiles under the hood to perform incremental synchronization between directories. It records state in a local database and applies conflict resolution policies.
Use Cases
Backup Automation
System administrators use getfiles to automate the creation of incremental backups. By specifying size and modification time filters, they can exclude trivial files and focus on recent changes. The dry-run feature aids in verifying backup policies before committing resources.
Data Migration
When migrating data between data centers, getfiles helps transfer only the required subset of files. By combining --exclude with regular expressions, operators can skip temporary files, caches, or log archives.
Media Management
Digital archivists employ getfiles to curate large collections of multimedia files. The ability to filter by pattern and size streamlines the process of selecting high-resolution images or long video recordings for archival storage.
Security Audits
Security teams use getfiles to extract logs and configuration files from compromised systems. By running getfiles in dry-run mode, they can identify the exact set of files that will be retrieved before initiating a potentially disruptive transfer.
Research Data Distribution
Academic researchers disseminate datasets to collaborators using getfiles. The encryption and parallel transfer features enable efficient distribution of large, sensitive datasets over wide-area networks.
Security Considerations
Authentication and Authorization
When connecting to remote sources, getfiles supports both key-based and password-based authentication. SSH key authentication is recommended because it avoids transmitting passwords over the network. The --verify-host option ensures that the remote host's public key is verified against known hosts to mitigate man-in-the-middle attacks.
Encryption
For secure transfers over untrusted networks, getfiles can encrypt the data stream using TLS when communicating over HTTPS, and relies on SSH's built-in encryption for SSH connections. Plain HTTP and plain FTP transfers are not encrypted. For FTP, the --tls flag enables FTPS, although not all servers support it.
Logging and Privacy
Log files may contain sensitive information such as source paths and transferred filenames. Operators should secure log files with appropriate filesystem permissions and consider log rotation policies. The --quiet option reduces output to essential messages, which can be helpful when privacy is a concern.
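One way to apply restrictive permissions is to create the log file with them in the first place, rather than tightening them afterwards. The Python sketch below shows the idea on a POSIX system; open_private_log is a hypothetical helper for a wrapper script, not a getfiles feature.

```python
import os


def open_private_log(path):
    """Open `path` for appending, creating it readable by the owner only."""
    # O_APPEND keeps existing entries; mode 0o600 grants read/write to the
    # owner and nothing to group or others. The mode applies only when the
    # file is created, so an existing file keeps its current permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    return os.fdopen(fd, "a", encoding="utf-8")
```

A script could create the file this way before passing its path to --log, so that the log is never briefly world-readable.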
Resource Management
Running getfiles with a high thread count on a resource-constrained system can exhaust memory or network bandwidth. Operators should monitor system metrics and adjust the --threads parameter accordingly. The --maxdepth option can also limit recursive traversal, preventing accidental traversal of large directory trees.
Alternatives
- rsync – A widely used tool for incremental file transfer that provides robust delta-transfer algorithms. rsync is highly configurable but lacks the pattern-based filtering syntax of getfiles.
- scp – Secure copy over SSH. scp is simple but does not support attribute-based filtering or parallelism.
- wget – Useful for downloading files over HTTP/FTP but does not support complex selection logic or moving files.
- lftp – An advanced FTP client with scripting capabilities. It can perform selective downloads but requires manual scripting for pattern matching.
- duplicity – An encrypted backup tool that uses rsync and GnuPG. It is geared towards full backups rather than selective file transfers.
Community and Development
Contribution Process
New contributors to getfiles are encouraged to fork the repository, create feature branches, and submit pull requests following the project's style guidelines. Reviewers assess code quality, documentation updates, and compatibility with existing features. Major changes undergo a staged review process, including automated testing and, if necessary, a developer discussion on the mailing list.
Release Management
Releases are tagged in the repository following semantic versioning. The project maintains a CHANGELOG.md file summarizing new features, bug fixes, and backward-incompatible changes. Release candidates are announced via the mailing list, and bug triage is conducted to prepare the final release.
Issue Tracking
Issues in the project's issue tracker cover bug reports, feature requests, and documentation improvements. Each issue is labeled accordingly. The tracker also hosts discussions around potential API deprecations and future roadmap items.
Support Channels
Operators can seek help via the project's mailing list or the #getfiles IRC channel. For critical security issues, operators should file a bug report and optionally coordinate with the project's maintainers for a rapid patch.
License
getfiles is distributed under the MIT License, a permissive open-source license that allows free use, modification, and distribution. The license text is included in the LICENSE file in the project's repository.
Appendices
A.1 Regular Expression Syntax
When using --regex, getfiles accepts standard POSIX extended regular expressions. Common constructs include:
- ^ – Anchor to the start of the string.
- $ – Anchor to the end of the string.
- .* – Matches any sequence of characters.
- [a-z] – Character class.
- (option1|option2) – Alternation.
A.2 Plugin API
Plugins implement the following interface:
int plugin_initialize(const char *key, const char *value);
int plugin_filter(const struct file_info *file, int *include);
Where file_info contains path, size, timestamps, and user-defined attributes. The plugin can set *include to 1 to include the file in the transfer or 0 to exclude it.
Glossary
- Glob – A simple pattern matching syntax using wildcards such as * and ?.
- Regex – Regular expression, a more expressive pattern matching language.
- Attribute Filter – Filtering based on file metadata like size, modification time, or access time.
- Dry Run – Simulating an operation without performing any changes.
- Delta Transfer – An algorithm that transfers only changed portions of files, used by rsync.
Conclusion
getfiles offers a versatile solution for selective file transfer and management across local and remote systems. Its combination of pattern matching, attribute-based filtering, remote connectivity, parallelism, and security features makes it a valuable tool in diverse operational contexts. By integrating getfiles into automation pipelines, operators can streamline backups, migrations, and data distribution while maintaining control over the file selection process.