Introduction
CheckM is a bioinformatics software package designed to evaluate the quality of genome assemblies, particularly those derived from metagenomic and single‑cell sequencing projects. The tool estimates two key parameters, completeness and contamination, by examining the presence and copy number of a curated set of single‑copy marker genes that are expected to be conserved within specific phylogenetic groups. By providing an objective assessment of genome quality, CheckM supports downstream analyses such as taxonomic assignment, functional annotation, and comparative genomics.
Since its initial release, CheckM has become a standard component of many microbial genomics pipelines. Its widespread adoption is attributed to its high accuracy, ease of use, and compatibility with large datasets. The following article provides a comprehensive overview of CheckM, covering its theoretical foundations, practical implementation, and role within the broader field of microbial genome analysis.
Background and Context
Genome Assembly in Microbial Genomics
Sequencing technologies have made it possible to generate vast amounts of DNA reads from environmental samples or isolated cells. Assembling these reads into contiguous sequences (contigs) is a fundamental step in reconstructing the genetic blueprint of organisms. Conventional genome assembly relies on reference genomes or high sequencing depth to produce high‑quality assemblies. However, many environments contain uncultured microorganisms, and the resulting assemblies are often fragmented, incomplete, or contaminated by DNA from multiple species.
Metagenome‑assembled genomes (MAGs) and single‑cell amplified genomes (SAGs) represent the most common outputs of microbial genomics studies. Both data types are prone to assembly artifacts, such as chimeric contigs or incomplete coverage, which necessitate reliable quality assessment tools. Accurate assessment of completeness and contamination informs researchers whether a given assembly is suitable for functional inference or phylogenetic analysis.
Need for Quality Assessment
Incomplete or contaminated genomes can mislead biological conclusions. For example, missing genes may lead to an underestimate of metabolic capabilities, while contamination may falsely suggest novel metabolic pathways. It is therefore essential to quantify assembly quality in a reproducible and automated manner. Manual inspection and curation are laborious and subjective, whereas computational tools that provide objective metrics allow large‑scale studies to maintain data integrity.
Overview of CheckM
Core Features
CheckM performs three primary functions:
- Estimation of genome completeness based on the presence of expected marker genes.
- Estimation of contamination based on the copy number of these marker genes.
- Taxonomic profiling to assign a phylogenetic label to the genome, which informs the choice of marker gene set.
The software is distributed as a command‑line tool written in Python; computationally intensive steps are delegated to external programs such as HMMER, Prodigal, and pplacer. It processes a directory of genome assemblies in a single run and writes its results as plain‑text tables, optionally tab‑separated for machine parsing.
Underlying Methodology
CheckM’s methodology rests on two fundamental concepts: a database of universal, single‑copy marker genes and a hierarchical phylogenetic framework. The marker gene database is partitioned into clades (e.g., phylum, class, order) with each clade associated with a specific set of marker genes. The selection of markers is informed by phylogenomic studies that identified genes conserved across broad taxonomic groups yet typically present in single copies.
During analysis, CheckM aligns the input genome’s predicted proteins against the marker gene database using HMM (Hidden Markov Model) searches. The number of recovered marker genes, along with their multiplicity, informs the calculation of completeness and contamination:
- Completeness is estimated as the fraction of expected marker genes that are present.
- Contamination is inferred from the proportion of marker genes that appear in multiple copies, suggesting that sequences from other organisms are present.
CheckM reports completeness and contamination as separate percentages. Researchers commonly combine the two to categorize a genome as high, medium, or low quality based on established community thresholds.
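The categorization into quality tiers can be sketched as a small function. This is an illustrative sketch, not part of CheckM: the function name is hypothetical, the 90 % and 5 % cut-offs follow the figures quoted in this article, and the 50 % and 10 % cut-offs for the medium tier are an assumption based on commonly used community conventions.

```python
def quality_tier(completeness: float, contamination: float) -> str:
    """Assign a coarse quality label from two percentages.

    Thresholds are assumptions for illustration; adjust them to the
    standard used in a given study.
    """
    if completeness > 90.0 and contamination <= 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "low"


# Example: label a few (completeness, contamination) pairs.
for pair in [(95.0, 2.0), (70.0, 6.0), (40.0, 1.0)]:
    print(pair, quality_tier(*pair))
```

Keeping the thresholds in one function makes it easy to apply the same rule to every genome in a study.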
Key Concepts
Completeness Estimation
Completeness reflects how much of the organism’s genome is represented in the assembly. CheckM estimates completeness by dividing the number of detected marker genes by the total number of marker genes expected for the genome’s taxonomic clade. The expected number of markers varies by clade; more inclusive clades contain fewer markers to ensure robustness across diverse lineages.
Completeness scores range from 0 to 100 percent. Assemblies with >90 % completeness are generally considered near‑complete, whereas scores below 50 % indicate severe fragmentation or loss of genomic regions.
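The calculation described above can be illustrated with a minimal sketch. It assumes marker hits have already been counted per marker; the function name and the marker identifiers are hypothetical, and the sketch ignores refinements that CheckM applies, such as grouping markers into sets.

```python
def completeness(expected_markers, marker_counts):
    """Percentage of expected markers observed at least once.

    expected_markers: iterable of marker identifiers for the chosen clade.
    marker_counts: dict mapping marker identifier -> copies found.
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected marker set is empty")
    found = sum(1 for marker in expected if marker_counts.get(marker, 0) > 0)
    return 100.0 * found / len(expected)


# Hypothetical clade with four markers, three of which were detected.
expected = ["markerA", "markerB", "markerC", "markerD"]
counts = {"markerA": 1, "markerB": 2, "markerC": 1}
print(completeness(expected, counts))  # → 75.0
```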
Contamination Estimation
Contamination estimates the extent to which extraneous DNA is present in the assembly. It is calculated by counting the proportion of marker genes that are found in more than one copy. If a marker gene appears twice, it suggests that at least two genomes share that gene; however, other scenarios, such as gene duplication events, can also increase copy number. CheckM employs statistical models to adjust for such biological variation and to minimize false contamination signals.
Contamination scores are expressed as a percentage of the assembly that is likely derived from other organisms. Assemblies with contamination >5 % are typically flagged as problematic for downstream functional analyses.
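A simplified version of this calculation is sketched below. It is an approximation under stated assumptions rather than CheckM's exact procedure: every copy beyond the first is counted as one extra copy, markers are treated individually instead of in sets, and no correction for genuine gene duplication is applied. The function name and marker identifiers are hypothetical.

```python
def contamination(expected_markers, marker_counts):
    """Percentage of extra marker copies relative to the expected marker count.

    A marker seen n times contributes n - 1 extra copies, so the value
    can exceed 100 for heavily mixed assemblies.
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected marker set is empty")
    extra = sum(max(marker_counts.get(marker, 0) - 1, 0) for marker in expected)
    return 100.0 * extra / len(expected)


# Hypothetical clade with four markers; markerB occurs twice.
expected = ["markerA", "markerB", "markerC", "markerD"]
counts = {"markerA": 1, "markerB": 2, "markerC": 1}
print(contamination(expected, counts))  # → 25.0
```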
Marker Genes
Marker genes are essential components of the bacterial and archaeal proteome that are highly conserved and usually single‑copy. They include genes involved in central metabolism, DNA replication, transcription, and translation. CheckM’s marker gene database contains thousands of HMM profiles derived from well‑annotated reference genomes. The markers are grouped by taxonomic clades, ensuring that the expected gene set is appropriate for the organism’s evolutionary context.
The use of HMMs allows CheckM to capture divergent homologs of marker genes while maintaining stringent detection criteria. The database is regularly updated to reflect new taxonomic insights and to incorporate newly sequenced genomes.
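To show how HMM search results turn into marker copy numbers, the sketch below counts hits per marker in HMMER's tabular (--tblout) output, where the first column is the target sequence, the third is the query profile, and the fifth is the full-sequence E-value. The function name, the sample rows, and the fixed E-value cut-off are assumptions for illustration; CheckM's own detection criteria are not reproduced here.

```python
from collections import defaultdict


def count_marker_hits(tblout_text, max_evalue=1e-10):
    """Count distinct proteins hitting each marker HMM.

    Assumes hmmsearch --tblout layout: target name, target accession,
    query name, query accession, full-sequence E-value, ...
    The E-value cut-off is a placeholder, not CheckM's criterion.
    """
    hits = defaultdict(set)
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, marker, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            hits[marker].add(target)
    return {marker: len(proteins) for marker, proteins in hits.items()}


# Hypothetical rows: markerA hits two proteins, markerB one (second hit too weak).
sample = """# target name  accession  query name  accession  E-value  score  bias
gene_1 - markerA - 1e-50 170.2 0.1
gene_7 - markerA - 3e-45 150.0 0.0
gene_3 - markerB - 2e-30 100.5 0.2
gene_9 - markerB - 0.5 5.1 0.0
"""
print(count_marker_hits(sample))
```

The resulting dictionary has the shape needed to compute completeness (markers with at least one hit) and contamination (markers with more than one hit).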
Taxonomic Affiliation
Accurate taxonomic assignment is crucial because the choice of marker set depends on the organism’s phylogeny. CheckM addresses this by placing each genome into a reference genome tree, using pplacer and an alignment of conserved marker genes. The position in the tree yields a lineage at multiple ranks, which determines the marker gene set used for that genome.
While CheckM’s classifier is less comprehensive than dedicated taxonomic pipelines, it provides sufficient resolution for the majority of genome quality assessments. Researchers may supplement CheckM’s taxonomy with external classifiers such as GTDB‑Tk or PhyloPhlAn if higher resolution is required.
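The dependence of the marker set on taxonomy can be illustrated with a toy lookup that walks a lineage from its most specific rank upward and returns the first taxon for which a marker set is defined. This is a deliberately simplified sketch: the function name, lineage, and marker identifiers are hypothetical, and CheckM's actual selection is driven by tree placement rather than a dictionary lookup.

```python
def select_marker_set(lineage, marker_sets):
    """Return (taxon, markers) for the most specific taxon with a marker set.

    lineage: list of taxa ordered from broadest to most specific.
    marker_sets: dict mapping taxon name -> list of marker identifiers.
    """
    for taxon in reversed(lineage):
        if taxon in marker_sets:
            return taxon, marker_sets[taxon]
    raise KeyError("no marker set available for any taxon in the lineage")


# Hypothetical data: no order-level set exists, so the class-level set is used.
lineage = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"]
marker_sets = {
    "Bacteria": ["markerA", "markerB"],
    "Gammaproteobacteria": ["markerA", "markerB", "markerC", "markerD"],
}
taxon, markers = select_marker_set(lineage, marker_sets)
print(taxon, len(markers))
```

Falling back to a broader taxon trades resolution for robustness: broader sets contain fewer markers but apply to more diverse genomes.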
Software Architecture
Programming Languages
CheckM is written in Python, and recent releases target Python 3. Performance‑critical work is not implemented inside CheckM itself but delegated to established external programs: HMMER for profile HMM searches, Prodigal for gene prediction, and pplacer for phylogenetic placement. This approach balances developer productivity with execution speed.
Dependencies
CheckM requires several external tools and libraries, including:
- HMMER 3.x for profile Hidden Markov Model searches.
- Prodigal for gene prediction on nucleotide contigs.
- pplacer for placing genomes in the reference tree.
- NumPy, SciPy, Matplotlib, pysam, and DendroPy for numerical computation, plotting, and file handling.
All dependencies are listed in the documentation and can be installed via package managers such as conda or pip. A Docker image is also available, encapsulating all required components for reproducible execution.
Data Sources
CheckM’s marker gene database and reference genome tree are derived from curated public reference genomes. Marker genes are represented by Pfam and TIGRFAMs HMM profiles and cover bacteria and archaea. The reference tree is built from a concatenated alignment of a subset of universal genes, which allows input genomes to be placed phylogenetically.
Installation and Usage
System Requirements
CheckM runs on Linux and macOS operating systems. Memory is the main constraint: placing genomes in the full reference tree requires roughly 40 GB of RAM, while the reduced tree (option -r) lowers this to roughly 14 to 16 GB at some cost in lineage resolution. Recent releases require Python 3. GPU acceleration is not supported, but the use of multi‑core CPUs can significantly reduce processing time.
Installation Steps
CheckM can be installed via conda or pip:
- Conda installation:
conda install -c bioconda checkm-genome
- Pip installation:
pip install checkm-genome
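After installation, the user must download the marker gene database and the reference tree, which are distributed as a separate data archive. Once the archive is unpacked, its location is registered with:
checkm data setRoot /path/to/checkm_data
A complete analysis of a directory of assemblies is then started with:
checkm lineage_wf -x fa -t 8 /path/to/assemblies /path/to/output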
The command above initiates a complete workflow, including taxonomic assignment, marker search, and quality reporting.
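For scripted use, the lineage_wf invocation can be wrapped in Python. The sketch below only assembles the argument list shown above and hands it to subprocess; the function names are hypothetical, and the wrapper assumes the checkm executable is on the PATH and that its reference data have already been configured.

```python
import shutil
import subprocess


def build_lineage_wf_command(bin_dir, out_dir, extension="fa", threads=8):
    """Assemble the argument list for a checkm lineage_wf run."""
    return [
        "checkm", "lineage_wf",
        "-x", extension,      # file extension of the input FASTA files
        "-t", str(threads),   # number of CPU threads
        bin_dir,              # directory containing the assemblies
        out_dir,              # directory for result files
    ]


def run_lineage_wf(bin_dir, out_dir, **options):
    """Run the workflow, failing early if checkm is not installed."""
    if shutil.which("checkm") is None:
        raise RuntimeError("checkm executable not found on PATH")
    subprocess.run(build_lineage_wf_command(bin_dir, out_dir, **options), check=True)


print(" ".join(build_lineage_wf_command("/path/to/assemblies", "/path/to/output")))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.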
Command‑Line Options
Key options include:
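- -x – specifies the file extension of input FASTA files.
- -t – number of CPU threads to use.
- -r – uses the reduced reference tree to lower memory requirements.
- -f – writes the summary table to a file instead of standard output.
- --tab_table – prints the summary as a tab‑separated table.
- -g – treats input files as predicted protein sequences rather than nucleotide contigs.
The input and output directories are given as positional arguments, as in the command above.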
CheckM also provides a taxonomy_wf workflow, which skips placement in the reference tree and instead applies the marker set of a taxonomic group named by the user.
Input and Output Formats
Input files must be in FASTA format. By default, CheckM expects nucleotide contigs and predicts genes itself using Prodigal. Predicted protein sequences can be supplied instead by passing the -g (--genes) option.
Output includes:
- A tabular summary containing completeness, contamination, and marker lineage for each genome, printed to standard output or written to a file.
- Per‑genome result files in the output directory, detailing marker gene presence and copy number.
- Optional plots of quality metrics across a dataset, produced by separate plotting commands.
All output files are well‑documented and can be parsed by downstream scripts for further analysis.
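As an example of downstream parsing, the sketch below reads a tab‑separated summary table and keeps the genomes that pass quality thresholds. The column names "Bin Id", "Completeness", and "Contamination" are assumed to match the tab‑separated summary; the function name, sample rows, and default thresholds are illustrative assumptions.

```python
import csv
import io


def passing_genomes(table_text, min_completeness=90.0, max_contamination=5.0):
    """Return the identifiers of genomes that meet both thresholds.

    Assumes a tab-separated table with 'Bin Id', 'Completeness', and
    'Contamination' columns; verify the header of the actual file.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    keep = []
    for row in reader:
        complete = float(row["Completeness"])
        contaminated = float(row["Contamination"])
        if complete > min_completeness and contaminated <= max_contamination:
            keep.append(row["Bin Id"])
    return keep


# Hypothetical summary rows for three genomes.
example_table = (
    "Bin Id\tMarker lineage\tCompleteness\tContamination\n"
    "bin_1\tk__Bacteria\t96.5\t1.2\n"
    "bin_2\tk__Bacteria\t72.0\t3.4\n"
    "bin_3\tk__Archaea\t93.1\t8.7\n"
)
print(passing_genomes(example_table))  # → ['bin_1']
```

To read a real file, pass its contents to the function, for example via pathlib.Path(...).read_text().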
Applications
Metagenome‑assembled genomes (MAGs)
MAGs are reconstructed from environmental metagenomic sequencing data. Because MAGs are often incomplete or chimeric, CheckM provides an objective measure of their reliability. Researchers use completeness and contamination thresholds (e.g., >90 % completeness and <5 % contamination) to decide which MAGs to retain for downstream analysis.
Single‑cell genomics
Single‑cell amplified genomes (SAGs) can suffer from uneven coverage and amplification bias. CheckM helps assess whether a SAG represents a near‑complete genome or if contamination from other cells or laboratory reagents has occurred. This information guides decisions about further enrichment or resequencing.
Comparative genomics
In comparative studies involving multiple genomes, CheckM ensures that assemblies are comparable by filtering out low‑quality genomes. By standardizing the completeness and contamination parameters, researchers avoid biases in pan‑genome analyses and phylogenetic reconstruction.
Validation and Benchmarking
Benchmark Studies
Several peer‑reviewed studies have evaluated CheckM’s performance against simulated and real datasets. In benchmark tests using mock communities, CheckM achieved an average completeness estimation error of 3 % and a contamination error of 2 %. When compared with the widely used BUSCO pipeline, CheckM demonstrated comparable accuracy for bacterial genomes but outperformed BUSCO for archaeal genomes due to its expanded marker gene set.
Benchmark studies also assessed computational efficiency. CheckM can process 1,000 genomes on a standard 16‑core server in approximately 4 hours, whereas alternative tools such as GUNC or AMBER required 1.5–2 times longer.
Comparison to Other Tools
Other tools that address genome quality include BUSCO, GUNC, and AMBER. BUSCO scores universal single‑copy orthologs and reports duplicated orthologs, but it does not produce a dedicated contamination estimate. GUNC targets chimerism and contamination by checking whether genes on the same contigs agree taxonomically, and is typically used alongside a completeness estimator. AMBER evaluates binning results against a gold‑standard assignment, so it applies to benchmark datasets rather than to novel genomes.
CheckM’s balanced approach of completeness and contamination estimation, coupled with its robust marker database, positions it as a preferred tool for many researchers.
Extensions and Integrations
Workflow Integration
CheckM has been integrated into several popular bioinformatics pipelines:
- MetaWRAP – an automated metagenomic analysis framework that includes a module for MAG quality filtering.
- QIIME 2 – a microbiome analysis platform that uses CheckM to assess draft genomes assembled from shotgun data.
- Nextflow and Snakemake – workflow management systems that incorporate CheckM as a containerized step, enabling reproducible, scalable execution.
These integrations allow users to seamlessly embed quality assessment into their data processing workflows.
Community Contributions
The CheckM codebase is open source, licensed under the GNU General Public License version 3. The community actively contributes bug reports and feature requests. A GitHub repository hosts the source code, issue tracker, and release history. Users can submit pull requests to improve documentation, extend support for new phylogenetic clades, or add compatibility with emerging sequencing technologies.
Limitations and Future Directions
While CheckM is highly accurate for prokaryotic genomes, its performance on eukaryotic genomes is limited because the marker gene database is tailored to bacteria and archaea. Future releases may incorporate eukaryotic markers or allow users to supply custom marker sets.
Another limitation lies in the treatment of gene duplication events. Although CheckM uses statistical models to mitigate false contamination signals, distinguishing between true duplications and contamination remains challenging in highly dynamic genomes.
Future development plans include the expansion of marker gene sets to cover a broader range of environmental lineages, integration of long‑read sequencing error models to improve marker detection, and the implementation of machine‑learning approaches to refine completeness and contamination predictions.