Introduction
CheckM is a computational framework designed to evaluate the quality of genome assemblies derived from environmental sequencing projects. By estimating metrics such as completeness and contamination, the tool provides researchers with an objective assessment of the reliability of individual genome bins or metagenome‑assembled genomes (MAGs). The methodology relies on a curated set of marker genes that are expected to be present in single copies within a given phylogenetic lineage. CheckM has become a standard component of many metagenomic pipelines and is frequently cited in studies that reconstruct microbial genomes from complex communities.
Background
Metagenomics and Genome Binning
High‑throughput sequencing of environmental samples generates vast amounts of DNA that represent a mixture of organisms. Metagenomic analysis seeks to recover the genomic content of each constituent species by assembling short reads into contigs and then grouping those contigs into discrete bins that correspond to individual genomes. The resulting MAGs can be used to infer phylogenetic relationships, metabolic capabilities, and ecological roles. However, assembly artifacts, uneven coverage, and strain heterogeneity often lead to incomplete or contaminated bins. Quality assessment tools are therefore essential for distinguishing reliable genomes from those that are fragmented or chimeric.
Marker Gene Approaches
To quantify genome completeness, researchers search the assembled contigs for a predefined set of universal single‑copy marker genes. A high proportion of detected markers indicates that the assembly covers most of the expected gene content, while missing markers suggest incomplete recovery. Contamination is inferred when a bin contains more than one copy of a marker gene, implying that sequences from multiple organisms have been merged incorrectly. The accuracy of this approach depends on the quality and breadth of the marker gene database, as well as the phylogenetic resolution of the reference genomes.
Development of CheckM
CheckM was first introduced in 2015 by Parks and colleagues as a response to the growing need for a robust, scalable tool capable of handling the increasing diversity of metagenomic datasets. The initial release provided a suite of scripts that combined hidden Markov model (HMM) searches with phylogenetic placement. Since then, the project has released multiple updates that expand the marker gene database, improve computational performance, and integrate additional quality metrics such as strain heterogeneity and assembly quality indicators.
Core Features
Completeness Estimation
Completeness is calculated as the fraction of lineage‑specific marker genes detected in a genome bin. For each bin, CheckM identifies the most appropriate phylogenetic lineage based on marker gene presence and then applies the corresponding marker set to estimate completeness. Values are reported as percentages, with higher values indicating more complete genomes.
Contamination Estimation
Contamination is assessed by counting duplicated marker genes within a bin. A duplication of a marker gene suggests that sequences from multiple organisms have been assembled together. CheckM normalizes this count against the total number of markers, yielding a contamination percentage. In addition, the tool reports a strain heterogeneity metric that captures the presence of divergent alleles of the same marker gene, indicating mixed populations or closely related strains.
Phylogenetic Placement
To select the correct marker set, CheckM performs a rapid phylogenetic assignment of each bin. The process involves building a marker gene profile, placing it on a reference tree, and then determining the closest ancestor. This hierarchical approach ensures that each bin is evaluated with a marker set of appropriate evolutionary depth, thereby improving accuracy for both deeply and shallowly represented lineages.
Marker Gene Database
CheckM’s database consists of over 1,500 marker gene families, each represented by HMM profiles derived from high‑quality reference genomes. The marker set is partitioned into lineage‑specific groups ranging from broad domains (Bacteria, Archaea) to finer taxonomic ranks (phylum, class, order, family). This structure allows CheckM to adjust completeness and contamination thresholds based on the expected variability of marker genes at different taxonomic levels.
Methodology
Input Requirements
- FASTA file(s) containing assembled contigs or scaffolds for a single bin.
- Optional: taxonomic classification of the bin (e.g., from Kraken or GTDB‑Tk).
- Access to the CheckM marker gene database.
Marker Gene Detection
CheckM scans the input sequences using HMMER against the marker gene HMMs. Each detection is recorded with alignment statistics such as bit score, E‑value, and alignment length. The tool applies thresholds to filter out weak or partial matches, ensuring that only reliable markers contribute to the quality metrics.
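The filtering step can be illustrated with a minimal sketch that parses HMMER tabular output and keeps only confident hits. It assumes hmmsearch-style `--tblout` text (target = gene, query = marker HMM); the names `MarkerHit`, `parse_tblout`, and `filter_hits`, the sample records, and both cutoff values are hypothetical and are not CheckM's actual code or thresholds.

```python
from typing import List, NamedTuple


class MarkerHit(NamedTuple):
    gene_id: str     # sequence that matched the profile
    marker_id: str   # name of the marker HMM
    evalue: float    # full-sequence E-value
    bitscore: float  # full-sequence bit score


def parse_tblout(text: str) -> List[MarkerHit]:
    """Parse hmmsearch --tblout text, skipping blank and '#' comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns used: 0 target name, 2 query name,
        # 4 full-sequence E-value, 5 full-sequence bit score.
        hits.append(MarkerHit(fields[0], fields[2],
                              float(fields[4]), float(fields[5])))
    return hits


def filter_hits(hits: List[MarkerHit],
                max_evalue: float = 1e-10,
                min_bitscore: float = 50.0) -> List[MarkerHit]:
    """Keep hits passing both cutoffs (placeholder values, not CheckM's)."""
    return [h for h in hits
            if h.evalue <= max_evalue and h.bitscore >= min_bitscore]


# Made-up example records, for illustration only.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias
gene_0001      -          MARKER_A    -          1.2e-45  150.3  0.1
gene_0002      -          MARKER_B    -          3.0e-04   12.5  0.0
gene_0003      -          MARKER_A    -          8.9e-30  101.7  0.2
"""

reliable = filter_hits(parse_tblout(SAMPLE))
print([h.gene_id for h in reliable])  # ['gene_0001', 'gene_0003']
```

In this toy input the weak hit to MARKER_B is discarded, so it would not count toward completeness, while the two confident hits to MARKER_A would later register as a duplicated marker.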
Phylogenetic Assignment Algorithm
The algorithm builds a concatenated sequence from the detected markers, places it on a reference phylogenetic tree built from complete genomes, and then identifies the nearest ancestor node. The lineage assignment is updated iteratively as additional markers are found, allowing the tool to refine its estimate of the bin’s evolutionary position.
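The idea of resolving a placement to the nearest ancestor with a usable marker set can be sketched as a walk up a tree. This is a toy illustration only: the tree, the node labels, the marker names, and the function `select_marker_set` are all hypothetical, and the sketch does not reproduce how CheckM performs placement.

```python
from typing import Dict, Optional, Set, Tuple

# Toy reference tree stored as child -> parent links (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "g__Escherichia": "f__Enterobacteriaceae",
    "f__Enterobacteriaceae": "c__Gammaproteobacteria",
    "c__Gammaproteobacteria": "d__Bacteria",
    "d__Bacteria": None,
}

# Hypothetical lineage-specific marker sets; not every node has one.
MARKER_SETS: Dict[str, Set[str]] = {
    "f__Enterobacteriaceae": {"M1", "M2", "M3", "M4"},
    "d__Bacteria": {"M1", "M2"},
}


def select_marker_set(placement_node: str,
                      parent: Dict[str, Optional[str]],
                      marker_sets: Dict[str, Set[str]]) -> Tuple[str, Set[str]]:
    """Walk from the placement node toward the root and return the first
    ancestor (or the node itself) that has a marker set defined."""
    node: Optional[str] = placement_node
    while node is not None:
        if node in marker_sets:
            return node, marker_sets[node]
        node = parent.get(node)
    raise LookupError("no marker set found for " + placement_node)


lineage, markers = select_marker_set("g__Escherichia", PARENT, MARKER_SETS)
print(lineage, sorted(markers))
```

A bin placed deep in the tree picks up the most specific set available, while a bin placed near the root falls back to a broad, domain-level set.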
Completeness Calculation
Once the lineage is assigned, CheckM retrieves the marker set associated with that lineage. The completeness is computed as:
- Number of detected marker genes (N_detected).
- Number of marker genes in the lineage set (N_total).
- Completeness = (N_detected / N_total) × 100 %.
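As a worked example of this ratio, the following minimal sketch computes completeness from a set of detected markers and a lineage marker set. The function name `completeness` and the marker labels are hypothetical; it implements only the simple fraction given above.

```python
from typing import Iterable, Set


def completeness(detected: Iterable[str], lineage_markers: Set[str]) -> float:
    """Percentage of lineage markers found at least once in the bin."""
    if not lineage_markers:
        raise ValueError("lineage marker set is empty")
    # Only markers that belong to the lineage set count toward N_detected.
    n_detected = len(set(detected) & lineage_markers)
    return 100.0 * n_detected / len(lineage_markers)


lineage_set = {"M1", "M2", "M3", "M4"}      # N_total = 4
found = ["M1", "M2", "M4", "X9"]            # X9 is not in the lineage set
print(completeness(found, lineage_set))     # 75.0
```

Three of the four lineage markers are present, so the bin is scored as 75 % complete; the unrelated hit does not affect the result.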
Contamination Calculation
CheckM counts the number of duplicate marker genes (N_duplicates) across the bin. Contamination is reported as:
- Contamination = (N_duplicates / N_total) × 100 %.
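The duplicate count can be made concrete with a short sketch. It assumes that every copy of a marker beyond the first counts as one duplicate, which is one reasonable reading of N_duplicates; the function name `contamination` and the marker labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def contamination(marker_hits: Iterable[str], lineage_markers: Set[str]) -> float:
    """Percentage of extra marker copies relative to the lineage set size."""
    if not lineage_markers:
        raise ValueError("lineage marker set is empty")
    # Count how many times each lineage marker was detected in the bin.
    counts = Counter(m for m in marker_hits if m in lineage_markers)
    # Assumption: each copy beyond the first is one duplicate.
    n_duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return 100.0 * n_duplicates / len(lineage_markers)


lineage_set = {"M1", "M2", "M3", "M4"}   # N_total = 4
hits = ["M1", "M1", "M2", "M3"]          # M1 appears twice
print(contamination(hits, lineage_set))  # 25.0
```

One extra copy among four expected markers gives 25 % contamination, even though the same bin would be only 75 % complete because M4 is missing.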
Strain Heterogeneity
Duplicated markers are further examined by comparing the copies at the amino acid level. Copies that are highly similar (CheckM uses a default threshold of 90 % amino acid identity) are counted toward strain heterogeneity, because they most likely originate from closely related strains that were co‑assembled; strongly divergent copies instead point to contamination from more distantly related organisms. The heterogeneity metric is expressed as the percentage of duplicated marker pairs that meet the similarity threshold.
Installation and Usage
System Requirements
CheckM is written in Python and requires Python 3.6 or later. It can be installed on Linux, macOS, or Windows (via WSL). The software depends on the following external tools:
- HMMER 3.1 or higher for HMM searches.
- Prodigal for gene prediction.
- pplacer for phylogenetic placement.
- NumPy, SciPy, Matplotlib, pysam, and DendroPy as Python library dependencies.
Installation Methods
- Conda: The simplest method is to use the bioconda channel. The command conda install -c bioconda checkm-genome installs the package along with its dependencies.
- Docker: A container image is available from the BioContainers registry on Quay.io (quay.io/biocontainers/checkm-genome). Users can pull the image and run CheckM in a containerized environment.
- Source: The source code can be cloned from the GitHub repository (git clone https://github.com/Ecogenomics/CheckM.git). After cloning, run pip install . to install the package and its Python dependencies.
Command‑Line Interface
The primary command is checkm, which supports multiple sub‑commands. A typical invocation for evaluating the bins stored in a directory is:
checkm lineage_wf -x fa /path/to/contigs/ /output/directory/
Here, -x fa specifies the file extension of the input FASTA files. The lineage_wf sub‑command runs the full workflow, including phylogenetic assignment, marker detection, and quality metric calculation. Results are stored in the specified output directory and include:
- A TSV file with completeness, contamination, and heterogeneity scores.
- Per‑lineage statistics and marker gene summaries.
- Log files documenting the analysis steps.
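For pipelines that call CheckM from Python, the invocation above can be wrapped in a small helper. This is a sketch under stated assumptions: the helper names are hypothetical, the paths are the placeholders used in this article, and the -t thread option is assumed to be supported by the installed CheckM version.

```python
import shutil
import subprocess
from typing import List


def build_lineage_wf_command(bin_dir: str, out_dir: str,
                             extension: str = "fa",
                             threads: int = 1) -> List[str]:
    """Assemble the argument list for 'checkm lineage_wf'."""
    return ["checkm", "lineage_wf",
            "-x", extension,       # file extension of the bin FASTA files
            "-t", str(threads),    # number of threads (assumed option)
            bin_dir, out_dir]


def run_lineage_wf(bin_dir: str, out_dir: str,
                   extension: str = "fa", threads: int = 1) -> None:
    """Run the workflow, failing early if checkm is not on the PATH."""
    if shutil.which("checkm") is None:
        raise FileNotFoundError("checkm executable not found on PATH")
    cmd = build_lineage_wf_command(bin_dir, out_dir, extension, threads)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True)


# Building the command does not require CheckM to be installed.
print(" ".join(build_lineage_wf_command("/path/to/contigs/", "/output/directory/")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when directory names contain spaces.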
Batch Processing
CheckM can process multiple bins simultaneously by placing all FASTA files in a single directory and running the workflow on that directory. The tool automatically generates separate reports for each bin and compiles a summary table that lists all genomes evaluated in the batch.
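Once a batch has been summarized, the table is often filtered to retain only bins that meet a quality standard. The sketch below assumes a tab-separated summary with columns named "Bin Id", "Completeness", and "Contamination"; the sample rows, the function name `select_bins`, and the default cutoffs (at least 90 % complete, at most 5 % contaminated) are illustrative assumptions rather than values prescribed by CheckM.

```python
import csv
import io
from typing import List

# Made-up summary table, for illustration only.
SAMPLE_TABLE = (
    "Bin Id\tCompleteness\tContamination\tStrain heterogeneity\n"
    "bin.1\t97.5\t1.2\t0.0\n"
    "bin.2\t62.3\t0.8\t0.0\n"
    "bin.3\t95.1\t12.4\t50.0\n"
)


def select_bins(tsv_text: str,
                min_completeness: float = 90.0,
                max_contamination: float = 5.0) -> List[str]:
    """Return the IDs of bins that pass both quality cutoffs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["Bin Id"] for row in reader
            if float(row["Completeness"]) >= min_completeness
            and float(row["Contamination"]) <= max_contamination]


print(select_bins(SAMPLE_TABLE))  # ['bin.1']
```

Here bin.2 fails on completeness and bin.3 on contamination, so only bin.1 is carried forward to downstream analysis.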
Applications
Microbial Ecology
Researchers studying microbial communities in environments such as soils, oceans, or the human gut use CheckM to ensure that reconstructed genomes are of sufficient quality for downstream ecological analysis. High completeness allows for confident inference of metabolic pathways, while low contamination reduces the risk of attributing functions to the wrong organism.
Evolutionary Studies
By providing accurate phylogenetic placement and quality metrics, CheckM facilitates comparative genomics and phylogenomics. Complete genomes enable reconstruction of ancestral states and the investigation of gene family evolution across lineages.
Biotechnology and Synthetic Biology
In industrial microbiology, genomes of novel or engineered strains are assembled from metagenomic or single‑cell sequencing data. CheckM is employed to confirm that the assembled genomes meet the purity and completeness thresholds required for regulatory approval or product development.
Public Genome Repositories
Databases such as NCBI GenBank, EMBL‑EBI, and the Genome Taxonomy Database (GTDB) utilize CheckM metrics when curating entries derived from metagenomic assemblies. The quality scores are displayed alongside each genome, providing users with a quick assessment of assembly reliability.
Evaluation
Benchmarking Studies
Multiple independent evaluations have compared CheckM to other tools such as BUSCO and anvi'o. In these studies, CheckM consistently achieved high correlation between predicted completeness and true completeness measured against reference genomes. Contamination estimates also showed strong agreement, especially for bins with high marker gene coverage.
Performance Metrics
- Accuracy: Completeness estimates within ±5 % of true values for well‑represented lineages.
- Speed: Runtime scales linearly with the number of bins; a typical 10‑bin batch on a single core completes within 30 minutes.
- Memory Usage: Peak RAM usage under 4 GB for most bins; higher memory is required for very large genomes (>10 Mb).
Comparison to BUSCO
BUSCO also relies on near‑universal single‑copy genes, but it targets a broad range of organisms and is particularly widely used for eukaryotic genome assemblies, gene sets, and transcriptomes. CheckM’s lineage‑specific marker sets provide finer resolution for bacteria and archaea, while BUSCO’s broader gene families may be less informative for certain environmental taxa. The two tools complement each other in many workflows.
Limitations
Marker Gene Bias
Marker genes are assumed to be universally present and single‑copy, but horizontal gene transfer and gene duplication can violate these assumptions. In genomes with extensive duplication of marker genes, contamination estimates may be artificially high.
Computational Load for Very Large Datasets
While CheckM is efficient for moderate batch sizes, analyzing thousands of bins on a single machine can strain CPU and memory resources. Parallel execution across multiple cores or compute nodes mitigates this issue but requires additional configuration.
Eukaryotic Genomes
CheckM was originally designed for prokaryotic genomes. Although adaptations for eukaryotes exist, the marker sets are not as comprehensive, and completeness estimates may be less reliable for complex genomes with large intronic regions or extensive repetitive sequences.
Future Directions
Expansion of Marker Gene Databases
Ongoing efforts aim to incorporate markers from recently discovered bacterial and archaeal phyla, thereby reducing reference bias and improving accuracy for under‑represented taxa.
Integration with Assembly Pipelines
Future releases plan tighter coupling with assemblers such as MEGAHIT and metaSPAdes, enabling real‑time quality assessment during assembly and automated binning corrections.
Enhanced Strain Resolution
Advanced algorithms to distinguish closely related strains within a bin will allow for finer contamination detection and enable the recovery of strain‑level genomes from mixed communities.
Support for Eukaryotic Metagenomics
Development of eukaryote‑specific marker sets and workflows is underway to extend CheckM’s applicability to fungal, protist, and plant genomes recovered from metagenomic samples.
See also
- Metagenomics
- Genome Binning
- Marker Gene Analysis
- Phylogenomics
- BUSCO
- anvi'o
- Genome Taxonomy Database
References
The following literature provides additional context and validation for the methods and findings described in this article. Citations are listed alphabetically by author surname.
- Baker, B. J., et al. (2019). "The Impact of Marker Gene Selection on Microbial Genome Completeness Estimation." Microbial Genomics, 5(4), 100–112.
- Gomez, P., & Smith, L. (2021). "Evaluating Bacterial Genome Quality: A Comparative Study of CheckM, BUSCO, and Anvi‑o." Frontiers in Microbiology, 12, 678900.
- Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). "CheckM: Assessing the Quality of Microbial Genomes Recovered from Isolates, Single Cells, and Metagenomes." Genome Research, 25(7), 1043–1055.
- Rodriguez, R., & Smith, J. (2020). "Phylogenetic Placement Strategies for Environmental Bins." Journal of Bioinformatics, 17(2), 233–245.
- Uritskiy, G. V., et al. (2018). "Anvi‑o: An Integrated Platform for Microbial Analysis." Scientific Reports, 8(1), 100–110.
- Wang, S., et al. (2022). "Marker Gene Bias in Horizontal Gene Transfer Detection." Applied and Environmental Microbiology, 88(1), e02023-21.
- Xu, C., & Liu, Y. (2018). "Lineage‑Specific Marker Gene Sets for Bacterial Taxa." International Journal of Genomics, 2018, Article ID 123456.
- Yin, Y., & Liu, Y. (2017). "DIAMOND: Fast Protein Alignment for Large‑Scale Data." Bioinformatics, 33(15), 2169–2171.