CheckM

Introduction

CheckM is a computational framework designed to evaluate the quality of microbial genome assemblies, particularly those derived from metagenomic data. By estimating genome completeness and contamination, the tool assists researchers in determining the reliability of draft genomes obtained through sequencing projects. The method relies on phylogenetic marker genes to infer these metrics and is widely adopted in microbial genomics, comparative genomics, and environmental microbiology studies.

History and Background

Origins in Metagenomic Assembly

The rise of high-throughput sequencing technologies in the early 2010s enabled the reconstruction of microbial genomes directly from environmental samples. Metagenome-assembled genomes (MAGs) became a cornerstone of studies exploring microbial diversity. However, the fragmented nature of MAGs and the difficulty in assessing their quality created a need for systematic evaluation tools. CheckM emerged in response to this challenge, providing a standardized approach to quality assessment.

Development Milestones

CheckM was first introduced in 2015 as a standalone software package written in Python. Subsequent versions incorporated enhancements such as improved marker gene databases, streamlined installation, and compatibility with various assembly pipelines. CheckM2, a later successor, applies machine learning models to predict completeness and contamination, with the aim of improving accuracy for lineages that are poorly represented among reference genomes.

Community Adoption

Since its initial release, CheckM has been integrated into numerous metagenomic workflows, including the Human Microbiome Project, Earth Microbiome Project, and marine microbiome studies. The tool’s widespread use has been reflected in a growing number of peer-reviewed publications that cite its methodology for genome quality control.

Key Concepts

Genome Completeness

Completeness represents the proportion of expected single-copy marker genes present in an assembly. A high completeness score indicates that the genome likely contains most of its constituent genes, while a low score suggests missing genomic fragments. CheckM calculates completeness by comparing the presence of markers against a curated reference set for the organism’s taxonomic clade.

Genome Contamination

Contamination reflects the presence of duplicate or additional marker genes that may arise from co-assembly of multiple organisms or from assembly artifacts. The metric is derived from the frequency of multiple copies of single-copy markers relative to the expected count. Elevated contamination signals potential misassembly or contamination with unrelated sequences.

Marker Gene Sets

CheckM utilizes a hierarchical database of marker genes that are phylogenetically conserved and typically present as single copies within a genome. These markers are organized by taxonomic rank, allowing the tool to adapt its reference set to the specific lineage of the input assembly. The database includes thousands of markers across bacteria and archaea, with updates released regularly.

Phylogenetic Placement

Determining the appropriate marker set requires estimating the taxonomic affiliation of the genome. CheckM performs this by constructing a phylogenetic profile from the detected markers and aligning it against a reference phylogeny. The placement informs the subsequent calculations of completeness and contamination.

Software Architecture

Programming Environment

CheckM is implemented in Python, with current releases running on Python 3, and relies on libraries such as NumPy and SciPy for data handling and numerical computations. It also calls external programs, including HMMER for marker detection. The tool can be installed through Conda for dependency management, which simplifies setup on Linux and macOS.

Input and Output Formats

Users provide genome assemblies as FASTA files, usually one file per genome or bin; raw FASTQ reads must be assembled before assessment. Output consists of a tab-delimited summary file containing completeness, contamination, and other metrics for each input genome, as well as optional detailed reports in JSON format. A set of visualizations, including completeness-contamination scatter plots, can be generated using matplotlib.
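As an illustration of working with the tab-delimited summary, the sketch below reads such a table into a dictionary. The column names `Bin Id`, `Completeness`, and `Contamination` are assumptions made for this example, and the two rows are invented data; check the header of your own output file before relying on them.

```python
import csv
import io


def read_quality_table(handle):
    """Return {genome_id: (completeness, contamination)} from a tab-delimited table.

    The column names used here are assumptions for illustration.
    """
    results = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        results[row["Bin Id"]] = (
            float(row["Completeness"]),
            float(row["Contamination"]),
        )
    return results


# Invented example data standing in for a real summary file.
example = io.StringIO(
    "Bin Id\tCompleteness\tContamination\n"
    "bin_001\t97.50\t1.20\n"
    "bin_002\t62.10\t8.75\n"
)
table = read_quality_table(example)
```

With a real file, the same function can be called on `open("quality.tsv", newline="")`, where the file name is whatever the summary was written to.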

Workflow Integration

CheckM can be invoked as a standalone command or integrated into larger bioinformatics pipelines through wrappers in Snakemake, Nextflow, or workflow scripts. Its modular design allows selective execution of marker detection, phylogenetic placement, or metric calculation stages.
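A pipeline wrapper often just assembles the command line and hands it to the shell. The sketch below builds a call to the `lineage_wf` workflow; the flags shown (`-x`, `-t`, `--tab_table`, `-f`) reflect the CheckM command-line help as best understood here and should be confirmed with `checkm lineage_wf -h` for the installed version. The helper name and the directory and file names are hypothetical.

```python
import shlex


def build_lineage_wf_command(bin_dir, out_dir, extension="fa", threads=4, table_path=None):
    """Build an argument list for `checkm lineage_wf` (flags assumed, see lead-in)."""
    cmd = ["checkm", "lineage_wf", "-x", extension, "-t", str(threads)]
    if table_path is not None:
        # Ask for a tab-delimited summary written to a file.
        cmd += ["--tab_table", "-f", table_path]
    cmd += [bin_dir, out_dir]
    return cmd


cmd = build_lineage_wf_command("bins", "checkm_out", threads=8, table_path="quality.tsv")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems when directory names contain spaces.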

Methodology

Marker Identification

The first step involves scanning the input assembly for marker genes using Hidden Markov Models (HMMs). Each marker HMM is searched against the assembly with HMMER, and matches are filtered based on E-value and alignment coverage thresholds to reduce false positives.
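The filtering step can be pictured as a simple predicate over search hits. The following is a minimal sketch, not CheckM's own code: the `Hit` record, the marker and gene names, and the default cutoffs (`max_evalue`, `min_coverage`) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    marker: str          # name of the marker HMM (hypothetical)
    gene: str            # identifier of the matched gene (hypothetical)
    evalue: float        # E-value reported by the search
    aligned_length: int  # number of model positions covered by the alignment
    model_length: int    # total length of the marker model


def filter_hits(hits, max_evalue=1e-10, min_coverage=0.7):
    """Keep hits that pass both the E-value and the alignment coverage cutoff."""
    kept = []
    for hit in hits:
        coverage = hit.aligned_length / hit.model_length
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            kept.append(hit)
    return kept


hits = [
    Hit("marker_A", "gene_1", 1e-30, 95, 100),  # passes both filters
    Hit("marker_B", "gene_2", 1e-3, 90, 100),   # E-value too weak
    Hit("marker_C", "gene_3", 1e-40, 40, 100),  # alignment covers too little of the model
]
kept = filter_hits(hits)
```

Applying both criteria together is what removes short, spurious matches that happen to score well over a small region of the model.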

Phylogenetic Profiling

Detected markers are compared against a precomputed phylogenetic tree. The tool identifies the clade that maximizes the number of shared markers, assigning the genome to a specific taxonomic level (species, genus, family, etc.).
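The selection rule described above, choosing the clade that shares the most markers with the genome, can be sketched as a set intersection. This is a deliberately simplified illustration of that rule, not a reimplementation of CheckM's tree-based placement; the clade names and marker identifiers are invented for the example.

```python
def best_clade(detected, clade_markers):
    """Return the clade whose marker set shares the most markers with `detected`.

    Clades are visited in sorted order so that ties resolve deterministically.
    """
    if not clade_markers:
        raise ValueError("no clades to choose from")
    return max(sorted(clade_markers), key=lambda clade: len(detected & clade_markers[clade]))


# Invented marker sets for three hypothetical clades.
clade_markers = {
    "Archaea": {"a1", "a2"},
    "Bacteria": {"m1", "m2"},
    "Proteobacteria": {"m1", "m2", "m3", "m4"},
}
detected = {"m1", "m2", "m3"}
chosen = best_clade(detected, clade_markers)
```

Here the genome shares three markers with the hypothetical Proteobacteria set, two with Bacteria, and none with Archaea, so the most specific matching clade is selected.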

Completeness Calculation

For the assigned clade, the tool counts the number of unique marker genes present. Completeness is then calculated as: (number of unique markers detected / total markers for the clade) × 100 percent.
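The formula above translates directly into a few lines of code. The sketch below assumes the marker copy numbers are already available as a dictionary; the marker names and counts are invented, and the function implements only the simplified formula given in the text.

```python
def completeness(marker_counts, clade_markers):
    """Percentage of the clade's markers found at least once in the genome."""
    if not clade_markers:
        raise ValueError("the clade marker set is empty")
    present = sum(1 for marker in clade_markers if marker_counts.get(marker, 0) >= 1)
    return 100.0 * present / len(clade_markers)


# Invented example: four expected markers, two of them detected.
clade_markers = {"m1", "m2", "m3", "m4"}
marker_counts = {"m1": 1, "m2": 2, "m3": 0}
score = completeness(marker_counts, clade_markers)
```

Note that a marker detected twice still counts only once here; extra copies affect contamination, not completeness.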

Contamination Estimation

Contamination is estimated by summing the counts of duplicated marker genes and dividing by the expected count of unique markers. The formula used is: (sum of marker duplicates / total markers for the clade) × 100 percent.
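As a worked instance of this formula, the sketch below counts every copy of a marker beyond the first as a duplicate. The marker names and copy numbers are invented, and the function follows the simplified formula in the text rather than CheckM's internal implementation.

```python
def contamination(marker_counts, clade_markers):
    """Percentage of extra marker copies relative to the number of expected markers."""
    if not clade_markers:
        raise ValueError("the clade marker set is empty")
    # Each copy beyond the first counts as one duplicate.
    duplicates = sum(max(marker_counts.get(marker, 0) - 1, 0) for marker in clade_markers)
    return 100.0 * duplicates / len(clade_markers)


# Invented example: m2 has one extra copy and m4 has two extra copies.
clade_markers = {"m1", "m2", "m3", "m4"}
marker_counts = {"m1": 1, "m2": 2, "m4": 3}
score = contamination(marker_counts, clade_markers)
```

Because several extra copies of the same marker each add to the total, the value can exceed 100 percent for heavily mixed bins.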

Quality Categorization

CheckM results are commonly used to classify genomes into four quality tiers based on completeness and contamination thresholds, ranging from high-quality (more than 90% completeness with low contamination) down to a lowest tier with more than 20% contamination. These categories aid researchers in filtering results for downstream analyses.
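Tiering of this kind amounts to checking a genome against an ordered list of thresholds. In the sketch below the tier names and cutoffs are placeholders chosen for illustration only; they are not the thresholds of this article or of any particular standard, and should be replaced with the values a given study adopts.

```python
# Placeholder tiers: (name, minimum completeness, maximum contamination), strictest first.
EXAMPLE_TIERS = [
    ("high", 90.0, 5.0),
    ("medium", 50.0, 10.0),
]


def classify(completeness, contamination, tiers=EXAMPLE_TIERS, fallback="low"):
    """Return the first tier whose completeness and contamination limits are both met."""
    for name, min_completeness, max_contamination in tiers:
        if completeness > min_completeness and contamination < max_contamination:
            return name
    return fallback


labels = [classify(95.0, 2.0), classify(70.0, 8.0), classify(95.0, 12.0)]
```

Listing the tiers from strictest to most lenient means a genome receives the best label it qualifies for, and anything that fails every tier falls through to the fallback.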

Applications

Metagenome-Assembled Genomes (MAGs)

CheckM is extensively used to vet MAGs prior to functional annotation and comparative genomics. Researchers apply the tool to discard low-quality assemblies that could skew ecological interpretations or lead to erroneous phylogenetic placements.

Single-Cell Genomics

In single-cell sequencing projects, genome coverage may be uneven. CheckM assists in determining whether amplified genomes contain sufficient genetic content for reliable downstream analyses, such as metabolic reconstruction.

Phylogenomics

Phylogenomic studies require high-quality genomes to avoid misleading tree topologies. By filtering assemblies based on CheckM metrics, scientists can assemble cleaner datasets for robust phylogenetic inference.

Genome Catalog Construction

Large-scale genome catalogues, such as the Unified Human Gastrointestinal Genome (UHGG) collection, rely on stringent quality controls. CheckM provides an objective framework to include only genomes that meet predefined completeness and contamination thresholds.

Environmental Monitoring

Environmental DNA (eDNA) monitoring of microbial communities often produces draft genomes. CheckM helps ensure that the genomes used in biomonitoring reflect true biological diversity rather than assembly artifacts.

Limitations

Marker Set Coverage

While CheckM’s marker databases are comprehensive, certain lineages, especially poorly characterized archaea or extremophiles, may have incomplete marker sets, potentially biasing completeness estimates.

Horizontal Gene Transfer

High rates of horizontal gene transfer can introduce additional copies of marker genes, inflating contamination scores. Distinguishing genuine contamination from such events remains challenging.

Dependence on Accurate Phylogenetic Placement

Misplacement of a genome in the phylogenetic tree can lead to the use of an inappropriate marker set, thereby affecting metric calculations. In complex communities, ambiguous placement may occur.

Computational Resources

Marker detection via HMMER is computationally intensive for large assemblies, requiring significant CPU time and memory, which may limit scalability in very large metagenomic projects.

Future Directions

Integration with Machine Learning

CheckM2 incorporates machine learning models to predict completeness and contamination from features of the annotated genome, potentially reducing reliance on lineage-specific marker sets.

Expansion to Eukaryotes

Efforts are underway to extend the framework to assess eukaryotic genome assemblies by identifying conserved eukaryotic marker genes and adjusting contamination metrics accordingly.

Real-Time Assessment

Developments in streaming analytics may enable CheckM to provide real-time quality metrics during assembly, allowing dynamic adjustment of assembly parameters.

Enhanced Marker Databases

Continual updates to the marker database, including the addition of lineage-specific markers discovered through long-read sequencing, will improve the accuracy of completeness estimations across diverse taxa.

Related Tools

BUSCO (Benchmarking Universal Single-Copy Orthologs): Provides completeness estimates based on lineage-specific orthologs but does not explicitly estimate contamination.
QUAST (Quality Assessment Tool for Genome Assemblies): Offers assembly metrics such as N50, misassembly rates, and reference-based alignment statistics.
RefineM: A pipeline that refines genome bins before quality assessment.
MaxBin: Automates binning of metagenomic contigs; the resulting bins are commonly evaluated with CheckM.
