Model card

Model description

This model implements varKoding, a universal DNA identification method that uses exceptionally low-coverage genome sequencing data to create two-dimensional images representing the genomic signature of species. The model was trained on both varKodes and ranked frequency Chaos Game Representations (rfCGRs), enabling it to make predictions using either image representation format.

varKodes and rfCGRs transform k-mer frequency data from raw genomic reads into image representations that can be classified using neural networks. The model architecture is a Vision Transformer (ViT) based on timm/vit_large_patch32_224.orig_in21k, initialized with random weights and trained using the fastai library.
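As a rough illustration of how k-mer frequencies can become an image, the sketch below counts k-mers in a handful of reads, places each k-mer in a grid cell by recursive quadrant subdivision (in the spirit of a Chaos Game Representation), and replaces raw counts with normalized dense ranks. It is not the varKoder implementation: the corner assignment, the choice of k, the bit ordering, and every function name here are assumptions made for the example.

```python
from collections import Counter

import numpy as np

# Assumed corner assignment: each base contributes one bit to x and one to y.
_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_counts(reads, k):
    """Count k-mers made only of A, C, G, T across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(base in _BITS for base in kmer):
                counts[kmer] += 1
    return counts


def cgr_cell(kmer):
    """Map a k-mer to a (row, col) cell; each base selects a nested quadrant."""
    x = y = 0
    for base in kmer:
        bx, by = _BITS[base]
        x = (x << 1) | bx
        y = (y << 1) | by
    return y, x


def ranked_cgr(reads, k=4):
    """Return a 2^k x 2^k image whose pixels are dense ranks scaled to [0, 1]."""
    side = 2 ** k
    img = np.zeros((side, side), dtype=float)
    for kmer, n in kmer_counts(reads, k).items():
        img[cgr_cell(kmer)] = n
    # Dense ranking: equal counts share a rank, so tied cells get equal brightness.
    values, dense_rank = np.unique(img.ravel(), return_inverse=True)
    top = max(len(values) - 1, 1)
    return (dense_rank / top).reshape(side, side)


image = ranked_cgr(["ACGTACGT"], k=2)
print(image.shape)
```

Ranking makes the image depend on the ordering of k-mer frequencies and not on their absolute values, which is one plausible reason such a representation tolerates varying sequencing depth.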

For more information about the varKoding methodology, visit the varKoder project.

Intended uses & limitations

This model is designed for universal DNA barcoding and species identification across the tree of life using minimal genomic data (as little as 10 Mbp). It achieves:

96% precision and 95% recall for species identification
Robust performance across sequencing platforms
Effectiveness with exceptionally low-coverage data (~0.0002×–0.107× genome coverage)
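
To put the coverage figures in context, coverage is the number of sequenced bases divided by the genome size. The short calculation below shows what 10 Mbp of reads amounts to for a few genome sizes; the genome sizes are hypothetical values chosen for illustration, not taxa from the training data.

```python
def genome_coverage(sequenced_bp: float, genome_size_bp: float) -> float:
    """Return average coverage (x) as sequenced bases over genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sequenced_bp / genome_size_bp


# 10 Mbp of reads against hypothetical genome sizes.
for genome_size in (100e6, 1e9, 10e9):
    cov = genome_coverage(10e6, genome_size)
    print(f"{genome_size:.0e} bp genome: {cov:.4f}x")
```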

The model is intended for:

Species identification from low-coverage genome skim data
Environmental DNA (eDNA) analysis
Forensic applications requiring species authentication
Biodiversity monitoring using minimal sequencing resources

Limitations include dependence on how well each taxon is represented in the training data and potentially reduced accuracy with highly degraded DNA samples.

Training and evaluation data

This model represents the most comprehensive implementation of varKoding, trained on all taxa available in the NCBI Sequence Read Archive (SRA). The training encompasses:

Universal coverage across prokaryotes and eukaryotes
Validation on diverse datasets including fungi, plants, animals, and bacteria
Testing on over 1,100 families across different kingdoms
Performance optimization using multi-label classification to handle uncertainty

The complete datasets and metadata are available via Harvard Dataverse at: https://doi.org/10.7910/DVN/IMOX0S

Output interpretation

Outputs are multi-label predictions with confidence scores.
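
Because the predictions are multi-label, several labels can be accepted for one sample. A minimal sketch of selecting labels by confidence follows, assuming the scores are already available as a mapping from label to a value between 0 and 1. The `select_labels` helper, the example scores, and the 0.5 threshold are all assumptions for illustration, not values prescribed by the model.

```python
from typing import Dict, List, Tuple


def select_labels(scores: Dict[str, float],
                  threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Keep labels whose score meets the threshold, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores for a single sample.
example_scores = {
    "Taxonomy_family:7042.Curculionidae": 0.93,
    "Platform:ILLUMINA": 0.88,
    "LibraryStrategy:WGS": 0.41,
}
for label, score in select_labels(example_scores):
    print(f"{score:.2f}  {label}")
```

With a multi-label output, a sample can legitimately receive no label at all when nothing clears the threshold, which is one way uncertainty shows up.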

Taxonomic prediction labels have the following format:

"Taxonomy_{rank}:{ID}.{name}"

For example, "Taxonomy_family:7042.Curculionidae" is a prediction at the family level: the NCBI Taxonomy ID is 7042 and the taxon name is Curculionidae.

Check NCBI Taxonomy for more details on taxon IDs and names.
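
The taxonomic label format above can be split into its parts with a regular expression. This is a small sketch; the `TaxonLabel` type and `parse_taxon_label` function are names introduced for this example and are not part of varKoder.

```python
import re
from typing import NamedTuple, Optional


class TaxonLabel(NamedTuple):
    rank: str
    taxon_id: int
    name: str


# Matches "Taxonomy_{rank}:{ID}.{name}"; the name may itself contain dots or spaces.
_TAXON_RE = re.compile(r"^Taxonomy_(?P<rank>[^:]+):(?P<id>\d+)\.(?P<name>.+)$")


def parse_taxon_label(label: str) -> Optional[TaxonLabel]:
    """Return the parsed label, or None if it is not a taxonomic label."""
    match = _TAXON_RE.match(label)
    if match is None:
        return None
    return TaxonLabel(match["rank"], int(match["id"]), match["name"])


print(parse_taxon_label("Taxonomy_family:7042.Curculionidae"))
```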

Besides taxonomy, the model was trained on a few sample properties, such as library type and sequencing platform. These labels also use a {key}:{value} format. For example, "Platform:ILLUMINA" means that the sequencing platform is Illumina, and "LibraryStrategy:WGS" means that it is a whole-genome sequencing library. These predictions help diagnose problems, since the model may be less accurate under some conditions (e.g., high-error Nanopore sequencing for species-level identification).
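
Since every label follows a {key}:{value} pattern, predictions can be grouped by key to separate taxonomic calls from diagnostic properties. The sketch below assumes the predicted labels arrive as a list of strings; `group_labels` is a hypothetical helper, not a varKoder function.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_labels(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Group labels by the text before the first colon."""
    grouped = defaultdict(list)
    for label in labels:
        key, sep, value = label.partition(":")
        if not sep:
            continue  # skip anything that does not follow key:value
        grouped[key].append(value)
    return dict(grouped)


predicted = [
    "Taxonomy_family:7042.Curculionidae",
    "Platform:ILLUMINA",
    "LibraryStrategy:WGS",
]
print(group_labels(predicted))
```

A grouped view makes it easy to check, for instance, the predicted platform before trusting a species-level call.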

Citation

Please cite the two references below if using this model.

A thorough description of the training datasets can be found in:

Asprino RC, ..., de Medeiros BAS. 2025. A curated benchmark dataset for molecular identification based on genome skimming. Scientific Data. https://doi.org/10.1038/s41597-025-05230-2

The method was developed and validated as described in:

de Medeiros BAS et al. 2025. A composite universal DNA signature for the tree of life. Nature Ecology & Evolution. https://doi.org/10.1038/s41559-025-02752-1