# TReconLM
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
## Model Variants
We provide pretrained and fine-tuned model checkpoints for the following ground-truth sequence lengths:
- L = 60
- L = 110
- L = 180
Each model supports reconstruction from cluster sizes between 2 and 10.
## How to Use
A Colab notebook, `trace_reconstruction.ipynb`, is available in our GitHub repository; it demonstrates how to load the model and run inference on our benchmark datasets. The test datasets used in the notebook can be downloaded from Hugging Face.
## Training Details
- Models are pretrained on synthetic data generated by sampling ground-truth sequences of length L uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from the integers 2 through 10.
- Models are fine-tuned on real-world sequencing data (Noisy-DNA and Microsoft datasets).
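As a concrete illustration, the synthetic data generation described above can be sketched as follows. This is a minimal sketch, not the paper's actual generator: all function and parameter names here (`corrupt`, `make_cluster`, etc.) are our own, and the exact sampling procedure in the paper may differ in detail.

```python
import random

ALPHABET = "ACGT"  # quaternary alphabet

def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Independently apply insertions, deletions, and substitutions
    at each position of `seq`, producing one noisy trace."""
    out = []
    for base in seq:
        if rng.random() < p_ins:   # insert a uniformly random base before this position
            out.append(rng.choice(ALPHABET))
        if rng.random() < p_del:   # delete this base
            continue
        if rng.random() < p_sub:   # substitute with a different base
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def make_cluster(L, rng):
    """Sample a ground-truth sequence of length L, channel error rates,
    and a cluster of noisy traces, mirroring the training distribution."""
    truth = "".join(rng.choice(ALPHABET) for _ in range(L))
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    k = rng.randint(2, 10)  # cluster size between 2 and 10
    traces = [corrupt(truth, p_ins, p_del, p_sub, rng) for _ in range(k)]
    return truth, traces

rng = random.Random(0)
truth, traces = make_cluster(60, rng)
```

The model's task is then to recover `truth` given only `traces`.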
For full experimental details, see our paper.
## Limitations
Each model is trained for a fixed ground-truth sequence length (L = 60, 110, or 180) and may perform worse on other lengths, or when the test data distribution differs significantly from the training distribution.