Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome
Abstract
The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a greater read length increases the chance that reads are uniquely mapped to the reference genome, a quantitative analysis of the influence of read length on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 to 1000 base pairs. We use the proportion of non-singleton k-mers to evaluate the mappability of reads of the corresponding read length. We observe that the proportion of non-singletons decreases slowly with increasing k and can be fitted by piecewise power-law functions with different exponents in different k ranges. A faster decay at smaller values of k indicates more limited gains for read lengths > 200 base pairs. The frequency distributions of k-mers exhibit long tails in a power-law-like trend, and rank-frequency plots exhibit a concave Zipf's curve. The locations of the most frequent 1000-mers comprise 172 kilobase-ranged regions, including four large stretches on chromosomes 1 and X, containing genes with biomedical implications. Even a read length of 1000 would be insufficient to reliably sequence these specific regions.
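The mappability measure described in the abstract can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the function name `nonsingleton_fraction` is hypothetical, the proportion is counted over k-mer positions rather than distinct k-mers, and only the forward strand is considered. It is meant for toy sequences, not as the authors' pipeline for a whole genome.

```python
from collections import Counter


def nonsingleton_fraction(sequence: str, k: int) -> float:
    """Return the fraction of k-mer positions whose k-mer occurs more than once.

    Assumption: the proportion is taken over positions (not distinct k-mers)
    and only the forward strand is scanned.
    """
    n_positions = len(sequence) - k + 1
    if k <= 0 or n_positions <= 0:
        raise ValueError("k must be positive and no longer than the sequence")

    # Count every overlapping k-mer in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(n_positions))

    # A position is a non-singleton if its k-mer appears at least twice.
    repeated_positions = sum(c for c in counts.values() if c > 1)
    return repeated_positions / n_positions
```

Calling `nonsingleton_fraction(seq, k)` for a range of k values on the same sequence gives a curve that could then be examined for the slow decay the abstract reports.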
 Publication:

arXiv e-prints
 Pub Date:
 August 2013
 arXiv:
 arXiv:1308.6240
 Bibcode:
 2013arXiv1308.6240L
 Keywords:

 Quantitative Biology - Genomics
 E-Print:
 5 figures