Centromere reference models for human chromosomes X and Y satellite arrays
Abstract
The human genome remains incomplete, with multi-megabase-sized gaps representing the endogenous centromeres and other heterochromatic regions. These regions are commonly enriched with long arrays of near-identical tandem repeats, known as satellite DNAs, that offer a limited number of variant sites to differentiate individual repeat copies across millions of bases. This substantial sequence homogeneity challenges available assembly strategies, and as a result, centromeric regions are omitted from ongoing genomic studies. To address this problem, we present a locally ordered assembly across two haploid human satellite arrays on chromosomes X and Y, resulting in an initial linear representation of 3.83 Mb of centromeric DNA within an individual genome. To further expand the utility of each centromeric reference sequence, we evaluate sites within the arrays for short-read mappability and chromosome specificity. As satellite DNAs evolve in a concerted manner, we use these centromeric assemblies to assess the extent of sequence variation among 372 individuals from distinct human populations. In doing so, we identify two ancient satellite array variants in both X and Y centromeres, as determined by array length and sequence composition. This study provides an initial linear representation and comprehensive sequence characterization of a regional centromere and establishes a foundation to extend genomic characterization to these sites as well as to other repeat-rich regions within complex genomes.
- Publication: arXiv e-prints
- Pub Date: June 2013
- DOI: 10.48550/arXiv.1307.0035
- arXiv: arXiv:1307.0035
- Bibcode: 2013arXiv1307.0035M
- Keywords: Quantitative Biology - Genomics
- E-Print: 31 pages, 5 figures, 9 pages of supplemental material