Using positional distribution to identify splicing elements and predict pre-mRNA processing defects in human genes
We present an intuitive strategy for predicting the effect of sequence variation on splicing. In contrast to transcriptional elements, splicing elements appear to be strongly position dependent. We demonstrated that exonic binding of the normally intronic splicing factor, U2AF65, inhibits splicing. Reasoning that the positional distribution of a splicing element is a signature of its function, we developed a method for organizing all possible sequence motifs into clusters based on the genomic profile of their positional distribution around splice sites. Binding sites for serine/arginine rich (SR) proteins tended to be exonic whereas heterogeneous ribonucleoprotein (hnRNP) recognition elements were mostly intronic. In addition to the known elements, novel motifs were returned and validated. This method was also predictive of splicing mutations. A mutation in a motif creates a new motif that sometimes has a similar distribution shape to the original motif and sometimes has a different distribution. We created an intraallelic distance measure to capture this property and found that mutations that created large intraallelic distances disrupted splicing in vivo whereas mutations with small distances did not alter splicing. Analyzing the dataset of human disease alleles revealed known splicing mutants to have high intraallelic distances and suggested that 22% of disease alleles that were originally classified as missense mutations may also affect splicing. This category together with mutations in the canonical splicing signals suggest that approximately one third of all disease-causing mutations alter pre-mRNA splicing.