The Seoighe group carries out research into various aspects of genomics and molecular evolution, including questions that can be tackled using existing data as well as collaborative research with experimental scientists. Recent work has included development of tools for the analysis of deep sequencing data from the T cell receptor and immunoglobulins, application of evolutionary models to infer immune epitopes in viruses, analysis of the diversity of mRNA splicing in the thymus and its potential importance for the avoidance of autoimmunity, identification of genetic variants with transcriptome-wide effects on RNA processing, and the development of computational tools that can be used to disentangle the effects of cell type composition and gene expression variation using gene expression or methylation data.
A sample of some of our work is provided below.
Selective constraints acting on stop codons
The evolution of DNA sequences is a great example of the application of continuous-time Markov processes in biology. In the case of sequences that code for proteins, these models typically describe the rate at which triplets of nucleotides that encode amino acids (i.e. codons) change over evolutionary time. There are far more codons than there are amino acids and consequently some codon substitutions do not alter the encoded amino acid. Comparison of the relative rates of mutations that alter the amino acid to those that leave the amino acid unchanged has traditionally been used to assess the evolutionary selection pressure affecting proteins, highlighting fast-evolving proteins or protein domains that are evolving adaptively, as well as critically important proteins that often evolve extremely slowly. These models do not consider the stop codons, which determine the point at which mRNA translation terminates. We developed a version of the codon substitution models that includes the stop codons and applied it to a large collection of mammalian coding sequence alignments. Surprisingly, we found that substitutions between the alternative stop codons occur less frequently than expected under a neutral model of evolution in the majority of mammalian protein-coding genes. Several studies have reported an involvement of the stop codon in regulating protein abundance. Our results, published in the Journal of Molecular Evolution (and freely available on bioRxiv), suggest that such functional roles for the stop codon may be widespread.
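As a rough illustration of the idea (a minimal sketch, not the published implementation), the instantaneous rate matrix of a codon model can be extended from the 61 sense codons to all 64 codons. The parameter names kappa (transition/transversion ratio), omega (nonsynonymous/synonymous ratio) and phi (relative rate of stop-to-stop substitutions) are assumptions made for this example, as are the simplifications that all codons are equally frequent and that changes between sense and stop codons have rate zero.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def rate(src, dst, kappa, omega, phi):
    """Relative instantaneous rate from codon src to codon dst."""
    diffs = [(a, b) for a, b in zip(src, dst) if a != b]
    if len(diffs) != 1:
        return 0.0  # only single-nucleotide changes are allowed
    src_stop, dst_stop = CODE[src] == "*", CODE[dst] == "*"
    if src_stop != dst_stop:
        return 0.0  # assumption: sense <-> stop changes are not tolerated
    r = kappa if diffs[0] in TRANSITIONS else 1.0
    if src_stop:
        return r * phi  # stop -> stop; phi < 1 would indicate constraint
    if CODE[src] != CODE[dst]:
        return r * omega  # nonsynonymous change
    return r  # synonymous change


def build_q(kappa=2.0, omega=0.2, phi=0.5):
    """Build a 64 x 64 rate matrix as nested dicts; each row sums to zero."""
    q = {s: {d: rate(s, d, kappa, omega, phi) for d in CODONS if d != s}
         for s in CODONS}
    for s in CODONS:
        q[s][s] = -sum(q[s].values())
    return q
```

In this parameterisation, phi = 1 corresponds to stop codon substitutions occurring at the same rate as synonymous substitutions, so an estimate of phi below 1 is the kind of signal described above.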
Mutation rate variation
The rate of germline mutation is a key parameter in molecular evolution and population genetics. As the ultimate source of genetic novelty, germline mutations provide the raw material on which selection acts and the basis for genetic drift over time. Somatic mutations, on the other hand, cannot be transmitted to the next generation, but they are the basis of the development of cancer and may also be a significant factor in aging. We are interested in inter-individual and inter-specific variation in rates of both germline and somatic mutation. We previously developed a method to study genetic variation in germline mutation rates. Our method makes use of haplotype data and is based on a characteristic pattern of haplotype divergence expected to occur in the context of a mutator allele (an allele or genetic variant that increases the rate of germline mutation). This pattern consists of a number of haplotypes with a peak in the number of derived (i.e. non-ancestral) alleles against a background in which other haplotypes in the population have typical numbers of derived alleles. The results of a simulation that illustrates this pattern of haplotype divergence are shown (to the right). We found that the genomic loci at which these peaks occur in humans are enriched for genes involved in DNA replication and repair. The paper reporting these results is freely available as Seoighe and Scally, PLoS Genetics, 2017. We are currently extending our analyses of germline mutation rate variation and developing methods to investigate somatic mutation rate variation as part of a research project funded by Science Foundation Ireland.
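The pattern described above can be illustrated with a toy sketch. It assumes haplotypes within a genomic window are encoded as strings of '0' (ancestral) and '1' (derived), and it flags haplotypes whose derived allele count exceeds a multiple of the median. The function names and the fold threshold are hypothetical and this is not the statistic used in the published method.

```python
from statistics import median


def derived_counts(haplotypes):
    """Count derived ('1') alleles on each haplotype in a window."""
    return [h.count("1") for h in haplotypes]


def flag_excess(haplotypes, fold=2.0):
    """Return indices of haplotypes with an excess of derived alleles.

    A haplotype is flagged when its derived allele count is more than
    `fold` times the median count across all haplotypes (toy criterion).
    """
    counts = derived_counts(haplotypes)
    typical = median(counts)
    return [i for i, c in enumerate(counts) if c > fold * typical]
```

For example, flag_excess(["0100", "0010", "1111", "0001"]) picks out the third haplotype, which carries four derived alleles against a background of one.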
LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins. Nucleic Acids Research, 2015 (pdf)
The enormous diversity of T and B cell receptors is generated through recombination of diverse V, (D) and J gene segments, together with somatic mutation processes. LymAnalyzer is a free specialized tool for accurate and rapid mapping of sequencing reads to immune gene segments and alleles in a reference database. It includes extraction of the Complementarity Determining Region 3 (CDR3) and clustering of related clones. In addition to mapping to known immune gene alleles, the tool can infer novel alleles that are absent from the reference database. We are interested in hearing from research groups interested in using this tool or collaborating to customize or develop similar tools for related problems (contact information below).
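To give a flavour of what CDR3 extraction involves, the sketch below uses a common heuristic on a translated receptor sequence: locate the conserved [F/W]-G-X-G motif contributed by the J segment, then take the region from the last cysteine upstream of it to the first residue of the motif. This is an assumption-laden illustration and not LymAnalyzer's algorithm; the function name is hypothetical.

```python
import re

# Conserved J-segment motif: Phe or Trp, Gly, any residue, Gly.
J_MOTIF = re.compile(r"[FW]G.G")


def extract_cdr3(protein):
    """Return the CDR3 (conserved Cys through J-motif F/W), or None."""
    match = J_MOTIF.search(protein)
    if match is None:
        return None
    # Last cysteine before the motif, taken to be the conserved V-segment Cys.
    cys = protein.rfind("C", 0, match.start())
    if cys == -1:
        return None
    return protein[cys:match.start() + 1]
```

A production tool would instead anchor these positions using alignments to reference V and J gene segments, since a motif search alone can be misled by sequencing errors or atypical receptors.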
Promiscuous mRNA splicing under the control of AIRE in medullary thymic epithelial cells. Bioinformatics, 2015 (pdf)
Medullary thymic epithelial cells (mTECs) play a crucial role in the development of self-tolerance (i.e. the body’s ability to recognize and not mount an immune response against its own proteins). It has been known for a long time that there are mechanisms in mTECs to ensure that certain genes that are normally only expressed in very specific tissues are expressed in mTECs. This allows T cells, which mature in the thymus, to be exposed to these ‘tissue-restricted antigens’ (TRAs) in the course of their training to distinguish self from non-self proteins. TRAs can also arise from tissue-specific alternative mRNA splicing (the processing of the RNA of a given gene to produce different proteins). Not much is known about how T cells are trained to avoid responding to these TRAs. In this paper we showed that splice isoform diversity is higher in medullary thymic epithelial cells than in any other tissue type examined and that this diversity of mRNA splicing is dependent on the AIRE gene, which plays a key role in the expression of tissue-restricted genes. This suggests that mechanisms exist to ensure that T cells are exposed to diverse splice isoforms.
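One simple way to quantify splice isoform diversity for a gene is the Shannon entropy of its isoform proportions. The sketch below is offered only as an illustration of the concept; it assumes per-isoform read counts as input and is not necessarily the measure used in the paper.

```python
from math import log2


def isoform_entropy(counts):
    """Shannon entropy (in bits) of isoform usage for one gene.

    counts: read counts (or abundances) for each isoform of the gene.
    Returns 0.0 when a single isoform accounts for all expression.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)
```

A gene expressed as a single isoform scores 0 bits, whereas a gene with four equally abundant isoforms scores 2 bits, so higher values correspond to more diverse splicing.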
Identification of broadly neutralizing antibody epitopes in HIV-1 env. Virology Journal (2013) (pdf)
People infected with HIV generally produce antibodies that are capable of neutralizing the virus, yet the immune system ultimately loses the battle to control the infection. The reason is that HIV evolves within the infected individual to evade almost all immune responses that are mounted against it.
This results in a situation in which plasma from an HIV infected patient can neutralize virus from an earlier time point of infection, but not the viruses obtained at the same time point as the plasma and, generally, not the broad diversity of viruses that are found across the whole HIV pandemic (HIV viruses are incredibly diverse as a result of the virus’ rapid rate of evolution). However, some HIV-infected individuals produce broadly neutralizing antibodies that are capable of neutralizing most viruses. These antibodies are of great interest because if a vaccine can be designed that causes them to be produced it may be effective against the diversity of viruses that an at-risk individual may encounter. The graph depicted here shows data from 7 individuals who produced broadly neutralizing antibodies, with the effectiveness of their antibodies against a broad range of viruses (depicted in the phylogenetic tree) illustrated as a heatmap (graded yellow to red according to effectiveness). We developed evolutionary models to identify the sites in the virus at which the pattern of evolution over the phylogenetic tree tracks the changes in virus neutralization (i.e. tracks the heatmap data for a given patient). We also developed a model (not illustrated here) that can identify collections of sites that are close by in the three-dimensional viral protein structure that show this behaviour and used this to identify candidate conformational epitopes that are targeted by broadly neutralizing antibodies in these patients.
Gene expression deconvolution using CellMix. (pdf)
Renaud Gaujoux, a former PhD student, developed a software package, CellMix, that provides a general computational framework for implementing, developing and testing computational methods for gene expression deconvolution. Biological samples are almost always heterogeneous, consisting of different types of cells that are mixed in varying proportions. The gene expression deconvolution problem consists of disentangling the effects of sample composition from intra-cellular variation in gene expression. Our software package, along with an earlier package (NMF), is now widely used for this. An example of the results of application of CellMix to deconvolve gene expression data from blood samples is shown.
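The simplest version of the deconvolution problem can be sketched as follows. Assume a sample is a mixture of just two cell types with known expression signatures a and b, so that the expression of gene g in the sample is x_g = p * a_g + (1 - p) * b_g for an unknown proportion p. The least-squares estimate of p then has a closed form. This is a toy example under those assumptions, with hypothetical function and argument names; it does not use or reproduce the CellMix interface.

```python
def estimate_proportion(mixed, type_a, type_b):
    """Least-squares estimate of the proportion of cell type A in a sample.

    mixed:  expression of each gene in the heterogeneous sample
    type_a: reference expression of the same genes in pure cell type A
    type_b: reference expression of the same genes in pure cell type B
    """
    num = sum((x - b) * (a - b) for x, a, b in zip(mixed, type_a, type_b))
    den = sum((a - b) ** 2 for a, b in zip(type_a, type_b))
    if den == 0:
        raise ValueError("the two signatures are identical")
    # Clip to [0, 1] so the estimate is a valid proportion.
    return min(1.0, max(0.0, num / den))
```

With more than two cell types, or when the signatures themselves are unknown, the problem is typically tackled with constrained regression or matrix factorization approaches instead.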