(STATISTICAL) PATTERNS IN THE HUMAN GENOME NUCLEOTIDE SEQUENCE
At its simplest level, the Human Genome Nucleotide Sequence is a (long!) sequence over
four symbols: G (Guanine), A (Adenine), C (Cytosine) and T (Thymine). This sequence lends
itself to many different analyses, not all of which have been explored:
- Simple statistical features, e.g. the relative frequency of each letter, the relative
frequency of occurrence of different short subsequences, etc.
- Markovian distribution analyses, essentially the probabilities of a given letter
being followed by another letter
- Z4 analysis, i.e. mathematical analysis based upon identifying A,C,T,G
with the digits 0,1,2,3
- Calculation of the Kolmogorov complexity of different (gene) sequences, i.e. (in a
loose sense) the degree of randomness in the sequence.
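The analyses listed above can be sketched in a few lines of Python. This is only an illustrative starting point, not a prescribed method: the function names are hypothetical, the input is assumed to be a plain upper-case string over A, C, G, T, and the Z4 mapping simply follows the order given above (A=0, C=1, T=2, G=3). Kolmogorov complexity itself is not computable, so the last function swaps in a zlib compression ratio as a rough stand-in (a compressed length only gives an upper-bound flavour of the true complexity).

```python
import zlib
from collections import Counter

BASES = "ACGT"

def base_frequencies(seq):
    """Relative frequency of each of A, C, G, T (other symbols are ignored)."""
    counts = Counter(seq)
    total = sum(counts[b] for b in BASES)
    return {b: (counts[b] / total if total else 0.0) for b in BASES}

def kmer_counts(seq, k):
    """Counts of all overlapping subsequences of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def transition_probabilities(seq):
    """First-order Markov estimate of P(next letter = b | current letter = a)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for a in BASES:
        row_total = sum(pair_counts[(a, b)] for b in BASES)
        probs[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in BASES
        }
    return probs

# Assumed identification of letters with the digits 0, 1, 2, 3.
Z4 = {"A": 0, "C": 1, "T": 2, "G": 3}

def to_z4(seq):
    """Translate a sequence into a list of digits 0..3."""
    return [Z4[b] for b in seq]

def compression_ratio(seq):
    """Compressed size / raw size: a crude proxy for randomness, NOT the
    Kolmogorov complexity itself. Lower values mean more regular input."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

if __name__ == "__main__":
    toy = "ACGTACGTAA"  # toy input, not real genome data
    print(base_frequencies(toy))
    print(kmer_counts(toy, 2).most_common(3))
    print(transition_probabilities(toy)["A"])
    print(to_z4(toy))
    print(compression_ratio("ACGT" * 1000))
```

Real sequence files usually contain header lines, lower-case letters or other symbols such as N, so some cleaning step would be needed before these functions are applied to actual data.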
Based upon the data at http://www.ebi.ac.uk/genomes/mot/index.html (see a snapshot at http://grobner.nuigalway.ie/g2.txt),
the project will involve carrying out some of these (or other) analyses, depending on the student's
particular interests. Please contact michael.mcgettrick@nuigalway.ie or drop by room 437 for more information.