Sequence analysis in molecular biology involves identifying the sequence of nucleotides in a nucleic acid, or of amino acids in a peptide or protein. Once a sample has been obtained, DNA sequences may be produced automatically by machine and the result displayed on a computer. Interpreting those results is still a task for humans.
Information from sequence analysis is used in many fields of biology. It shows how closely individual organisms, or groups of organisms, are related to one another.
DNA base-pair sequence
A DNA sequence is the sequence of nucleotides in a DNA molecule. It is written as a succession of letters representing the primary structure of a DNA molecule or strand. If functional, such a sequence carries information for the sequence of amino acids in a protein molecule. The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA strand — adenine, cytosine, guanine, thymine. The sequences are printed next to one another, without gaps, as in the sequence AAAGTCTGAC.
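Because a DNA sequence is written as a plain string of the letters A, C, G and T, it is easy to handle on a computer. The short Python sketch below is only an illustration: the function names `is_dna` and `base_counts` are made up for this example. It checks that a string uses only the four letters and counts how often each base occurs in the example sequence AAAGTCTGAC.

```python
# Minimal sketch: treat a DNA sequence as a string over the letters A, C, G, T.
# The function names here are illustrative, not part of any standard library.

VALID_BASES = set("ACGT")


def is_dna(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only A, C, G, T."""
    return bool(sequence) and set(sequence.upper()) <= VALID_BASES


def base_counts(sequence: str) -> dict:
    """Count how many times each of the four bases occurs."""
    seq = sequence.upper()
    return {base: seq.count(base) for base in "ACGT"}


seq = "AAAGTCTGAC"  # the example sequence from the text
print(is_dna(seq))       # True
print(base_counts(seq))  # {'A': 4, 'C': 2, 'G': 2, 'T': 2}
```

A string containing any other letter, such as "AAXG", would be rejected by `is_dna`.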
The study of RNA and proteins is more complex. The overall structure of DNA is simple and predictable (a double helix). The study of RNA and proteins must include a study of their three-dimensional structure, which is varied and influences how they work. To some extent this can be assisted by computer, but the result has to be verified in each case.
Information on sequences is kept in databases. Since fast methods for producing gene and protein sequences were developed during the 1990s, the rate at which new sequences are added to the databases has kept increasing.
Complete genome analysis has been done on over 800 species and strains. The work is done by a machine, the DNA sequencer, which analyses light signals from fluorochromes attached to the nucleotides. This type of work is gradually becoming less expensive.
- "There are currently more than 90 vertebrate species with whole genome sequences finished, in process, or in the advanced planning stages."
As of December 2012, whole genome analysis has been completed on about 800 to 900 living species and strains of species. Numbers are approximate, and changing.
- Animals: 111 species
- Plants: 53 species
- Fungi: 81 species
- Protists: 50 species
- Archaea: 139 species and strains
- Bacteria: about 400 to 500 species and strains
Human DNA sequence
The human genome is stored on 23 chromosome pairs in the cell nucleus and in the small mitochondrial DNA. A great deal is now known about the sequences of DNA which are on our chromosomes. What the DNA actually does is now partly known. Applying this knowledge in practice has only just begun.
The Human Genome Project (HGP) produced a reference sequence which is used worldwide in biology and medicine. In February 2001, Nature published the publicly funded project's report, and Science published the paper from Celera, a private company that had sequenced the genome independently. These papers described how the draft sequence was produced, and gave an analysis of the sequence. Improved drafts were announced in 2003 and 2005, filling in to ≈92% of the sequence.
A later project, ENCODE, studies the way the genes are controlled.
It is not necessary to have whole genome sequences for forensic work, such as identifying a criminal from traces of DNA left at a crime scene, or for paternity cases. At present whole genome sequencing is still very expensive, but fortunately, simpler and cheaper methods are available.
The basic idea is to look at certain loci (places) in the genome which are highly variable between people. About 10 to 15 of these loci are needed for a match, and the legal details differ between countries. A match between a sample and a suspect individual makes it extremely likely that the individual was the source of the sample. This evidence would then be the basis of the prosecution case for a crime. A similar analysis would show that a man was very likely the father of a child. This is really a modern way to do what was done with blood groups before DNA details could be analysed.
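The matching idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the locus names (`locus_1`, `locus_2`, ...) and allele numbers are invented, the function name `profiles_match` is hypothetical, and real forensic work also estimates how likely a match is by chance. The sketch simply reports a match when enough loci were compared and both alleles agree at every one of them.

```python
# Minimal sketch of comparing two DNA profiles locus by locus.
# All locus names and allele values below are invented for illustration.


def genotype(alleles):
    """Return the pair of alleles in sorted order, so (12, 10) equals (10, 12)."""
    return tuple(sorted(alleles))


def profiles_match(sample, suspect, min_loci=10):
    """True if at least min_loci loci are shared and all of them agree."""
    shared = set(sample) & set(suspect)
    if len(shared) < min_loci:
        return False  # too few loci compared to call a match
    return all(genotype(sample[l]) == genotype(suspect[l]) for l in shared)


# Hypothetical crime-scene sample typed at 12 loci.
sample = {f"locus_{i}": (10 + i, 12 + i) for i in range(1, 13)}

# Suspect A has the same alleles (written in the other order).
suspect_a = {locus: (b, a) for locus, (a, b) in sample.items()}

# Suspect B differs at a single locus, which is enough to rule out a match.
suspect_b = dict(sample, locus_3=(13, 16))

print(profiles_match(sample, suspect_a))  # True
print(profiles_match(sample, suspect_b))  # False
```

The `min_loci` parameter reflects the point that around 10 to 15 loci are needed; the exact number used in practice depends on the country's rules.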
Each person’s DNA contains two alleles of a particular gene or 'marker': one from the father and one from the mother. 'Markers' are genes chosen for having a number of different alleles occurring frequently in the population. The following table is from a commercial DNA paternity testing experiment. It shows how relatedness between parents and child is demonstrated with five markers:
| DNA marker | Mother | Child | Alleged father |
|---|---|---|---|
| D21S11 | 28, 30 | 28, 31 | 29, 31 |
| D7S820 | 9, 10 | 10, 11 | 11, 12 |
| TH01 | 14, 15 | 14, 16 | 15, 16 |
| D13S317 | 7, 8 | 7, 9 | 8, 9 |
| D19S433 | 14, 16.2 | 14, 15 | 15, 17 |
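The results show that, at each of these five markers, the child shares one allele with the mother and the other allele with the alleged father. For example, at D21S11 the child's 28 is found in the mother and the child's 31 is found in the alleged father. The complete test results showed this correlation on 16 markers between the child and the tested man. If a case is tested in court, a forensic scientist would give evidence on the likelihood of getting that result by chance.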
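The check applied to the table can be written out as a short Python sketch. The allele values are copied from the table; the function names `consistent_with_parents` and `check_all` are invented for this illustration. The sketch only tests whether the child's two alleles can be split between the two adults, and does not calculate the probability of a chance match that a forensic scientist would report.

```python
# Minimal sketch: check each marker in the paternity table.
# Each entry is (mother, child, alleged father), copied from the table.
PROFILES = {
    "D21S11": ((28, 30), (28, 31), (29, 31)),
    "D7S820": ((9, 10), (10, 11), (11, 12)),
    "TH01": ((14, 15), (14, 16), (15, 16)),
    "D13S317": ((7, 8), (7, 9), (8, 9)),
    "D19S433": ((14, 16.2), (14, 15), (15, 17)),
}


def consistent_with_parents(mother, child, father):
    """True if one child allele is found in the mother and the other in the father."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def check_all(profiles):
    """Return a dict mapping each marker name to True (consistent) or False."""
    return {marker: consistent_with_parents(*trio)
            for marker, trio in profiles.items()}


for marker, ok in check_all(PROFILES).items():
    print(marker, "consistent" if ok else "excluded")
```

For the five markers in the table every line prints "consistent". If the alleged father's alleles at D21S11 were, say, 29 and 30, that marker would print "excluded", because neither adult could have supplied the child's 31.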
DNA testing in the US
There are state laws on DNA profiling in all 50 states of the United States. Detailed information on database laws in each state can be found at the National Conference of State Legislatures website.
Ancient DNA has been recovered from some sources. The record for survival of DNA suitable for sequence analysis is 700,000 years. A horse skeleton buried in permafrost has provided bones with some DNA surviving. The sequence was only 70% complete, but it was enough for researchers to say "It would not look like a horse as we know it… but we would expect it to be a one-toed horse". For comparison, researchers had access to DNA sequences of modern horses, donkeys and Przewalski's horse.
- George Church
- Walter Gilbert
- John Sulston
- Fred Sanger
- ENCODE: the complete analysis of the human genome
- Human genome
- Complete Genomics
Sequence analysis Facts for Kids. Kiddle Encyclopedia.