Biology 391 (Organic Evolution) Lecture: Molecular Evolution

NOTE: These are lecture notes for Biology 391, Organic Evolution, at The University of Tennessee at Martin.  Anyone outside of UT Martin wishing to use these notes or to contact me for additional information should first read the information obtained by clicking here.

Goals:  Learn basic properties of evolution as studied at the DNA level and how these have been applied to studying phylogenies and evolutionary processes

Related Textbook Material: Freeman and Herron (2001) Chapters 4, 13, and 18

Lab Manual Questions over this material are in Lab Manual Chapters XII


The Lecture:

Three main ways  DNA information has been used to study evolution:

  1. Phylogenetic analysis
  2. Study of traditionally studied evolutionary processes at the molecular level (adaptation, convergent evolution, key innovations, speciation, evolution and development, etc.)
  3. Studying the evolution of various aspects of the DNA itself (base pair ratios, mutation rates, transposable elements, proportions of non-coding DNA, introns, etc.)
This lecture will focus on the first two of these.

Phylogenetic analysis based on DNA information:

 One can study phylogeny by comparing base-pair sequences for homologous genes in different species, and using each base pair site as the character and the actual base pair as the character state.  Ex:  if the following represent DNA sequence information for the same gene in an outgroup and four ingroup species:

Outgroup:    C C T T A G A A C C A T
Species A    C C C T A A A C C C A G
Species B    C C C T A G A A C C A G
Species C    C C T T A G G A C C A G
Species D    C C T T A G A A C C A T

By outgroup comparison, the presence of C at the third site and the presence of G at the last site are both phylogenetically informative derived character states.
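
The outgroup comparison above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and "informative" is taken here to mean a derived base shared by at least two, but not all, ingroup species.

```python
# Sequences copied from the example above.
outgroup = "CCTTAGAACCAT"
ingroup = {
    "A": "CCCTAAACCCAG",
    "B": "CCCTAGAACCAG",
    "C": "CCTTAGGACCAG",
    "D": "CCTTAGAACCAT",
}

def derived_states(outgroup, ingroup):
    """For each site, map every derived base to the ingroup species carrying it."""
    result = {}
    for i, ancestral in enumerate(outgroup):
        carriers = {}
        for name, seq in ingroup.items():
            if seq[i] != ancestral:  # differs from the outgroup: treated as derived
                carriers.setdefault(seq[i], []).append(name)
        if carriers:
            result[i + 1] = carriers  # 1-based site number, as in the text
    return result

def informative_sites(outgroup, ingroup):
    """Keep only derived bases shared by at least two, but not all, ingroup species."""
    informative = {}
    for site, carriers in derived_states(outgroup, ingroup).items():
        shared = {base: names for base, names in carriers.items()
                  if 2 <= len(names) < len(ingroup)}
        if shared:
            informative[site] = shared
    return informative

print(informative_sites(outgroup, ingroup))
```

Sites 6, 7, and 8 also carry derived bases, but each is found in only one species, so they say nothing about which species group together.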

When DNA sequencing first became possible, many systematists (people who study phylogeny) thought DNA sequence data would be clearly superior to phenotypic characters for phylogenetic reconstruction.  DNA sequence is clearly genetic, while the phenotype could be influenced by the environment and not reflect ancestry.  When the same genes are compared in different species, we know we're looking at homologous traits, but phenotypic traits that appear the same may be coded by different genes and inherited differently.  Further, there was reason to expect (we'll go into these reasons below) that much of the DNA was neutral, meaning that differences do not affect fitness, so it is not subject to natural selection.  Neutrality was expected to be an advantage for phylogenetic reconstruction: since neutral traits are not subject to natural selection, natural selection can't cause them to show convergent evolution.
 
It turned out that DNA sequences are NOT really perfect for phylogenetic reconstruction -- they also have problems.  We'll discuss several aspects of the use of DNA sequences in phylogenetic reconstruction, potential problems with this use of DNA sequence data, and what people do about them.

There can be problems identifying homologous base pair sites in genes because of indels: mutations that insert or delete one or more base pairs.  As a result, the same gene in different species may have different lengths and may be difficult to align, where alignment refers to determining which base pair sites are homologous.  If we just start at the beginning of a gene and compare species at each site, this will be accurate until we reach the site of an indel.  After that point, the sites compared will not be homologous because they will be shifted up or down in one of the species and not the others.  Alignment can be accomplished by comparing genes and finding the areas where most sites are the same across species (this can be done by visual inspection of printed-out sequences or by computer programs designed for this; either way, there can be errors).
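
A small Python sketch of why an indel throws off a site-by-site comparison. The sequences are hypothetical (the second is the first with its 4th base deleted), and the gap character "-" is placed by hand here rather than by a real alignment program.

```python
def count_mismatches(seq1, seq2):
    """Count site-by-site differences, skipping sites where either sequence has a gap."""
    return sum(1 for a, b in zip(seq1, seq2) if a != b and "-" not in (a, b))

reference = "CCTTAGAACCAT"
deleted = "CCTAGAACCAT"    # hypothetical copy that lost the 4th base
aligned = "CCT-AGAACCAT"   # the same copy with a gap marking the deletion

# Without alignment, every site after the deletion is compared to the wrong site.
print(count_mismatches(reference, deleted))   # 6
# With the gap in place, the remaining sites line up and no differences remain.
print(count_mismatches(reference, aligned))   # 0
```

The six apparent differences in the first comparison are artifacts of the shift, not real substitutions.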
 
Now let's consider why many areas of DNA are expected to be neutral with respect to natural selection.  There are several aspects of DNA that are not expected to affect the phenotype, and therefore are not expected to affect fitness or to evolve through natural selection.  These include non-coding DNA regions outside of genes, such as pseudogenes, which are functionless DNA sequences that are produced when a gene is partially duplicated but missing the parts needed to initiate transcription, so no proteins are made based on this DNA information.  Much of the sequence within introns (parts of DNA that code for parts of the mRNA that are spliced out before the mRNA is used as a basis for making proteins) is also apparently neutral -- it doesn't code for the protein.  Even within parts of DNA that do code for proteins, there are many sites where there can be silent substitutions: mutations, usually in 3rd codon positions, that do not change the amino acid coded.
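
To see why silent substitutions are concentrated in 3rd codon positions, one can tabulate every possible single-base change under the standard genetic code. The sketch below is an illustration only: the function names are invented for this example, codons are written as DNA (T rather than U), and stop codons are counted along with the rest for simplicity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def is_silent(codon_before, codon_after):
    """True if the substitution leaves the coded amino acid unchanged."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]

def silent_fraction(position):
    """Fraction of all single-base changes at a codon position (0, 1, or 2) that are silent."""
    silent = total = 0
    for codon in CODON_TABLE:
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            silent += is_silent(codon, mutant)
    return silent / total

for position in range(3):
    print(f"codon position {position + 1}: {silent_fraction(position):.2f} of changes are silent")
```

For example, GGT, GGC, GGA, and GGG all code for glycine, so any change in the 3rd position of a glycine codon is silent, while GAT to GAA changes aspartic acid to glutamic acid.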
 
For areas that are neutral, natural selection won't cause convergent evolution, but it turns out that convergent evolution in DNA can occur by chance.  Such chance convergence occurs because DNA has only four bases (A, T, C, G), so at any site, there are only four possible character states.  It is therefore not unlikely that mutations occurring independently, in different species, at the same site will lead to the presence of the same base.
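
A rough sense of "not unlikely" can be had by enumeration. The sketch below assumes, purely for illustration, that a mutation is equally likely to produce any of the three other bases; real mutation patterns are not this even.

```python
from fractions import Fraction
from itertools import product

BASES = "ATCG"

def chance_of_same_base(ancestral):
    """Probability that two independent mutations at one site produce the same new base."""
    options = [b for b in BASES if b != ancestral]
    outcomes = list(product(options, repeat=2))   # (base in species 1, base in species 2)
    matches = sum(1 for a, b in outcomes if a == b)
    return Fraction(matches, len(outcomes))

print(chance_of_same_base("A"))   # 1/3
```

Under this assumption, one time in three that the same site mutates in two species, the two species end up with the same derived base by chance alone.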
 
The probability that base pairs at sites are the same because of chance convergence increases with time since speciation.  Initial mutations will make species different from each other, but subsequently, mutations at these sites may make species evolve independently to be the same by chance.  The following figure shows the expected relationship between DNA differences and time since speciation:
 
[Figure: number of DNA differences between species plotted against time since speciation.  The curve rises at first and then flattens out.]
 
During the time period when the graph is increasing, DNA evolution leads to differences among species, and similarity among species will reflect relationships, so DNA base sequences are phylogenetically useful.  Once the graph flattens out, a process termed saturation, the DNA differences are no longer phylogenetically useful because once that time is reached, species are evolving to be similar to each other through chance convergence as much as they are evolving differences, so similarity is just as likely to reflect convergence as it is to reflect ancestry.
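
The shape of this curve can be sketched with a simple formula. The notes do not name a model; the sketch below assumes the Jukes-Cantor model (every base equally likely to change to any other), under which the expected proportion of differing sites is 3/4 x (1 - e^(-4d/3)), where d is the average number of substitutions per site that have occurred between the two species.

```python
import math

def expected_difference(d):
    """Expected proportion of differing sites after d substitutions per site (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"{d:4.1f} substitutions per site -> {expected_difference(d):.3f} of sites differ")
```

Early on, observed differences track the actual number of substitutions closely. Later the curve levels off near 0.75, the proportion of sites expected to differ between two unrelated random sequences of four bases, and additional substitutions no longer add differences on average.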
 
The time it takes for DNA to become saturated (reach the point of saturation) depends on the rate of DNA evolution -- the faster DNA evolves, the sooner it reaches saturation.  The rate of DNA evolution depends on whether it is neutral:  compared to non-neutral DNA, neutral DNA should evolve rapidly since there is no selection against any mutation in neutral DNA (but there will be selection against many of the mutations that affect non-neutral DNA).  Evidence that neutral DNA evolves faster comes from comparisons of DNA in different species to see how much species differ in neutral vs. non-neutral stretches of DNA.  Pseudogenes, which are neutral, have some of the highest numbers of differences among species, showing rapid evolution in neutral DNA.  Similarly, for the same gene, rates of silent substitution are much higher than rates of substitutions that change the amino acid coded ("replacement substitutions").  So neutral DNA will become saturated faster than will non-neutral DNA.
 
Since different stretches of DNA evolve at different rates, it is important to choose the appropriate stretch of DNA for a phylogenetic study, to avoid possible problems.  The DNA used must have evolved fast enough that species will differ from each other but not so fast that it is saturated.  If we're studying close relatives (ex: species in the same genus), we need rapidly evolving DNA or it will not show differences among species.  Close relatives have not been evolving separately for long, so saturation is less likely to be a serious problem than it would be for more distant relatives; for close relatives, non-coding DNA is therefore appropriate.  For moderately related species, such as families within an order, protein coding genes are more likely to be appropriate: they evolve fast enough that species will vary (the rate will depend on what protein is coded; some are faster than others) but not fast enough to show much saturation within the time that moderate relatives have been separated, although silent substitutions may need to be tested for evidence of saturation.  For distant relatives (such as classes in a phylum, or phyla in a kingdom), many protein coding genes would be saturated and something slower is needed.  DNA that codes for those parts of the ribosomal RNA that are crucial to ribosome function evolves slowly, since mutants would likely impair protein synthesis and be very harmful as a result, which makes it suitable for comparing distant relatives.
 
People have started finding and using other aspects of DNA in phylogenetic analysis.  One aspect of DNA that is particularly promising for phylogenetic reconstruction is the fact that it may get small sequences of DNA inserted into it.  Parasitic DNA sequences called SINEs (short interspersed elements) and LINEs (long interspersed elements) can insert themselves into genes.  Such insertions are rare, so it is very unlikely that the same SINE or LINE would insert independently into exactly the same gene.  As a result, we can use insertions of SINEs and LINEs into specific genes as phylogenetic characters and expect very low convergent evolution.  Further, while reversals -- losses of SINEs and LINEs -- do sometimes occur, they usually are identifiable because they lead to loss of more of the DNA sequence than just the SINE or LINE.  As a result, such characters should have very low homoplasy.  This prediction of low homoplasy is supported by a study (discussed in your textbook) of the relationships of whales to the hoofed mammals: in a study of presence/absence of SINEs and LINEs in 20 genes, the result was a consistency index (CI) of 1 (no homoplasy at all!).
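
To make the CI figure concrete, here is a minimal Python sketch that scores presence/absence characters on a tree. The tree, the taxon names, and the character data are all invented for illustration (they are not the whale data), and the counting uses Fitch parsimony, which is one standard way to count changes and is an assumption of this example rather than something specified in these notes.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one character."""
    if isinstance(tree, str):                  # a tip: its observed state, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:                                 # the two branches agree: no extra change
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def consistency_index(tree, characters):
    """CI = (minimum possible changes) / (changes actually required on this tree)."""
    minimum = required = 0
    for states in characters:
        n_states = len(set(states.values()))
        if n_states < 2:
            continue                           # invariant characters are skipped
        minimum += n_states - 1
        required += fitch_steps(tree, states)[1]
    return minimum / required

# Hypothetical tree and insertions (1 = insertion present, 0 = absent).
tree = ("Out", ("A", ("B", "C")))
insertion_1 = {"Out": 0, "A": 0, "B": 1, "C": 1}
insertion_2 = {"Out": 0, "A": 1, "B": 1, "C": 1}
conflicting = {"Out": 0, "A": 1, "B": 0, "C": 1}   # needs two changes on this tree

print(consistency_index(tree, [insertion_1, insertion_2]))               # 1.0
print(consistency_index(tree, [insertion_1, insertion_2, conflicting]))  # 0.75
```

A CI of 1 means every character changed exactly once on the tree; any convergence or reversal pushes the value below 1.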
 
We have seen that different areas of DNA evolve at different rates.  Now we can consider whether the same gene should evolve at the same rate in different species.  An argument called the molecular clock hypothesis proposes that it should.  The molecular clock hypothesis states that most mutations that become fixed are neutral, since beneficial mutations are rare and natural selection quickly rids populations of harmful mutations.  The same genes in different species should have approximately the same size and same proportion of neutral DNA.  Neutral DNA is predicted to evolve at a constant rate per generation equal to the mutation rate, so the same gene in different species should evolve at the same rate with respect to generation time.
 
Neutral DNA is predicted to evolve at a constant rate equal to the mutation rate based on the following algebraic argument:  In a diploid population with N individuals, there are 2N copies of each gene (two per individual, N individuals).  Eventually, through drift, all copies in the population will be descendants of one of those 2N copies.  Since they're neutral, all have the same chance of being the ancestor of the one that's eventually fixed, so the chance of any one of the 2N copies in the population being the ancestor of the one fixed is 1/(2N).  Now suppose the mutation rate is u per gene copy per generation.  The expected number of new mutants arising in the population each generation = (mutation rate) x (number of gene copies) = (u)(2N).  Putting these steps together, the rate at which new mutants arise and go on to be fixed = (number of new mutants arising per generation) x (chance of any one copy being fixed).  We've calculated the number of new mutants per generation to be (u)(2N) and the chance of a copy being fixed to be 1/(2N), so, plugging these into the equation we just wrote, we get:
 
rate of fixation of new neutral mutants = (u)(2N)(1/(2N)) = u per generation
 
Notice that N cancels out: the rate does not depend on population size.
 
So the rate of fixation of neutral alleles just equals the mutation rate.  Mutation is a random process, and as long as the underlying mutation rate stays the same, random processes occur, on average over time, at a constant rate.
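
The 1/(2N) step in this argument can be checked with a small simulation. The sketch below assumes a simple Wright-Fisher style model of drift (each gene copy in the next generation is drawn at random from the current generation); the function names, the population size, and the mutation rate are chosen only for illustration.

```python
import random

def neutral_mutant_fixes(n_individuals, rng):
    """Follow one new neutral mutant copy until it is lost (False) or fixed (True)."""
    copies = 2 * n_individuals
    count = 1                                   # a single new mutant copy
    while 0 < count < copies:
        freq = count / copies
        # Each copy in the next generation is drawn at random from the current one.
        count = sum(rng.random() < freq for _ in range(copies))
    return count == copies

def estimate_fixation_probability(n_individuals, trials, seed=1):
    rng = random.Random(seed)
    fixed = sum(neutral_mutant_fixes(n_individuals, rng) for _ in range(trials))
    return fixed / trials

def substitution_rate(u, n_individuals):
    """(new mutants per generation) x (chance each is fixed); N cancels out."""
    new_mutants = u * 2 * n_individuals
    chance_fixed = 1 / (2 * n_individuals)
    return new_mutants * chance_fixed

N = 10
estimate = estimate_fixation_probability(N, trials=20000)
print(f"simulated chance of fixation: {estimate:.3f}   predicted 1/(2N): {1 / (2 * N):.3f}")
print(substitution_rate(1e-6, 10), substitution_rate(1e-6, 100000))
```

The simulated value should land close to 1/(2N), and the last line shows that the predicted rate is (up to rounding) the same for a population of 10 as for a population of 100,000.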
 
If the molecular clock hypothesis is true, the number of base pair differences between species will directly reflect phylogeny without consideration of what is primitive and what is derived.  Since new derived traits evolve at a constant rate, we won't have the situation where distant relatives are similar just because they retain higher than usual numbers of primitive states.  This would be useful for phylogenetic analysis, especially since finding an appropriate outgroup can be problematic -- sometimes the necessary previous studies to find out what species would make good outgroups have not been done, and sometimes appropriate outgroups have all gone extinct, leaving distant relatives as the only possible choices for outgroups.
 
If the molecular clock hypothesis is true, to put a time scale on a phylogeny we need to know the rate at which a particular stretch of DNA (for example, a
particular gene) evolves. To do this, we figure out how many differences among species in this gene have evolved in a certain length of time. To find the length
of time, we need to use fossils. Here are the basic steps for determining the rate of evolution of a gene:

     1. Find at least two modern species for which the date of speciation can be determined from the fossil record. What we need are two species with a good
        fossil record so that we can find fossils from the time that we can tell, from the fossil record, they were speciating. We then date these fossils using
        methods of radioactive dating that use the constant rate of decay of radioactive elements to put dates on fossils. This tells us how long it has been
        since these two species evolved into separate species -- the time since speciation.
     2. Use molecular techniques to determine the DNA sequence of the same gene in each of our two modern species. Determine the number of differences
        in DNA sequence between these two genes.
     3. We know that the differences in DNA sequence that we just counted in step 2 must have evolved since the two species speciated and we know the
        time since they speciated from step 1. So we now determine the rate of DNA evolution for the gene we're studying as:
        rate = (# DNA differences between the two modern species) / (time since speciation)
     4. Now we can use this rate to determine the dates of speciation for other species, for which we cannot determine a date of speciation from fossils.
        Remember that we are assuming that the rate of DNA evolution for this gene (which we calculated in step 3) is constant. So now we can take any two
        modern species and count the number of DNA differences in this gene between them. We then take the number of DNA differences between them
        and divide by the rate of DNA evolution (from step 3) and that tells us the time since speciation.
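
Steps 3 and 4 amount to two divisions, sketched here in Python. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def calibrate_rate(differences, years_since_speciation):
    """Step 3: DNA differences between two species per year since they speciated."""
    return differences / years_since_speciation

def estimate_time(differences, rate):
    """Step 4: time since speciation for a pair of species without a usable fossil date."""
    return differences / rate

# Hypothetical calibration pair: 20 differences, fossils dated to 10 million years ago.
rate = calibrate_rate(differences=20, years_since_speciation=10_000_000)

# Hypothetical second pair with 30 differences in the same gene.
years = estimate_time(differences=30, rate=rate)
print(f"estimated time since speciation: about {years / 1e6:.0f} million years")
```

With half again as many differences as the calibration pair, the second pair is estimated to have separated half again as long ago, which is only valid if the rate really is constant.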

The method just described only works if the molecular clock hypothesis is true. The molecular clock hypothesis is, as the name implies, a hypothesis --
we do NOT know that it is true, although we have some theoretical reason (as discussed above) to expect that it may be true for DNA that is neutral with
respect to selection. So now we need to consider how to test the molecular clock hypothesis. We can test the molecular clock hypothesis using something
called the relative rates test which we do as follows:

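     1. We determine the phylogeny for a group of species. Suppose, for example, that we determine the phylogeny of four species, A, B, C, and D, to be the
        following.

        [Figure: phylogeny in which A branches off first, B branches off next, and C and D are each other's closest relatives, i.e. (A,(B,(C,D))).]
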
     2. If DNA is evolving at a constant rate, pairs of species that have been evolving separately from each other for the same amount of time should have the
       same number of DNA differences between them. Any time we can compare an outgroup to several ingroup species, the outgroup is equally related to
       all the ingroup species, so each ingroup species has been evolving separately from the outgroup for the same amount of time. So the number of
       differences between each ingroup species and the outgroup should be the same for all ingroup species if the molecular clock hypothesis is true. On
       the phylogeny above, A is the outgroup to B, C, and D. This means that the number of differences between A and B should be the same as the number
       of differences between A and C and the number of differences between A and D. Also on the phylogeny above, B is the outgroup to C and D (that is,
       B is equally related to C and D.) As a result, B and C should have the same number of DNA differences between them as do B and D.

If we find, for our phylogeny, that the numbers of DNA differences between pairs of equally related species are in fact the same, then the molecular clock
hypothesis is supported and we are justified in using it to find dates on our phylogeny. If not, then the molecular clock hypothesis is not supported, and we
cannot find dates on our phylogeny (unless we know them from a good fossil record).
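
The comparison at the heart of the relative rates test can be sketched as follows. The sequences are invented for illustration, and the `tolerance` argument is a stand-in assumption: real studies use a statistical test to decide whether the counts differ by more than chance would allow.

```python
def count_differences(seq1, seq2):
    """Number of aligned sites at which two sequences differ."""
    return sum(1 for a, b in zip(seq1, seq2) if a != b)

def outgroup_distances(outgroup_seq, ingroup_seqs):
    """Differences between the outgroup and each ingroup species."""
    return {name: count_differences(outgroup_seq, seq)
            for name, seq in ingroup_seqs.items()}

def clock_supported(distances, tolerance=0):
    """True if all outgroup-to-ingroup counts agree to within `tolerance` differences."""
    values = list(distances.values())
    return max(values) - min(values) <= tolerance

# Hypothetical sequences for the phylogeny above, with A as the outgroup to B, C, and D.
species_a = "AAAAAAAAAA"
ingroup = {"B": "TTTAAAAAAA", "C": "AAAGGGAAAA", "D": "AAAGGAAAAC"}

distances = outgroup_distances(species_a, ingroup)
print(distances, clock_supported(distances))

# The same comparison if D had evolved faster than B and C.
fast_d = dict(ingroup, D="AAAGGACCCC")
fast_distances = outgroup_distances(species_a, fast_d)
print(fast_distances, clock_supported(fast_distances))
```

The same functions can be reused one level down the tree, with B as the outgroup to C and D.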

Evidence from real groups that have been studied indicates that sometimes, for some species and some genes, the molecular clock hypothesis is supported -- it
passes the relative rates test. Other times, for other species and other genes, it is not supported. This suggests we cannot assume it is always true; we need to
test it using tests such as the relative rates test if we are to use it to study phylogeny and to put dates of speciation onto our phylogenies.
 
Using organelle DNA in phylogenetic analysis:  remember that DNA occurs in some organelles, mitochondria and chloroplasts, as well as in the nucleus.  Such DNA has often been used in phylogenetic analysis.  In vertebrates, mitochondrial DNA (mtDNA) has been used for many analyses because it is easier to study (for molecular reasons) than nuclear DNA.  In addition, it has some features that make it particularly suited for studies of the phylogenies of populations within species.  It has a high mutation rate and is thus relatively rapidly evolving (as with other DNA, neutral parts evolve fastest), so populations in species and individuals within populations have often evolved differences from each other that can be phylogenetically informative.  It is maternally inherited and non-recombining, so each haplotype (haploid genotype) reflects just one ancestral pathway.  As a result, it is possible to make phylogenies of the haplotypes of many individuals in a population to learn about processes that are occurring that may affect speciation.  For example, if haplotypes within populations turn out to be each other's closest relatives, we have evidence of low gene flow, suggesting populations are evolving independently and may be in the process of speciation.
 
A potential problem with mtDNA is something called lineage sorting bias, a process through which relationships among mtDNA haplotypes may come not to reflect relationships among species.  This can occur if an ancestral species is polymorphic -- that is, has several different mtDNA haplotypes.  These haplotypes will be related to each other in some way, having evolved to be different within the species.  As this ancestral species undergoes speciation, the haplotypes that are most closely related to each other may end up in species that are less closely related.
