Goals: Learn basic properties of evolution as studied at the DNA level and how these have been applied to studying phylogenies and evolutionary processes
Related Textbook Material: Freeman and Herron (2001) Chapters 4, 13, and 18
Lab Manual Questions over this material are in Lab Manual Chapters XII
The Lecture:
Three main ways DNA information has been used to study evolution:
Phylogenetic analysis based on DNA information:
One can study phylogeny by comparing base-pair sequences for homologous genes in different species, and using each base pair site as the character and the actual base pair as the character state. Ex: if the following represent DNA sequence information for the same gene in an outgroup and four ingroup species:
Outgroup:  C C T T A G A A C C A T
Species A: C C C T A A A C C C A G
Species B: C C C T A G A A C C A G
Species C: C C T T A G G A C C A G
Species D: C C T T A G A A C C A T
By outgroup comparison, the presence of C at the third site and the presence of G at the last site are both phylogenetically informative derived character states.
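The outgroup comparison can be sketched in a few lines of Python. This is only an illustration: the function name and the rule used here (a derived state counts as informative if it is shared by at least two, but not all, ingroup species) are assumptions made for the sketch, and the sequences are the ones listed above.

```python
# Sequences copied from the example above (spaces removed).
OUTGROUP = "CCTTAGAACCAT"
INGROUP = {
    "A": "CCCTAAACCCAG",
    "B": "CCCTAGAACCAG",
    "C": "CCTTAGGACCAG",
    "D": "CCTTAGAACCAT",
}

def informative_derived_sites(outgroup, ingroup):
    """Return a list of (site, derived_base, species) tuples for derived
    states shared by at least two, but not all, ingroup species.
    Sites are numbered from 1; the outgroup base is treated as primitive."""
    found = []
    for i, primitive in enumerate(outgroup):
        carriers = {}  # derived base -> ingroup species that carry it
        for name, seq in ingroup.items():
            if seq[i] != primitive:
                carriers.setdefault(seq[i], []).append(name)
        for base, names in sorted(carriers.items()):
            if 2 <= len(names) < len(ingroup):
                found.append((i + 1, base, sorted(names)))
    return found

sites = informative_derived_sites(OUTGROUP, INGROUP)
for site, base, names in sites:
    print(f"site {site}: derived {base} shared by species {', '.join(names)}")
```

Sites where only one species differs from the outgroup (sites 6, 7, and 8 in this example) are derived but say nothing about grouping, so the sketch leaves them out.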
When DNA sequencing was first possible, many systematists (people who
study phylogeny) thought DNA sequence data would be clearly superior to
phenotypic characters for phylogenetic reconstruction. DNA sequence
is clearly genetic, while the phenotype could be influenced by the environment
and not reflect ancestry. When the same genes are compared in different
species, we know we're looking at homologous traits, but phenotypic traits
that appear the same may be coded by different genes, inherited differently.
Further, there was reason to expect (we'll go into these reasons below)
that much of the DNA was neutral, meaning that differences do not affect
fitness, so it is not subject to natural selection. Neutrality was
expected to be an advantage for phylogenetic reconstruction: because
neutral traits are not subject to natural selection, natural selection
can't cause them to show convergent evolution.
It turned out that DNA sequences are NOT really perfect for phylogenetic
reconstruction -- they also have problems. We'll discuss several
aspects of the use of DNA sequences in phylogenetic reconstruction, potential
problems with this use of DNA sequence data, and what people do about them.
There can be problems identifying homologous base pair sites in genes
because of indels: mutations that insert or delete one or more base
pairs. As a result, the same gene may differ in length among
species and may be difficult to align, where alignment
refers to determination of which base pair sites are homologous.
If we just start at the beginning of a gene and compare species at each
site, this will be accurate until we reach the site of an indel.
After that point, the sites compared will not be homologous because they
will be shifted up or down in one of the species and not the others.
Alignment can be accomplished by comparing genes and finding the areas
where most sites are the same across species (this can be done by visual
inspection of printed out sequences or by computer programs designed for
this; either way, there can be errors).
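To see why an indel throws off a site-by-site comparison, here is a minimal Python sketch. The second sequence is hypothetical (the outgroup sequence from the earlier example with one base deleted), and the gap is placed by hand rather than by a real alignment program.

```python
def mismatches(seq1, seq2):
    """Count sites that differ when two sequences are compared position by
    position. A gap character '-' in either sequence is skipped."""
    return sum(
        1
        for a, b in zip(seq1, seq2)
        if a != "-" and b != "-" and a != b
    )

species1 = "CCTTAGAACCAT"
# Hypothetical species 2: the same gene with the 4th base deleted.
species2 = species1[:3] + species1[4:]

# Naive comparison: start at the beginning and compare site by site.
naive = mismatches(species1, species2)

# Aligned comparison: a gap marks the deleted site, so the sites after
# the indel line up with their homologs again.
species2_aligned = species2[:3] + "-" + species2[3:]
aligned = mismatches(species1, species2_aligned)

print("differences without alignment:", naive)
print("differences after alignment:  ", aligned)
```

Without the gap, every site after the deletion is compared to its neighbor instead of its homolog, so the two species look far more different than they really are.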
Now let's consider why many areas of DNA are expected to be neutral
with respect to natural selection. There are several aspects of DNA
that are not expected to affect the phenotype, and therefore not affect
fitness and not evolve through natural selection. These include non-coding
DNA regions outside of genes such as pseudogenes, which are functionless
DNA sequences that are produced when a gene is partially duplicated but
missing the parts needed to initiate transcription, so no proteins are
made based on this DNA information. Much of the sequence within introns
(parts of DNA that code for parts of the mRNA that are spliced out before
the mRNA is used as a basis for making proteins) is also apparently neutral
-- it doesn't code for the protein. Even within parts of DNA that
do code for proteins, there are many sites where there can be silent
substitutions: mutations, usually in 3rd codon positions, that do not
change the amino acid coded.
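The claim that silent substitutions are concentrated in 3rd codon positions can be checked directly against the standard genetic code. The Python sketch below is an illustration only: the function name is made up for this example, every single-base change is weighted equally (real mutations are not equally likely), and stop codons are left out as starting points.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons in the order TTT, TTC, TTA, TTG, TCT, ...
# One letter per amino acid; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def silent_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1, or 2)
    that leave the amino acid unchanged, over all non-stop codons."""
    silent = 0
    total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            silent += CODON_TABLE[mutant] == aa
    return silent / total

for pos in range(3):
    print(f"codon position {pos + 1}: {silent_fraction(pos):.2f} of changes are silent")
```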
For areas that are neutral, natural selection won't cause convergent
evolution, but it turns out that convergent evolution in DNA can occur
by chance. Such chance convergence occurs because DNA has only four
bases (A,T,C,G), so at any site, there are only four possible character
states. It is therefore not unlikely that mutations occurring independently,
in different species, at the same site will lead to the presence of the
same base.
The probability that base-pairs at sites are the same because of chance
convergence increases with time since speciation. Initial mutations
will make species different from each other, but subsequently, mutations
at these sites may make species evolve independently to be the same by
chance. The following figure shows the expected relationship between
DNA differences and time since speciation (differences increase with time
at first and then level off):
During the time period when the graph is increasing, DNA evolution
leads to differences among species, and similarity among species will reflect
relationships, so DNA base sequences are phylogenetically useful.
Once the graph flattens out, a process termed saturation, the DNA
differences are no longer phylogenetically useful because once that time
is reached, species are evolving to be similar to each other through chance
convergence as much as they are evolving differences, so similarity is just
as likely to reflect convergence as it is to reflect ancestry.
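One way to see the shape of this curve is with a simple model of DNA evolution. The Python sketch below uses the Jukes-Cantor model, which is an assumption brought in for illustration (it is not named in these notes): every base is equally likely to mutate to each of the other three. Under that model, the expected proportion of sites that differ between two species levels off at 3/4, because two unrelated sequences still match at about 1 site in 4 by chance.

```python
import math

def expected_difference(d):
    """Expected proportion of sites that differ between two sequences under
    the Jukes-Cantor model, where d is the expected number of substitutions
    per site that have occurred between them (d increases with time)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.05, 0.1, 0.5, 1.0, 2.0, 5.0):
    p = expected_difference(d)
    print(f"substitutions per site = {d:4.2f}   proportion of sites differing = {p:.3f}")
```

For small values of d the observed differences track the actual number of substitutions closely. For large values the curve is nearly flat, which is the saturation described above.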
The time it takes for DNA to become saturated (reach the point of saturation)
depends on the rate of DNA evolution -- the faster DNA evolves, the sooner
it reaches saturation. The rate of DNA evolution depends on whether
it is neutral: compared to non-neutral DNA, neutral DNA should evolve
rapidly since there is no selection against any mutation in neutral DNA
(but there will be selection against a lot of mutations that could affect
non-neutral DNA). Evidence that neutral DNA evolves faster comes
from comparisons of DNA in different species to see how much species differ
in neutral vs. non-neutral stretches of DNA. Pseudogenes, which are
neutral, have some of the highest numbers of differences among species,
showing rapid evolution in neutral DNA. Similarly, for the same gene,
silent substitution rates are much higher than are those that change the
amino acid coded ("replacement substitutions.") So neutral DNA will
become saturated faster than will non-neutral DNA.
Since different DNA stretches evolve at different rates, it is important
to choose the appropriate stretch of DNA for a phylogenetic study, to avoid
possible problems. The DNA used must have evolved fast enough so
that species will differ from each other but not so fast that it is saturated.
If we're studying close relatives (ex: species in the same genus), we need
rapidly evolving DNA or it will not show differences among species. Because
such species have not been evolving separately for long, saturation is less
likely to be a serious problem than it would be for more distant relatives,
so for close relatives non-coding DNA is appropriate. For moderately
related species, such as families within an order, protein coding genes
are more likely to be appropriate: they evolve fast enough that species
will vary (the rate depends on what protein is coded; some are faster
than others) but not so fast that they would show much saturation
within the time that moderate relatives have been separated, although silent
substitutions may need to be tested for evidence of saturation. For
distant relatives (such as classes in a phylum, or phyla in a kingdom),
many protein coding genes would be saturated and something slower is needed.
DNA that codes for those parts of the ribosomal RNA that are crucial to
ribosome function evolves slowly, since mutations there would likely impair
protein synthesis and be very harmful as a result, so it is suited to
such comparisons.
People have started finding and using other aspects of DNA in phylogenetic
analysis. One aspect of DNA that is particularly promising for phylogenetic
reconstruction is the fact that it may get small sequences of DNA inserted
into it. Parasitic DNA sequences called SINEs (short interspersed
elements) and LINEs (long interspersed elements) can insert themselves
into genes. Such insertions are rare, so it is very unlikely that
the same SINE or LINE would insert independently into exactly the same
gene. As a result, we can use insertions of SINEs and LINEs into
specific genes as phylogenetic characters and expect very low convergent
evolution. Further, while reversals -- losses of SINEs and LINEs
-- do sometimes occur, they usually are identifiable because they lead
to loss of more of the DNA sequence than just the SINE or LINE. As
a result, such characters should have very low homoplasy. This prediction
of low homoplasy is supported by a study (discussed in your textbook) of
the relationships of whales to the hoofed mammals: in a study of presence/absence
of SINEs and LINEs in 20 genes, the result was a consistency index (CI) of 1
(no homoplasy at all!).
We have seen that different areas of DNA evolve at different rates.
Now we can consider whether the same gene should evolve at the same rate
in different species. An argument called the molecular clock hypothesis
proposes that it should. The molecular clock hypothesis states that
most mutations that become fixed are neutral, since beneficial mutations
are rare and natural selection quickly rids populations of harmful mutations.
The same genes in different species should have approximately the same
size and same proportion of neutral DNA. Neutral DNA is predicted
to evolve at a constant rate per generation equal to the mutation rate,
so the same gene in different species should evolve at the same rate with
respect to generation time.
Neutral DNA is predicted to evolve at a constant rate equal to the
mutation rate based on the following algebraic argument: In a diploid
population with N individuals, there are 2N copies of each gene (two per
individual, N individuals). Eventually, through drift, all copies
in the population will be descendants of one of those 2N copies.
Since they're neutral, all have the same chance of being the ancestor to
the one that's eventually fixed, so the chance of any one of the 2N genes
in the population being the ancestor to the one fixed is 1/2N. Now
suppose the mutation rate is u per gene per generation. The chance
of a new mutant in a population = (mutation rate)x(number of genes)=(u)(2N).
Putting these steps together, the chance of a new mutant being the one
eventually fixed = (chance of new mutant occurring) x (chance of gene being
fixed). We've calculated the chance of a new mutant occurring to
be (u)(2N) and we've calculated the chance of a gene being fixed to be
1/2N so, plugging these into the equation we just wrote for the chance
of a new mutant being the one eventually fixed, we get:
chance of a new mutant being the one eventually fixed = (u)(2N)(1/2N) = u
So the rate of fixation of neutral alleles just equals the mutation
rate. Because mutation is a random process, it occurs at a rate that is
constant when averaged over time, so neutral substitutions should also
accumulate at a constant average rate.
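The algebra above can be checked numerically. In the Python sketch below, the first function just multiplies the two quantities from the argument to show that population size cancels out. The second function is a rough simulation of drift, written for illustration only: it assumes each generation's 2N copies are drawn at random from the previous generation, and the population size, number of trials, and mutation rate are made-up values.

```python
import random

def neutral_substitution_rate(u, n_individuals):
    """Rate of fixation of neutral mutations per generation:
    (new mutants per generation) x (chance that each is eventually fixed)."""
    copies = 2 * n_individuals
    new_mutants = u * copies        # (u)(2N)
    chance_fixed = 1.0 / copies     # 1/2N
    return new_mutants * chance_fixed

def simulated_fixation_chance(n_individuals, trials, seed=1):
    """Estimate by simulation how often one new neutral copy among 2N
    copies ends up fixed through drift alone."""
    rng = random.Random(seed)
    copies = 2 * n_individuals
    fixed = 0
    for _ in range(trials):
        count = 1  # start with a single new mutant copy
        while 0 < count < copies:
            freq = count / copies
            # Each copy in the next generation is the mutant with
            # probability equal to its current frequency.
            count = sum(rng.random() < freq for _ in range(copies))
        fixed += count == copies
    return fixed / trials

u = 1e-8  # hypothetical mutation rate per gene per generation
for n in (10, 1_000, 100_000):
    print(f"N = {n:>7}: substitution rate = {neutral_substitution_rate(u, n):.2e}")

estimate = simulated_fixation_chance(n_individuals=10, trials=4000)
print(f"simulated chance of fixation with N = 10: {estimate:.3f} (1/2N = {1 / 20:.3f})")
```

Large populations produce more new mutants each generation, but each mutant is less likely to be the one that is fixed, and the two effects cancel exactly.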
If the molecular clock hypothesis is true, the number of base pair
differences between species will directly reflect phylogeny without consideration
of what is primitive and what is derived: since new derived traits
evolve at a constant rate, we won't have the situation where distant relatives
are similar just because of retention of higher than usual numbers of
primitive states. This would be useful for phylogenetic analysis,
especially since finding an appropriate outgroup can be problematic --
sometimes the necessary previous studies to find out what species would
make good outgroups have not been done, and sometimes appropriate outgroups
have all gone extinct, leaving distant relatives as the only possible choices
for outgroups.
If the molecular clock hypothesis is true, to put a time scale on a
phylogeny we need to know the rate at which a particular stretch of DNA
(for example, a particular gene) evolves. To do this, we figure out how
many differences among species in this gene have evolved in a certain
length of time. To find the length of time, we need to use fossils. Here
are the basic steps for determining the rate of evolution of a gene:
1. Find at least two modern species for which the date of speciation can
be determined from the fossil record. What we need are two species with a
good fossil record so that we can find fossils from the time that we can
tell, from the fossil record, they were speciating. We then date these
fossils using methods of radioactive dating that use the constant rate of
decay of radioactive elements to put dates on fossils. This tells us how
long it has been since these two species evolved into separate species --
the time since speciation.
2. Use molecular techniques to determine the DNA sequence of the same gene
in each of our two modern species. Determine the number of differences in
DNA sequence between these two genes.
3. We know that the differences in DNA sequence that we just counted in
step 2 must have evolved since the two species speciated and we know the
time since they speciated from step 1. So we now determine the rate of DNA
evolution for the gene we're studying as:
rate = (# DNA differences between the two modern species)/(time since speciation)
4. Now we can use this rate to determine the dates of speciation for other
species, for which we can not determine a date of speciation from fossils.
Remember that we are assuming that the rate of DNA evolution for this gene
(which we calculated in step 3) is constant. So now we can take any two
modern species and count the number of DNA differences in this gene
between them. We then take the number of DNA differences between them and
divide by the rate of DNA evolution (from step 3) and that tells us the
time since speciation.
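Steps 3 and 4 amount to two divisions, sketched below in Python. The function names and all of the numbers are hypothetical, chosen only to show the arithmetic. As in the steps above, the rate is expressed as differences between two species per year of separation.

```python
def calibrate_rate(differences, years_since_speciation):
    """Step 3: rate of DNA evolution for a gene, from a pair of species
    whose time since speciation is known from fossils."""
    return differences / years_since_speciation

def estimate_time_since_speciation(differences, rate):
    """Step 4: time since speciation for another pair of species,
    assuming the gene evolves at the constant rate found in step 3."""
    return differences / rate

# Hypothetical calibration pair: 30 differences in the gene, and fossils
# dating their speciation to 10 million years ago.
rate = calibrate_rate(differences=30, years_since_speciation=10_000_000)

# Hypothetical second pair with no useful fossils: 12 differences.
years = estimate_time_since_speciation(differences=12, rate=rate)

print(f"rate: {rate:.1e} differences per year")
print(f"estimated time since speciation: {years:,.0f} years")
```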
The method just described only works if the molecular clock hypothesis
is true. The molecular clock hypothesis is, as the name implies, a
hypothesis -- we do NOT know that it is true, although we have some
theoretical reason (as discussed above) to expect that it may be true for
DNA that is neutral with respect to selection. So now we need to consider
how to test the molecular clock hypothesis. We can test the molecular
clock hypothesis using something called the relative rates test, which we
do as follows:
1. We determine the phylogeny for a group of species. Suppose, for
example, that we determine the phylogeny of four species, A, B, C, and D,
to be the following, in which A branches off first, then B, leaving C and
D as each other's closest relatives.
2. If DNA is evolving at a constant rate, pairs of species that have been
evolving separately from each other for the same amount of time should
have the same number of DNA differences between them. Any time we can
compare an outgroup to several ingroup species, the outgroup is equally
related to all the ingroup species, so each ingroup species has been
evolving separately from the outgroup for the same amount of time. So the
number of differences between each ingroup species and the outgroup should
be the same for all ingroup species if the molecular clock hypothesis is
true. On the phylogeny above, A is the outgroup to B, C, and D. This means
that the number of differences between A and B should be the same as the
number of differences between A and C and the number of differences
between A and D. Also on the phylogeny above, B is the outgroup to C and D
(that is, B is equally related to C and D). As a result, B and C should
have the same number of DNA differences between them as do B and D.
If we find, for our phylogeny, that the numbers of DNA differences between
pairs of equally related species are in fact the same, then the molecular
clock hypothesis is supported and we are justified in using it to find
dates on our phylogeny. If not, then the molecular clock hypothesis is not
supported, and we can not find dates on our phylogeny (unless we know them
from a good fossil record).
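The comparisons in the relative rates test can be sketched in Python as follows. Everything here is hypothetical: the sequences were invented so that A is the outgroup to B, C, and D and B is the outgroup to C and D, and the function names are made up for the sketch. The tolerance argument is an assumption added because real counts are rarely exactly equal; deciding how much difference is too much requires a statistical test that this sketch does not attempt.

```python
def count_differences(seq1, seq2):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))

def outgroup_distances(outgroup_seq, ingroup_seqs):
    """Differences between the outgroup and each ingroup species."""
    return {
        name: count_differences(outgroup_seq, seq)
        for name, seq in ingroup_seqs.items()
    }

def passes_relative_rates(distances, tolerance=0):
    """True if all outgroup-to-ingroup counts agree within the tolerance."""
    values = list(distances.values())
    return max(values) - min(values) <= tolerance

# Hypothetical aligned sequences for the same gene in four species.
seqs = {
    "A": "TGATACGTACGTACGTACGT",
    "B": "ACGTCCGTGTGTACGTACGT",
    "C": "ACGTCCGTACGTTCGTGCGT",
    "D": "ACGTCCGTACGTTCGTAAGT",
}

# A is the outgroup to B, C, and D.
a_vs_rest = outgroup_distances(seqs["A"], {k: seqs[k] for k in "BCD"})
# B is the outgroup to C and D.
b_vs_rest = outgroup_distances(seqs["B"], {k: seqs[k] for k in "CD"})

print("A compared to B, C, D:", a_vs_rest, "clock supported:", passes_relative_rates(a_vs_rest))
print("B compared to C, D:   ", b_vs_rest, "clock supported:", passes_relative_rates(b_vs_rest))

# Hypothetical case where species D has evolved faster than the others.
fast_d = dict(seqs, D="ACGTCCGTACGTTCGTAATA")
a_vs_fast = outgroup_distances(fast_d["A"], {k: fast_d[k] for k in "BCD"})
print("With a fast-evolving D:", a_vs_fast, "clock supported:", passes_relative_rates(a_vs_fast))
```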
Evidence from real groups that have been studied indicates that sometimes,
for some species and some genes, the molecular clock hypothesis is
supported -- it passes the relative rates test. Other times, for other
species and other genes, it is not supported. This suggests we can not
assume it is always true; we need to test it using tests such as the
relative rates test if we are to use it to study phylogeny and to put
dates of speciation onto our phylogenies.
Using organelle DNA in phylogenetic analysis: remember
that DNA occurs in some organelles, mitochondria and chloroplasts, as well
as in the nucleus. Such DNA has often been used in phylogenetic analysis.
In vertebrates, mitochondrial DNA has been used for many analyses because
it is easier to study (for molecular reasons) than nuclear DNA. In
addition, it has some features that make it particularly suited for studies
of the phylogenies of populations within species. It has a high mutation
rate and is thus relatively rapidly evolving (as with other DNA, neutral
parts evolve fastest), so populations in species and individuals within
populations have often evolved differences from each other that can be
phylogenetically informative. It is maternally inherited and non-recombining,
so each haplotype (haploid genotype) reflects just one ancestral
pathway. As a result, it is possible to make phylogenies of the haplotypes
of many individuals in a population to learn about processes that are occurring
that may affect speciation. For example, if haplotypes within populations
turn out to be each other's closest relatives, we have evidence of low gene
flow, suggesting populations are evolving independently and may be in the
process of speciation.
A potential problem in mtDNA is something called lineage sorting
bias, a process through which mtDNA haplotype relationships may come
not to reflect relationships among species. This can occur if an
ancestral species is polymorphic -- that is, has several different mtDNA
haplotypes. These haplotypes will be related to each other in some
way, having evolved to be different within the species. It may turn
out that, as this ancestral species undergoes speciation, the haplotypes
that are most closely related end up in less closely related species.