After discovering this same data on another thread along with more discussion than has appeared here (I've taken the liberty of pinging the participants of that discussion), I see what the "mystery" is supposed to be -- it's supposed to be why some sites had multiple mutations while (small) stretches of other sites had none. In other words, why do the mutations appear clustered?
(You know, it would really help if people explained their points and questions in more detail, instead of leaving people to guess what the poster was thinking...)
[LLLICHY wrote:] "U238" that decays thrice, pretty good trick when there is "U238" that does not decay at all in 50,000,000 years.
Actually, no site had mutations "thrice". Three different bases at a given site is only *two* mutations (one original base, plus two mutations from it to something else).
Here's the "mutation map" from the actual DNA data:
--1-12--1-1-1-1--------1112112--1---1-11-1--------1 ALL
No mutations ("-") in about half the sites, one mutation at several (17) sites, two mutations at three sites.
The first thing to keep in mind is that random processes tend to "cluster" more than people expect anyway. People expect "randomness" to "spread out" somewhat evenly, but instead it's usually more "clumped", for statistical reasons that would be a diversion to go into right now. So "that looks uneven" isn't always a good indication that something truly is non-random.
If you don't believe me on that, I wrote a program which made 23 mutations totally at random on a 51-site sequence, then repeated the process to see what different random outcomes would look like:
10 X$=STRING$(51,"-")
20 FOR I=1 TO 23
30 J%=INT(RND*51)+1
40 C$=MID$(X$,J%,1)
50 IF C$="-" THEN MID$(X$,J%,1)="1" ELSE MID$(X$,J%,1)=CHR$(ASC(C$)+1)
60 NEXT I
70 PRINT X$
80 GOTO 10
Yeah, it's BASIC, so sue me. Here's a typical screenful of the results:
-21---1---2---111----2-----2-1121-------1---1--11-1
-1--1--21-11---1-1--1-1---1----1---21-11111---11---
3-11---3-----1-----11-2-1---1--1----3--2---1--1----
---1-1--22--1-1--2-2111--1-1111---1------1-------1-
---32----1-11-1-----1---2-231----1------1-----11--1
----2---21--1---4----1-------------11-1--111-11-211
11--1-1---1-----1--1------1----3111--1----111-2-1-2
1112---1-3-1----1-1-----1-1------121--111-------1-1
-111121--1----1----1-1-1-1-11-2---1-1-------1-111--
-----------11-1---11-11--------21----12211--1---131
--1-211-1-1----21--11-1-2----1--1----11---11-----11
12---1-13------------2---21-21---11-1-1-1--2-------
-----2-1---1-1----21--11-11-1---111-1--111-----2--1
-----1-----1-1-1-1---1-2----11-21-11--1-111---1-21-
---11--1-1-122-1-1-1--1-----2-1-1-1-------1-1---111
--2--11----2--1---12-2----1-1---1-1--1--12----1-1-1
-111-1-----1-1----------1-21111--1-2-11-11-1----11-
11-1--211-1221-----1--1-----11--1-2-1----------11--
-----1-12-11---2-1---11--1-2--1----11---111-1----11
11----1--12---12----1---31---1-11----2--1-11-1-----
---1--111-1--1-1-111----1-21----1-1-3---1------2--1
-2-11----1-1------1------2-1-1--111-111-1-1----1111
1--1--1-1---1-111111--2--1-1------112----2---11----
Notice how oddly "clustered" most of them look, including one run which left a 13-site stretch "absolutely untouched", contrary to intuition (while having *4* mutations at a single site!)
Frankly, I don't see anything in the real-life DNA mutation map which looks any different from these truly random runs. Random events tend to cluster more than people expect. That solves the "mystery" right there.
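For anyone without a BASIC interpreter handy, the same experiment can be sketched in Python. This is only an illustrative re-creation, not the program that produced the screenful above: the function names (`mutate`, `as_map`, `longest_gap`) and the fixed seed are my own assumptions, and Python's random number generator will not reproduce the exact rows shown.

```python
import random

def mutate(sites=51, mutations=23, rng=random):
    """Drop `mutations` hits uniformly at random onto `sites` positions."""
    counts = [0] * sites
    for _ in range(mutations):
        counts[rng.randrange(sites)] += 1
    return counts

def as_map(counts):
    """Render counts as a mutation map: '-' for untouched, else the hit count."""
    return "".join("-" if c == 0 else str(c) for c in counts)

def longest_gap(counts):
    """Length of the longest run of consecutive untouched sites."""
    best = run = 0
    for c in counts:
        run = run + 1 if c == 0 else 0
        best = max(best, run)
    return best

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed (an arbitrary choice) so runs are repeatable
    for _ in range(10):
        counts = mutate(rng=rng)
        print(as_map(counts), "longest untouched stretch:", longest_gap(counts))
```

Printing the longest untouched stretch next to each row makes it easy to see how often purely random placement leaves a sizeable gap.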
Also, there may be a selection factor -- the GLO gene is a *lot* bigger than this. One has to wonder if this small 51-bp section was presented just because it was the one that looked "least random". That would be a no-no, since one can always hand-select the most deviant subset out of a larger sample in order to artificially skew the picture.
However, since there are some interesting evolutionary observations to be made, let's look at that DNA data again, slightly rearranged:
TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC PIG
TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC BOS
TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC RAT
TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG MOUSE
TAC CCT GTG GGG GTG CGC TTC ACC CGG GGG GAC GAC ATC CTG CTG AGC CCC GUIN PIG
TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC HUMAN
TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC CHIMPANZEE
TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC ORANGUTAN
TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC MACAQUE
Here I've put spaces between codons, and clustered the closely-related species together: pig/cow as ungulates, rat/mouse for their obvious relationship, guinea pig right below them but separated because of the pseudogene nature of its GLO gene, then primates all in a group, with man's closest relative, the chimp, immediately below him, followed by the more distant orangutan, and the even more distant macaque. Also note that the top four have "working" GLO genes, and the bottom five have "broken" GLO pseudogenes.
First, let's consider just the four species with working GLO genes. Evolution predicts that even over large periods of time, these genes will be "highly conserved", with natural selection weeding out mutations that could "break" the gene. Note that the mutations will still have occurred in individuals of the population, but natural selection will "discourage" that mutation from spreading into the general population.
And before we go any further, let's talk about the "universal genetic code". In all mammals (indeed, in almost all living organisms), each triplet of DNA sites causes a particular amino acid to be formed. The mapping of triplets (called "codons") to amino acids is as follows:
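Second Position of Codon
First Position | T | C | A | G | Third Position
T | TTT Phe | TCT Ser | TAT Tyr | TGT Cys | T
T | TTC Phe | TCC Ser | TAC Tyr | TGC Cys | C
T | TTA Leu | TCA Ser | TAA Stop | TGA Stop | A
T | TTG Leu | TCG Ser | TAG Stop | TGG Trp | G
C | CTT Leu | CCT Pro | CAT His | CGT Arg | T
C | CTC Leu | CCC Pro | CAC His | CGC Arg | C
C | CTA Leu | CCA Pro | CAA Gln | CGA Arg | A
C | CTG Leu | CCG Pro | CAG Gln | CGG Arg | G
A | ATT Ile | ACT Thr | AAT Asn | AGT Ser | T
A | ATC Ile | ACC Thr | AAC Asn | AGC Ser | C
A | ATA Ile | ACA Thr | AAA Lys | AGA Arg | A
A | ATG Met | ACG Thr | AAG Lys | AGG Arg | G
G | GTT Val | GCT Ala | GAT Asp | GGT Gly | T
G | GTC Val | GCC Ala | GAC Asp | GGC Gly | C
G | GTA Val | GCA Ala | GAA Glu | GGA Gly | A
G | GTG Val | GCG Ala | GAG Glu | GGG Gly | G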
(The above table imported from http://psyche.uthct.edu/shaun/SBlack/geneticd.html, which also has a nice introduction to the genetic code.)
Another version of the same table with nifty Java features and DNA database lookups can be found here.
The thing which is most relevant to the following discussion is the fact that the genetic code is "redundant" -- for most amino acids, more than one codon (triplet) encodes exactly the same amino acid. This means that even in genes which are required for the organism, certain basepair mutations make absolutely no difference, because the change is from one codon which maps to amino acid X to another codon which still maps to amino acid X. (This fact allows certain kinds of evolutionary "tracers" to be "read" from the DNA, as described here).
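To make the redundancy concrete, here is a small Python sketch that builds the standard genetic code as a lookup table and groups codons by the amino acid they encode. The names `BASES`, `AMINO`, `CODON_TABLE` and `synonyms` are my own for illustration; the 64-letter string is the standard code written with one-letter amino acid symbols ("*" for stop) in T, C, A, G order.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon ('*' = stop), in TCAG order:
# the first base varies slowest, the third base fastest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each three-letter DNA codon to its amino acid.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group the codons by the amino acid they encode.
synonyms = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonyms[aa].append(codon)

if __name__ == "__main__":
    for aa, codons in sorted(synonyms.items(), key=lambda kv: -len(kv[1])):
        print(aa, len(codons), " ".join(codons))
```

Only methionine (M) and tryptophan (W) are encoded by a single codon; every other amino acid has at least two, which is the redundancy the discussion relies on.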
Now back to our DNA data. The redundancy in the genetic code means that some basepair sites will have more "degrees of freedom" than others (i.e., ways in which they can mutate without disrupting the gene's biological function in any way). Let's look at the four species with working GLO genes again:
TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC PIG
TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC BOS
TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC RAT
TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG MOUSE
..T ..T ..T ..A ..T ..T ..T ..T A.T ..C ..C ..T ..T T.T T.T ..T ..T
... ..A ..A ... ..A ..A ... ..A ..C ..A ... ... ..A ..A ..A ... ..A
... ..G ..C ... ..C ..G ... ..G ..G ..G ... ... ... ..C ..C ... ..C
--- --- --1 --- --1 --- --- --1 --2 -12 --1 --- --1 --- --- --- --1
Under each site of the mouse DNA, I've listed the "alternative" bases which could be substituted for the mouse base at that site WITHOUT ALTERING THE GENE'S FUNCTION (because of genetic code redundancy); a "." means there is no further alternative for that site. And under that I show the "mutation map" of just those four species.
Note that most of the "alternative" bases are in the third base of each codon, *and* that this is where all but one of the mutations have appeared. This is because these were the sites which were "free" to mutate in the way they did, because the mutation was genetically neutral. That doesn't mean that the first and second sites of each codon were immune from mutation, it's just that when mutations did occur at those sites, natural selection weeded them out quickly because they most likely "broke" the GLO gene for the individuals which received that mutation. What we see above is the results after natural selection has already "filtered" the undesirable mutations and left the ones which "do no harm".
Additionally, the two sites which have mutated twice (i.e. have a "2" in the mutation map) are ones which had more "allowable" mutations. Also note that of the sites which had the fewest allowable alternatives (only one alternate letter allowed), all but one (the third base of the eleventh codon, GAC/GAT) didn't have any mutations fix, which is unsurprising since a "safe" mutation would be less likely to occur there versus a site that "allowed" two or three alternatives.
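Both the mutation map and the list of "safe" alternatives can be derived mechanically, which is a handy cross-check. The sketch below is my own illustration (the helper names `mutation_map` and `silent_alternatives` are hypothetical): it counts a site's mutations as the number of distinct bases minus one, and calls a substitution "safe" when the standard genetic code maps the changed codon to the same amino acid.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order, one letter per codon ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The four working GLO sequences as quoted in the post.
WORKING = {
    "PIG":   "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",
    "BOS":   "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",
    "RAT":   "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",
    "MOUSE": "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",
}

def mutation_map(rows):
    """Per site: '-' if all rows agree, otherwise (distinct bases - 1)."""
    cells = []
    for codons in zip(*(row.split() for row in rows)):
        cell = ""
        for bases in zip(*codons):
            n = len(set(bases)) - 1
            cell += "-" if n == 0 else str(n)
        cells.append(cell)
    return " ".join(cells)

def silent_alternatives(codon):
    """For each of the 3 positions, the bases that keep the amino acid unchanged."""
    result = []
    for i, old in enumerate(codon):
        result.append([b for b in BASES
                       if b != old
                       and CODON_TABLE[codon[:i] + b + codon[i + 1:]] == CODON_TABLE[codon]])
    return result

if __name__ == "__main__":
    print(mutation_map(WORKING.values()))
    for codon in WORKING["MOUSE"].split():
        print(codon, silent_alternatives(codon))
```

For example, `silent_alternatives("CGA")` reports one safe change at the first base (to AGA, still arginine) and three at the third base, while `silent_alternatives("GAT")` reports only a single safe change, at the third base.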
All this is as predicted by evolutionary theory, you'll note.
It also explains the one anomaly of the original mutation map, which is that the mutation counts do tend to be higher at the third base of a codon.
However... What about the one exception? The pig DNA has one mutation which, unlike *all* the other ones, changes which amino acid the codon encodes. In the pig DNA, the GGG codon (mapping to Glycine) has changed to a GCG codon (mapping to Alanine). What's up with that? Well, one of two things. First and most likely, just as base values in codons have a built-in redundancy, so do the amino acids which make up the proteins which result from the DNA templates. In other words, certain amino acids can be substituted for other ones at some sites in given proteins without making any functional difference. (This "protein functional redundancy" also has implications for "evolutionary tracer" analysis, see here.) That may well be the case for Alanine versus Glycine in the GLO protein, but I'm not enough of a biochemist to be able to say. The other option is that it *does* make some difference in the function of the pig GLO protein, but not enough to "break" the vitamin-C synthesis (as proven by the fact that pigs *can* synthesize vitamin C). So one way or another, it's not a deal-breaker even though pig GLO will not be 100% identical to cow/mouse/rat GLO. It's yet another "allowable" mutation.
More interesting evolutionary observations: The number of mutational differences between pig/cow is 3, the number between mouse/rat is 4, and the difference between rat/cow is 7 -- all roughly as one would expect from the evolutionary relatedness of these animals (cows/pigs and rats/mice are each closer to each other than the rodents are to the ungulates).
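Those pairwise counts are easy to double-check by machine. A minimal Python sketch, with the sequences copied from the post and a helper name (`differences`) of my own choosing:

```python
from itertools import combinations

# The four working GLO sequences as quoted in the post.
SEQS = {
    "PIG":   "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",
    "BOS":   "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",
    "RAT":   "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",
    "MOUSE": "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",
}

def differences(a, b):
    """Number of sites at which two equal-length sequences differ."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    for (name_a, seq_a), (name_b, seq_b) in combinations(SEQS.items(), 2):
        print(f"{name_a:>5} vs {name_b:<5}: {differences(seq_a, seq_b)}")
```

Running it confirms 3 for pig/cow, 4 for rat/mouse and 7 for rat/cow. This is a plain site-by-site count, not a proper evolutionary distance, which would have to correct for repeated hits at the same site.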
Now let's take a close look at the guinea pig:
TAC CCT GTG GGG GTG CGC TTC ACC CGG GGG GAC GAC ATC CTG CTG AGC CCC GUIN PIG
--- --1 --- -1- --- --- --- --- --1 --1 --1 --- --- --- --- --- --1
The "mutation map" under the guinea pig DNA shows where it differs from the mouse DNA. Fascinating: Note that five of the six mutations are in the third base of a codon, *and* are of the type "allowed" by the genetic code redundancy. This indicates strongly that most of the evolutionary divergence between guinea pigs and mice likely occurred while the guinea pig's ancestors still had a working GLO gene. This is the sort of prediction implied by the evolutionary theory which could be cross-checked by further research of various types, and if verified, would be yet further confirmation that evolutionary theory is likely correct. So far, evolutionary theory has been subjected to literally countless tests like this, large and small, and the vast majority of results have confirmed the evolutionary prediction. This track record is hard to explain if evolution is an invalid theory, as some assert...
Finally, let's look over the primate DNA and mutation map (relative to each other):
TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC HUMAN
TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC CHIMPANZEE
TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC ORANGUTAN
TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC MACAQUE
--1 -1- --- --- 1-1 --- --- --- 111 1-- --- --- 1-- 1-- --- --- ---
Evolutionary theory predicts that because the GLO gene is "broken" in primates (i.e. is a pseudogene), mutations in it are highly likely to be neutral (i.e., make no difference, since it can't get much more broken), and thus mutations are just as likely to accumulate at any site as any other. Is that what we see? Yup. There's no obvious pattern to the mutations between primates in the above mutation map, and unlike the pig/cow/mouse/rat mutation map, the mutations aren't predominantly at the "safer" third base of a codon, nor of a type that would be "safe". In fact, one base has vanished entirely, but no biggie, the gene's already broken.
Also, although primates share a more recent common ancestor than cows/pigs/mice/rats, note that they've already racked up almost as many relative mutations as the cow/pig/mouse/rat DNA. This too is just as evolutionary theory predicts, because many mutations in a functional gene (GLO in this case) will be "non-safe" and weeded out by natural selection, making for a slower mutation fixation rate overall than in a pseudogene (as GLO is in primates) where natural selection doesn't "care" about the vast majority of mutations since *most* are neutral. So pseudogenes accumulate mutations faster than functional genes (even though the rate of mutation *occurrence* in both is likely the same).
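The contrast between the two groups can be put into numbers by tallying mutations by codon position. The Python sketch below is my own illustration (the helper name `mutations_by_position` is hypothetical); as in the mutation maps, it counts a site's mutations as the number of distinct bases minus one, and it treats the deleted base "*" as just another symbol.

```python
# Sequences as quoted in the post; '*' marks the base deleted in primates.
WORKING = [
    "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",  # pig
    "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",  # cow
    "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",  # rat
    "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",  # mouse
]
PRIMATES = [
    "TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC",  # human
    "TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC",  # chimpanzee
    "TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC",  # orangutan
    "TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC",  # macaque
]

def mutations_by_position(rows):
    """Total mutations at the 1st, 2nd and 3rd base of codons across a group."""
    totals = [0, 0, 0]
    for codons in zip(*(row.split() for row in rows)):
        for pos, bases in enumerate(zip(*codons)):
            totals[pos] += len(set(bases)) - 1  # distinct bases minus the original
    return totals

if __name__ == "__main__":
    print("working genes:", mutations_by_position(WORKING))
    print("pseudogenes:  ", mutations_by_position(PRIMATES))
```

On the sequences quoted here this gives [0, 1, 10] for the working genes and [5, 2, 3] for the primate pseudogenes: nearly everything sits at the third base when the gene works, and is spread across all three positions when it doesn't. With only 51 sites this is an illustration rather than a statistical test.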
Finally, note that there are ZERO mutational differences between the human DNA and the chimpanzee DNA, our nearest living relative.
I also see some interesting implications in the DNA sequences concerning which specific mutation fixed during what branch of the common-descent evolutionary tree for all the species represented, but reconstructing that would not only take another couple hours, at least, but would be a major bear to code in HTML, since I'd have to draw trees with annotations on the nodes... Bleugh.
In any case, I hope I've clarified some of the methods by which biologists find countless confirmations of evolution in DNA data. This is just a "baby" example, and to be more statistically valid would have to be done over much vaster sections of DNA sequences, but my intent was to demonstrate some of the concepts.
And if such a small amount of DNA as this can make small confirmations of evolutionary predictions, imagine the amount of confirmation from billion-basepair DNA data from each species compared across thousands of species... The amount of confirmatory discoveries for evolution from DNA analysis has already been vast, and promises to only grow in the future. For an overview of some of the different lines of evidence being studied, see The Journal of Molecular Evolution -- abstracts of all articles, current and back issues, can be browsed free online.
Grrr... Okay, make that "one" mutational difference between the human DNA and the chimpanzee DNA, not zero. At this time of morning my tired eyes couldn't see the difference between a "C" and a "G" the first time around.