Replies

[The poster known as Mr. LLLICHY wrote:] Here is that Vitamin C data

After discovering this same data on another thread along with more discussion than has appeared here (I've taken the liberty of pinging the participants of that discussion), I see what the "mystery" is supposed to be -- why did some sites have multiple mutations while (small) stretches of other sites had none? In other words, why do the mutations appear clustered?

(You know, it would really help if people explained their points and questions in more detail, instead of leaving people to guess what the poster was thinking...)

[LLLICHY wrote:] "U238" that decays thrice, pretty good trick when there is "U238" that does not decay at all in 50,000,000 years.

Actually, no site had mutations "thrice". Three different bases at a given site is only *two* mutations (one original base, plus two mutations from it to something else).

Here's the "mutation map" from the actual DNA data:

--1-12--1-1-1-1--------1112112--1---1-11-1--------1 ALL/n

No mutations ("-") at 31 of the 51 sites, one mutation at 17 sites, two mutations at three sites (23 mutations in all).

The first thing to keep in mind is that random processes tend to "cluster" more than people expect anyway. People expect "randomness" to "spread out" somewhat evenly, but instead it's usually more "clumped", for statistical reasons that would be a diversion to go into right now. So "that looks uneven" isn't always a good indication that something truly is non-random.

If you don't believe me on that, I wrote a program which made 23 mutations totally at random on a 51-site sequence, then repeated the process to see what different random outcomes would look like:

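5 REM EACH PASS: 23 RANDOM MUTATIONS ON A FRESH 51-SITE SEQUENCE
10 X$=STRING$(51,"-")
15 REM PICK A RANDOM SITE (1 TO 51) FOR EACH OF THE 23 MUTATIONS
20 FOR I=1 TO 23
30 J%=INT(RND*51)+1
40 C$=MID$(X$,J%,1)
45 REM FIRST HIT MARKS THE SITE "1"; LATER HITS BUMP THE DIGIT
50 IF C$="-" THEN MID$(X$,J%,1)="1" ELSE MID$(X$,J%,1)=CHR$(ASC(C$)+1)
60 NEXT I
70 PRINT X$
75 REM LOOP FOREVER, PRINTING ONE RANDOM RUN PER LINE
80 GOTO 10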

Yeah, it's BASIC, so sue me. Here's a typical screenful of the results:

-21---1---2---111----2-----2-1121-------1---1--11-1
-1--1--21-11---1-1--1-1---1----1---21-11111---11---
3-11---3-----1-----11-2-1---1--1----3--2---1--1----
---1-1--22--1-1--2-2111--1-1111---1------1-------1-
---32----1-11-1-----1---2-231----1------1-----11--1
----2---21--1---4----1-------------11-1--111-11-211
11--1-1---1-----1--1------1----3111--1----111-2-1-2
1112---1-3-1----1-1-----1-1------121--111-------1-1
-111121--1----1----1-1-1-1-11-2---1-1-------1-111--
-----------11-1---11-11--------21----12211--1---131
--1-211-1-1----21--11-1-2----1--1----11---11-----11
12---1-13------------2---21-21---11-1-1-1--2-------
-----2-1---1-1----21--11-11-1---111-1--111-----2--1
-----1-----1-1-1-1---1-2----11-21-11--1-111---1-21-
---11--1-1-122-1-1-1--1-----2-1-1-1-------1-1---111
--2--11----2--1---12-2----1-1---1-1--1--12----1-1-1
-111-1-----1-1----------1-21111--1-2-11-11-1----11-
11-1--211-1221-----1--1-----11--1-2-1----------11--
-----1-12-11---2-1---11--1-2--1----11---111-1----11
11----1--12---12----1---31---1-11----2--1-11-1-----
---1--111-1--1-1-111----1-21----1-1-3---1------2--1
-2-11----1-1------1------2-1-1--111-111-1-1----1111
1--1--1-1---1-111111--2--1-1------112----2---11----

Notice how oddly "clustered" most of them look, including one run which left a 13-site stretch "absolutely untouched", contrary to intuition (while having *4* mutations at a single site!)
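For anyone who would rather put a number on "how often does a big untouched stretch show up?", here is a minimal Python sketch of the same experiment. It assumes the same model as the BASIC program (23 mutations, each landing on one of 51 sites with equal probability); the function names, trial count and seed are my own choices for illustration. For comparison, the longest untouched stretch in the real-life map is 8 sites.

```python
import random

def longest_gap(n_sites=51, n_hits=23, rng=random):
    """Drop n_hits mutations uniformly at random on n_sites sites and
    return the length of the longest run of untouched sites."""
    hit = [False] * n_sites
    for _ in range(n_hits):
        hit[rng.randrange(n_sites)] = True
    longest = run = 0
    for h in hit:
        run = 0 if h else run + 1
        longest = max(longest, run)
    return longest

def fraction_with_gap(min_gap, trials=5000, seed=1):
    """Fraction of simulated runs whose longest untouched stretch
    is at least min_gap sites long (seeded so it is repeatable)."""
    rng = random.Random(seed)
    return sum(longest_gap(rng=rng) >= min_gap for _ in range(trials)) / trials

if __name__ == "__main__":
    for gap in (8, 10, 13):
        print(gap, fraction_with_gap(gap))
```

The printed fractions are estimates and will shift a little with the seed and trial count; the point is only to see how ordinary a gap of a given size is under pure chance.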

Frankly, I don't see anything in the real-life DNA mutation map which looks any different from these truly random runs. Random events tend to cluster more than people expect. That solves the "mystery" right there.
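The eyeball comparison can also be backed up with a little arithmetic. The sketch below assumes the simplest possible model (each of the 23 mutations lands independently and uniformly on one of the 51 sites, which is exactly what the BASIC program does) and computes how many sites you'd expect to be hit 0, 1, 2 or 3 times. The "observed" numbers are the counts from the ALL/n map; the function name is mine.

```python
from math import comb

def expected_sites(k, n_sites=51, n_hits=23):
    """Expected number of sites hit exactly k times when n_hits mutations
    land independently and uniformly on n_sites sites (binomial model)."""
    p = 1 / n_sites
    return n_sites * comb(n_hits, k) * p**k * (1 - p) ** (n_hits - k)

# Counts read off the real ALL/n mutation map: 31 dashes, 17 ones, 3 twos.
observed = {0: 31, 1: 17, 2: 3}

for k in range(4):
    print(k, round(expected_sites(k), 1), observed.get(k, 0))
```

Under that model the expected counts come out at roughly 32 untouched sites, 15 single hits and 3 double hits, which is right in line with the real map.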

Also, there may be a selection factor -- the GLO gene is a *lot* bigger than this. One has to wonder if this small 51-bp section was presented just because it was the one that looked "least random". That would be a no-no, since one can always hand-select the most deviant subset out of a larger sample in order to artificially skew the picture.

However, since there are some interesting evolutionary observations to be made, let's look at that DNA data again, slightly rearranged:

TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC  PIG
TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC  BOS

TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC  RAT
TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG  MOUSE

TAC CCT GTG GGG GTG CGC TTC ACC CGG GGG GAC GAC ATC CTG CTG AGC CCC  GUIN PIG

TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC  HUMAN
TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC  CHIMPANZEE
TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC  ORANGUTAN
TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC  MACAQUE

Here I've put spaces between codons, and clustered the closely-related species together: pig/cow as ungulates, rat/mouse for their obvious relationship, guinea pig right below them but separated because of the pseudogene nature of its GLO gene, then primates all in a group, with man's closest relative, the chimp, immediately below him, followed by the more distant orangutan, and the even more distant macaque. Also note that the top four have "working" GLO genes, and the bottom five have "broken" GLO pseudogenes.
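The "mutation map" quoted at the top can be rebuilt directly from these nine sequences, using the counting rule already mentioned (N different bases at a site means N-1 mutations). This is a small Python sketch of that bookkeeping; treating the "*" as a missing base that is simply ignored is my assumption, and the function name is made up for the example.

```python
SEQS = {
    "PIG":        "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",
    "BOS":        "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",
    "RAT":        "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",
    "MOUSE":      "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",
    "GUIN PIG":   "TAC CCT GTG GGG GTG CGC TTC ACC CGG GGG GAC GAC ATC CTG CTG AGC CCC",
    "HUMAN":      "TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC",
    "CHIMPANZEE": "TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC",
    "ORANGUTAN":  "TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC",
    "MACAQUE":    "TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC",
}

def mutation_map(seqs):
    """One character per site: '-' if every species has the same base,
    otherwise (number of distinct bases - 1)."""
    rows = [s.replace(" ", "") for s in seqs]
    out = []
    for column in zip(*rows):
        bases = set(column) - {"*"}      # assumption: '*' = deleted base, not counted
        n = max(len(bases) - 1, 0)
        out.append(str(n) if n else "-")
    return "".join(out)

print(mutation_map(SEQS.values()))
```

Passing only a subset of the sequences (say, just the four with working genes, or just the primates) gives the map for that group.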

First, let's consider just the four species with working GLO genes. Evolution predicts that even over large periods of time, these genes will be "highly conserved", with natural selection weeding out mutations that could "break" the gene. Note that the mutations will still have occurred in individuals of the population, but natural selection will "discourage" that mutation from spreading into the general population.

And before we go any further, let's talk about the "universal genetic code". In all mammals (indeed, in almost all living organisms), each triplet of DNA sites causes a particular amino acid to be formed. The mapping of triplets (called "codons") to amino acids is as follows:

                      Second Position of Codon
First                                                               Third
Position      T              C              A              G        Position

   T      TTT Phe [F]    TCT Ser [S]    TAT Tyr [Y]    TGT Cys [C]     T
          TTC Phe [F]    TCC Ser [S]    TAC Tyr [Y]    TGC Cys [C]     C
          TTA Leu [L]    TCA Ser [S]    TAA Ter [end]  TGA Ter [end]   A
          TTG Leu [L]    TCG Ser [S]    TAG Ter [end]  TGG Trp [W]     G

   C      CTT Leu [L]    CCT Pro [P]    CAT His [H]    CGT Arg [R]     T
          CTC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]     C
          CTA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]     A
          CTG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]     G

   A      ATT Ile [I]    ACT Thr [T]    AAT Asn [N]    AGT Ser [S]     T
          ATC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]     C
          ATA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]     A
          ATG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]     G

   G      GTT Val [V]    GCT Ala [A]    GAT Asp [D]    GGT Gly [G]     T
          GTC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]     C
          GTA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]     A
          GTG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]     G

(The above table imported from http://psyche.uthct.edu/shaun/SBlack/geneticd.html, which also has a nice introduction to the genetic code.)

Another version of the same table with nifty Java features and DNA database lookups can be found here.

The thing which is most relevant to the following discussion is the fact that most of the genetic code is "redundant" -- more than one codon (triplet) encodes exactly the same amino acid. This means that even in genes which are required for the organism, certain basepair mutations make absolutely no difference if the change is from one codon which maps into amino acid X to another codon which still maps into amino acid X. (This fact allows certain kinds of evolutionary "tracers" to be "read" from the DNA, as described here).
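To make the redundancy concrete, here is a short Python sketch that builds the standard codon table shown above and counts how many codons map to each amino acid. The compact 64-letter string is just the table read off in T, C, A, G order with "*" standing for "Ter [end]"; the variable names are mine.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Amino acids (one-letter codes) for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' = stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many different codons encode each amino acid?
degeneracy = Counter(CODON_TABLE.values())
for aa, n in sorted(degeneracy.items(), key=lambda item: -item[1]):
    print(aa, n)
```

Only Met [M] and Trp [W] have a single codon each; everything else has at least two, and Leu, Ser and Arg have six apiece.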

Now back to our DNA data. The redundancy in the genetic code means that some basepair sites will have more "degrees of freedom" than others (i.e., ways in which they can mutate without disrupting the gene's biological function in any way). Let's look at the four species with working GLO genes again:

TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC  PIG
TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC  BOS
TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC  RAT
TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG  MOUSE
  T   T   T   A   T   T   T   T A T   C   C   T   T T T T T   T   T
      A   A       A   A       A   C   A           A   A   A       A
      G   C       C   G       G   G   G               C   C       C
--- --- --1 --- --1 --- --- --1 --2 -12 --1 --- --1 --- --- --- --1

Under each site of the mouse DNA, I've listed the "alternative" bases which could be substituted for the mouse base at that site WITHOUT ALTERING THE GENE'S FUNCTION (because of genetic code redundancy). And under that I show the "mutation map" of just those four species.

Note that most of the "alternative" bases are in the third base of each codon, *and* that this is where all but one of the mutations have appeared. This is because these were the sites which were "free" to mutate in the way they did, because the mutation was genetically neutral. That doesn't mean that the first and second sites of each codon were immune from mutation, it's just that when mutations did occur at those sites, natural selection weeded them out quickly because they most likely "broke" the GLO gene for the individuals which received that mutation. What we see above is the results after natural selection has already "filtered" the undesirable mutations and left the ones which "do no harm".
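The list of "alternative" bases doesn't have to be worked out by hand. As a rough cross-check, this Python sketch tries every single-base substitution at every site of the mouse sequence and keeps the ones that leave the amino acid unchanged under the standard genetic code. The helper name is hypothetical; the codon table is the one given earlier, packed into a string.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

MOUSE = "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG"

def synonymous_alternatives(codon):
    """For each of the three codon positions, the other bases that can be
    swapped in without changing the encoded amino acid."""
    alts = []
    for i, base in enumerate(codon):
        same = [b for b in BASES
                if b != base
                and CODON_TABLE[codon[:i] + b + codon[i + 1:]] == CODON_TABLE[codon]]
        alts.append("".join(same))
    return alts

for codon in MOUSE.split():
    print(codon, CODON_TABLE[codon], synonymous_alternatives(codon))
```

Each output line shows a mouse codon, its amino acid, and the "safe" substitutes for its first, second and third base (an empty string means no safe substitute exists at that position).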

Additionally, the two sites which have mutated twice (i.e. have a "2" in the mutation map) are ones which had more "allowable" mutations. Also note that, with one exception (the third base of the GAT/GAC codon), the sites which had the fewest allowable alternatives (only one alternate letter allowed) didn't have any mutations fix at those sites, which is unsurprising since a "safe" mutation would be less likely to occur there versus a site that "allowed" two or three alternatives.

All this is as predicted by evolutionary theory, you'll note.

It also explains the one anomaly of the original mutation map, which is that the mutation counts do tend to be higher at the third base of a codon.

However... What about the one exception? The pig DNA has one mutation which does *not* leave the encoded amino acid unchanged (all the other ones do). In the pig DNA, the GGG codon (mapping to Glycine) has changed to a GCG codon (mapping to Alanine). What's up with that? Well, one of two things. First and most likely, just as base values in codons have a built-in redundancy, so do the amino acids which make up the proteins which result from the DNA templates. In other words, certain amino acids can be substituted for other ones at some sites in given proteins without making any functional difference. (This "protein functional redundancy" also has implications for "evolutionary tracer" analysis, see here.) That may well be the case for Alanine versus Glycine in the GLO protein, but I'm not enough of a biochemist to be able to say. The other option is that it *does* make some difference in the function of the pig GLO protein, but not enough to "break" the vitamin-C synthesis (as proven by the fact that pigs *can* synthesize vitamin C). So one way or another, it's not a deal-breaker even though pig GLO will not be 100% identical to cow/mouse/rat GLO. It's yet another "allowable" mutation.
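You can see the pig exception directly by translating the four working sequences into amino acids. This is a minimal Python sketch using the standard codon table from earlier (packed into a string, one-letter amino acid codes); the names are mine.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

WORKING = {
    "PIG":   "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",
    "BOS":   "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",
    "RAT":   "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",
    "MOUSE": "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",
}

def translate(seq):
    """Translate a space-separated string of codons into one-letter amino acids."""
    return "".join(CODON_TABLE[codon] for codon in seq.split())

proteins = {name: translate(seq) for name, seq in WORKING.items()}
for name, protein in proteins.items():
    print(f"{name:6s} {protein}")
```

Despite eleven DNA-level mutations among the four species, cow, rat and mouse come out with identical amino acid strings, and pig differs from them only at the tenth position (A for Alanine instead of G for Glycine).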

More interesting evolutionary observations: The number of mutational differences between pig/cow is 3, the number between mouse/rat is 4, and the difference between rat/cow is 7 -- all roughly as one would expect from the evolutionary relatedness of these animals (cows/pigs and rats/mice are each closer to each other than the rodents are to the ungulates).
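Those difference counts are easy to check mechanically: line up two sequences and count the sites where they disagree. A small Python sketch (the function name is my own):

```python
from itertools import combinations

WORKING = {
    "PIG":   "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",
    "BOS":   "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",
    "RAT":   "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",
    "MOUSE": "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",
}

def differences(a, b):
    """Number of sites at which two aligned sequences carry different bases."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    return sum(x != y for x, y in zip(a, b))

# Print every pairwise comparison among the four species.
for (name_a, seq_a), (name_b, seq_b) in combinations(WORKING.items(), 2):
    print(f"{name_a}/{name_b}: {differences(seq_a, seq_b)}")
```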

Now let's take a close look at the guinea pig:

TAC CCT GTG GGG GTG CGC TTC ACC CGG GGG GAC GAC ATC CTG CTG AGC CCC  GUIN PIG
--- --1 --- -1- --- --- --- --- --1 --1 --1 --- --- --- --- --- --1

The "mutation map" under the guinea pig DNA is compared to the mouse DNA. Fascinating: Note that five of the six mutations are in the third base of a codon, *and* are of the type "allowed" by the genetic code redundancy. This indicates strongly that most of the evolutionary divergence between guinea pigs and mice likely occurred while the guinea pig's ancestors still had a working GLO gene. This is the sort of prediction implied by the evolutionary theory which could be cross-checked by further research of various types, and if verified, would be yet further confirmation that evolutionary theory is likely correct. So far, evolutionary theory has been subjected to literally countless tests like this, large and small, and the vast majority of results have confirmed the evolutionary prediction. This track record is hard to explain if evolution is an invalid theory, as some assert...
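Hand-drawn maps are easy to get slightly wrong, so a direct count is a useful cross-check. The Python sketch below walks the guinea pig and mouse sequences codon by codon and reports, for each differing site, which codon position it falls in and whether the two codons still encode the same amino acid under the standard code. The function name and output layout are my own.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

GUINEA_PIG = "TAC CCT GTG GGG GTG CGC TTC ACC CGG GGG GAC GAC ATC CTG CTG AGC CCC"
MOUSE      = "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG"

def codon_differences(a, b):
    """List (codon number, position in codon, codon in a, codon in b, synonymous?)
    for every site at which the two aligned sequences differ."""
    rows = []
    for n, (ca, cb) in enumerate(zip(a.split(), b.split()), start=1):
        for pos in range(3):
            if ca[pos] != cb[pos]:
                rows.append((n, pos + 1, ca, cb, CODON_TABLE[ca] == CODON_TABLE[cb]))
    return rows

for row in codon_differences(GUINEA_PIG, MOUSE):
    print(row)
```

The last field of each row is True when the change is "silent", which is the property the argument above depends on.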

Finally, let's look over the primate DNA and mutation map (relative to each other):

TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC  HUMAN
TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC  CHIMPANZEE
TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC  ORANGUTAN
TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC  MACAQUE
--1 -1- --- --- 1-1 --- --- --- 111 1-- --- --- 1-- 1-- --- --- ---

Evolutionary theory predicts that because the GLO gene is "broken" in primates (i.e. is a pseudogene), mutations in it are highly likely to be neutral (i.e., make no difference, since it can't get much more broken), and thus mutations are just as likely to accumulate at any site as any other. Is that what we see? Yup. There's no obvious pattern to the mutations between primates in the above mutation map, and unlike the pig/cow/mouse/rat mutation map, the mutations aren't predominantly at the "safer" third base of a codon, nor of a type that would be "safe". In fact, one base has vanished entirely, but no biggie, the gene's already broken.
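To put numbers on "no obvious pattern", here is a Python sketch that tallies mutations by codon position (first, second, third base) for the working-gene group and for the primates, using the same "distinct bases minus one" counting as the mutation maps. Ignoring the "*" (the vanished base) is my assumption; the names are made up for the example.

```python
WORKING = [
    "TAC CCC GTG GAG GTG CGC TTC ACT CGG GCG GAC GAC ATC CTG CTG AGC CCC",  # PIG
    "TAC CCC GTG GAG GTA CGC TTC ACT CGC GGG GAC GAC ATC CTG CTG AGC CCC",  # BOS
    "TAC CCC GTA GAG GTG CGC TTC ACC CGA GGC GAT GAC ATT CTG CTG AGC CCC",  # RAT
    "TAC CCC GTG GAG GTG CGC TTC ACC CGA GGT GAT GAC ATC CTG CTG AGC CCG",  # MOUSE
]
PRIMATES = [
    "TAC CTG GTG GGG GTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC",  # HUMAN
    "TAC CTG GTG GGG CTA CGC TTC ACC TGG AG* GAT GAC ATC CTA CTG AGC CCC",  # CHIMPANZEE
    "TAC CCG GTG GGG GTG CGC TTC ACC CAG AG* GAT GAC GTC CTA CTG AGC CCC",  # ORANGUTAN
    "TAA CCG GTG GGG GTG CGC TTC ACC CAA GG* GAT GAC ATC ATA CTG AGC CCC",  # MACAQUE
]

def counts_by_codon_position(seqs):
    """Return [first, second, third]: mutations tallied by position within the codon."""
    rows = [s.replace(" ", "") for s in seqs]
    counts = [0, 0, 0]
    for i, column in enumerate(zip(*rows)):
        bases = set(column) - {"*"}          # assumption: '*' (missing base) is not counted
        counts[i % 3] += max(len(bases) - 1, 0)
    return counts

print("working genes:", counts_by_codon_position(WORKING))
print("primates:     ", counts_by_codon_position(PRIMATES))
```

For the working genes the tally is 0, 1 and 10 (almost everything at the third base); for the primate pseudogenes it is 5, 2 and 3, with no preference for the "safe" third base at all.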

Also, although primates share a more recent common ancestor than cows/pigs/mice/rats, note that they've already racked up almost as many relative mutations as the cow/pig/mouse/rat DNA. This too is just as evolutionary theory predicts, because many mutations in a functional gene (GLO in this case) will be "non-safe" and weeded out by natural selection, making for a slower mutation fixation rate overall than in a pseudogene (as GLO is in primates) where natural selection doesn't "care" about the vast majority of mutations since *most* are neutral. So pseudogenes accumulate mutations faster than functional genes (even though the rate of mutation *occurrence* in both is likely the same).

Finally, note that there are ZERO mutational differences between the human DNA and the chimpanzee DNA, our nearest living relative.

I also see some interesting implications in the DNA sequences concerning which specific mutation fixed during what branch of the common-descent evolutionary tree for all the species represented, but reconstructing that would not only take another couple hours, at least, but would be a major bear to code in HTML, since I'd have to draw trees with annotations on the nodes... Bleugh.

In any case, I hope I've clarified some of the methods by which biologists find countless confirmations of evolution in DNA data. This is just a "baby" example, and to be more statistically valid would have to be done over much vaster sections of DNA sequences, but my intent was to demonstrate some of the concepts.

And if such a small amount of DNA as this can make small confirmations of evolutionary predictions, imagine the amount of confirmation from billion-basepair DNA data from each species compared across thousands of species... The amount of confirmatory discoveries for evolution from DNA analysis has already been vast, and promises to only grow in the future. For an overview of some of the different lines of evidence being studied, see The Journal of Molecular Evolution -- abstracts of all articles, current and back issues, can be browsed free online.

Finally, note that there are ZERO mutational differences between the human DNA and the chimpanzee DNA, our nearest living relative.

Grrr... Okay, make that "one". At this time of morning my tired eyes couldn't see the difference between a "C" and a "G" the first time around.

Nice! You put a lot of work into your posts.

Your post 2110.

WOW!!

Ichneumon’s #2110:

http://www.freerepublic.com/focus/news/963744/posts?page=2110#2110