From: Charles <>
Subject: [DNA] Genetic Distance calculation -- message from Bruce Walsh. He asked me to post it to the list
Date: Fri, 21 Nov 2003 15:37:52 -0500
Bruce asked me to post this message to the list. I invited him to
briefly join the list and debate the subject here directly rather than
have me in the middle. I hope he does. He also said that anyone could
email him directly. His email address is at the end of his message.
I'm a genealogist, not a math titan. :-) All I want is to get it
settled whether the method the FamilyTreeDNA website uses to
calculate Genetic Distance is correct, or whether the method John
Chandler advocates is more correct, in which case
FamilyTreeDNA should correct the algorithm used to calculate
Genetic Distance in their website pages used by surname project group
Here is the message from Bruce Walsh:
Could you please post this on the appropriate list. Many thanks
"Make things as simple as possible --- but no simpler" Einstein
The following is a little technical in places. However, the ideas are
John Chandler has raised some interesting points recently on this list,
some of which I agree with, others not.
First, John is correct in that what we are trying to count is the total
number of actual mutations. This plus the mutation rate sets the time.
Hence, we are always trying to estimate the actual number of mutations given
some observed difference in marker score.
My point is that, to do so, we need to follow a formal probability model;
otherwise our intuition can be misleading. John replies:
>>>>Unfortunately, that result is nonsense. The example is, in fact,
enough to explain to the whole list and requires no Bessel functions.
Consider two individuals who have between them actually experienced
exactly two mutations relative to a common ancestor. .. Therefore, we
have two equally likely cases: either the two mutations canceled each other
out, giving an observed difference of 0, or the two mutations
reinforced, giving an observed difference of 2.<<
John's point is correct in that, GIVEN we know how many mutations have
occurred, if 2 have occurred then a match and an off-by-two are equally
likely. However, the problem is not this, but rather the opposite: given
an observed state (say an observed difference of two), how many actual
mutations have occurred? This is a standard Markov-chain problem, a
common model used in probability for modeling all sorts of things. For
example, the probability that two alleles are off by 2k steps given that
2M mutations have occurred is just
Pr(2k | 2M) = 2 (1/2)^(2M) Bi(2M, M-k)
where Bi(N,k) = N!/ [ (N-k)! k! ] is the binomial coefficient and n! =
n*(n-1)*(n-2)* .. * 1 is
the factorial of n. Note that this recovers John's result when M=1
(i.e., 2M = 2 mutations) and k=1.
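A minimal sketch of this formula in Python, for anyone who wants to try numbers. The function name is mine, not Bruce's. One assumption I have added: for k = 0 (an exact match) the leading factor of 2 is dropped, because +0 and -0 are the same outcome; that is what makes the probabilities add up to 1.

```python
from math import comb

def pr_diff_given_mutations(k, M):
    """Pr(alleles differ by 2k steps | 2M mutations in total).

    Each mutation is assumed to be a single step, up or down with
    probability 1/2 (the stepwise model discussed in the message).
    """
    if k < 0 or k > M:
        return 0.0
    # Assumption added here: no factor of 2 when k = 0.
    sign_factor = 1 if k == 0 else 2
    return sign_factor * 0.5 ** (2 * M) * comb(2 * M, M - k)

# John's example: two mutations (M = 1).
print(pr_diff_given_mutations(0, 1))  # match
print(pr_diff_given_mutations(1, 1))  # off by two
```

Both calls print 0.5, the "two equally likely cases" in John's example.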
However, the probability of seeing two alleles off by 2k after t
generations depends on both this probability and the mutation rate u, as
Pr(2k | t) = sum (over M) Pr(2k | 2M)*Pr(2M | t)
The probability of a total of 2M mutations in t generations is
Pr(2M | t) = Exp(-2ut) (2ut)^(2M)/(2M)!
The resulting infinite sum of the product of these two probabilities over
all appropriate values of 2M is 2 Exp(-2ut) Bess(2k, 2ut), where
Bess(k,x) is the value of the kth-order modified Bessel function of the
first kind evaluated at x.
Hence, Bessel functions arise from summing the appropriate series.
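As a numerical check on the step above, here is a small Python sketch that adds up the series term by term and compares it with the Bessel-function form. The function names are mine. I have taken the Bessel function to be the modified Bessel function of the first kind and computed it from its standard power series so that nothing beyond the standard library is needed. It is written for k >= 1.

```python
from math import comb, exp, factorial

def pr_diff_series(k, two_ut, terms=40):
    """Pr(2k | t) by direct summation over M of Pr(2k | 2M) * Pr(2M | t)."""
    total = 0.0
    for M in range(k, k + terms):
        walk = 2 * 0.5 ** (2 * M) * comb(2 * M, M - k)                 # Pr(2k | 2M)
        poisson = exp(-two_ut) * two_ut ** (2 * M) / factorial(2 * M)  # Pr(2M | t)
        total += walk * poisson
    return total

def bessel_i(order, x, terms=40):
    """Modified Bessel function of the first kind, from its power series."""
    return sum((x / 2) ** (2 * j + order) / (factorial(j) * factorial(j + order))
               for j in range(terms))

def pr_diff_bessel(k, two_ut):
    """Pr(2k | t) in the closed form 2 Exp(-2ut) Bess(2k, 2ut)."""
    return 2 * exp(-two_ut) * bessel_i(2 * k, two_ut)

for two_ut in (0.2, 0.8, 4.0):
    print(two_ut, pr_diff_series(1, two_ut), pr_diff_bessel(1, two_ut))
```

The two columns should agree to many decimal places. The values of 2ut in the loop are arbitrary illustrations, not estimates for any real marker.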
The probability of interest is Pr(2M | 2k) --- given we see a difference
of 2k, what is the probability that 2M mutations have actually occurred?
My impression of what others have said is that John
wishes to argue that the expected number of mutations for individuals off
by two steps is closer to 4 than to two. What John is correctly doing is
trying to count the actual number of mutations. The problem with his
squaring logic is that the math is different.
A simple example can make the case: Suppose very few generations have
passed, but we still see a two-step difference. It is FAR more likely that
only two mutations have occurred (both in the plus direction) than the
much more unlikely event of four mutations. To formally compute these,
we use Bayes' theorem for conditional probability:
Pr(2M | 2k, t) = Pr(2k | 2M, t) * Pr(2M | t) / Pr(2k | t)
We have values for all of the expressions above. Just plug them in.
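For readers who want to do the plugging in, here is a rough Python sketch of the Bayes calculation for an observed difference of 2k steps (k >= 1). The function names and the cut-off max_M are mine, and the values of 2ut tried at the bottom are illustrative choices only.

```python
from math import comb, exp, factorial

def posterior_mutations(k, two_ut, max_M=40):
    """Return {2M: Pr(2M | 2k, t)}, truncating the sum at 2*max_M mutations."""
    weights = {}
    for M in range(k, max_M + 1):
        likelihood = 2 * 0.5 ** (2 * M) * comb(2 * M, M - k)         # Pr(2k | 2M)
        prior = exp(-two_ut) * two_ut ** (2 * M) / factorial(2 * M)  # Pr(2M | t)
        weights[2 * M] = likelihood * prior
    evidence = sum(weights.values())                                 # Pr(2k | t)
    return {n: w / evidence for n, w in weights.items()}

def expected_mutations(k, two_ut):
    """Average number of mutations given an observed difference of 2k."""
    return sum(n * p for n, p in posterior_mutations(k, two_ut).items())

for two_ut in (0.2, 0.77, 4.0):
    post = posterior_mutations(1, two_ut)
    print(two_ut, post[2], post[4], expected_mutations(1, two_ut))
```

With 2ut = 0.2, nearly all of the probability sits on exactly two mutations. With 2ut = 0.77 the average comes out at roughly 2.1, and with 2ut = 4 it is a little above 4, in line with the figures quoted in this message.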
For "small" values of t (relative to the mutation rate), i.e., the time
scale of 50-200 generations, the expected (average) number of mutations,
2M, given we observe a two-step change is roughly 2.1. However, for very large
t, say 2ut = 4, then the expected value is closer to 4. For very, very
large values of t, say 2ut = N >> 1, then the expected value is closer to
It's important to close by stating that John and I agree more than we
disagree. In particular:
"There is, however, more to the story. The stepwise model does NOT give
a correct picture because it doesn't allow for two-step mutations" I
completely agree with this.
The good news is that the above analysis can be extended when we have
good estimates of the actual mutation rates.
Likewise, John is trying to do the correct (and very smart) thing, which
is to somehow count the actual number of mutations. The problem is that
his metric is not appropriate, as it does not suitably count things.
Associate Professor and Associate Department Head
(Associate Editor, Genetics)
Department of Ecology and Evolutionary Biology