**GENEALOGY-DNA-L Archives**

From: Charles <>

Subject: [DNA] Genetic Distance calculation -- message from Bruce Walsh. He asked me to post it to the list

Date: Fri, 21 Nov 2003 15:37:52 -0500

All:

Bruce asked me to post this message to the list. I invited him to briefly join the list and debate the subject here directly, rather than have me in the middle. I hope he does. He also said that anyone could email him directly. His email address is at the end of his message.

I'm a genealogist, not a math titan. :-) All I want is to get it settled whether the method the FamilyTreeDNA website uses to calculate Genetic Distance is correct, or whether the method John Chandler advocates is more correct, in which case FamilyTreeDNA should correct the algorithm used to calculate Genetic Distance on the website pages used by surname project group coordinators.

Charles

Here is the message from Bruce Walsh:

------------------

Charles:

Could you please post this on the appropriate list? Many thanks.

bruce

"Make things as simple as possible --- but no simpler" Einstein

The following is a little technical in places. However, the ideas are very straightforward.

John Chandler has raised some interesting points recently on this list, some of which I agree with, others not.

First, John is correct in that what we are trying to count is the total number of actual mutations. This plus the mutation rate sets the time. Hence, we are always trying to estimate the actual number of mutations given some observed difference in marker score.

My point is that, to do so, we need to follow a formal probability model; otherwise our intuition can be misleading. John replies:

>>>>Unfortunately, that result is nonsense. The example is, in fact, simple enough to explain to the whole list and requires no Bessel functions. Consider two individuals who have between them actually experienced exactly two mutations relative to a common ancestor. .. Therefore, we have two equally likely cases: either the two mutations canceled each other out, giving an observed difference of 0, or the two mutations reinforced, giving an observed difference of 2.<<

John's point is correct in that, GIVEN we know how many mutations have occurred, if 2 have occurred then a match and an off-by-two are equally likely. However, the problem is not this, but rather the opposite: given an observed state (say an observed difference of two), how many actual mutations have occurred? This is a standard Markov-chain problem, a type of model commonly used in probability for modeling all sorts of things. For example, the probability that two alleles are off by 2k steps given that a total of 2M mutations have occurred is just

Pr(2k | 2M) = 2 (1/2)^(2M) Bi(2M, M-k)

where Bi(N,k) = N!/[ (N-k)! k! ] is the binomial coefficient and n! = n*(n-1)*(n-2)* .. * 1 is the factorial of n. This form holds for k >= 1; for an exact match (k = 0) the leading factor of 2 is dropped. Note that this recovers John's result when M = 1 (i.e., 2 mutations).
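As a quick check on the formula above, here is a minimal Python sketch. The function name `pr_diff_given_mutations` is made up for illustration, and the code assumes the symmetric one-step model (each mutation is +1 or -1 with equal probability). It also assumes the leading factor of 2 applies only when k >= 1, since an exact match can arise in only one way.

```python
from math import comb

def pr_diff_given_mutations(k, M):
    """Pr(observed difference = 2k | 2M mutations), symmetric one-step model.

    Assumption: every mutation is a single step up or down, each with
    probability 1/2, and only the absolute difference is observed.
    """
    if k < 0 or k > M:
        return 0.0                      # cannot be off by more steps than mutations
    ways = comb(2 * M, M - k)           # Bi(2M, M-k)
    factor = 1 if k == 0 else 2         # +2k and -2k both count as "off by 2k"
    return factor * ways / 2 ** (2 * M)

if __name__ == "__main__":
    # Two mutations (M = 1): a match and an off-by-two
    print(pr_diff_given_mutations(0, 1), pr_diff_given_mutations(1, 1))
```

With M = 1 both calls return 0.5, which is the "two equally likely cases" in John's example.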

However, the probability of seeing two alleles off by 2k after t generations depends on both this probability and the mutation rate, as

Pr(2k | t) = sum (over M) Pr(2k | 2M) * Pr(2M | t)

The probability of a total of 2M mutations in t generations is Poisson-distributed,

Pr(2M | t) = Exp(-2ut) (2ut)^(2M) / (2M)!

where u is the mutation rate per generation, so that 2ut is the expected number of mutations along the two lineages. The resulting infinite sum of the product of these two probabilities over all appropriate values of 2M is 2 Exp(-2ut) Bess(2k, 2ut), where Bess(k,x) is the value of the kth-order type I (modified) Bessel function evaluated at x.

Hence, Bessel functions arise from summing the appropriate series.
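The sum over M can also be evaluated numerically without calling any Bessel routine. The sketch below is illustrative only: `pr_diff_given_time` is a hypothetical name, the series is simply truncated after a fixed number of terms, and the k = 0 case is assumed to drop the leading factor of 2. Each term of the product Pr(2k | 2M) * Pr(2M | t) simplifies to Exp(-2ut) (ut)^(2M) / [ (M-k)! (M+k)! ], which is what the loop accumulates.

```python
import math

def pr_diff_given_time(k, two_ut, terms=200):
    """Pr(observed difference = 2k | t), summed directly over M >= k.

    two_ut is the product 2*u*t. Assumes the symmetric one-step model and
    a Poisson number of mutations with mean 2ut.
    """
    x = two_ut / 2.0                              # x = u*t
    factor = 1.0 if k == 0 else 2.0               # assumed: no doubling for a match
    term = x ** (2 * k) / math.factorial(2 * k)   # series term at M = k
    total = 0.0
    for M in range(k, k + terms):
        total += term
        # ratio of successive terms; avoids huge powers and factorials
        term *= x * x / ((M + 1 - k) * (M + 1 + k))
    return factor * math.exp(-two_ut) * total

if __name__ == "__main__":
    two_ut = 0.8                                  # illustrative value only
    for k in range(3):
        print(2 * k, pr_diff_given_time(k, two_ut))
```

One sanity check that does not rely on Bessel functions: summing these probabilities over all even differences should give the probability that the total number of mutations is even, which for a Poisson count with mean 2ut is (1 + Exp(-4ut))/2.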

The probability of interest is Pr(2M | 2k) --- given we see a difference of 2k, what is the probability that 2M mutations have actually occurred?

My impression of what others have said is that John wishes to argue that the expected number of mutations for individuals off by two steps is closer to four than to two. What John is correctly doing is trying to count the actual number of mutations. The problem with his squaring logic is that the math is different.

A simple example can make the case: Suppose very few generations have passed, but we still see a two-step difference. It is FAR more likely that only two mutations have occurred (both in the same direction) than that the much more unlikely event of four mutations has occurred. To formally compute these, follow Bayes' theorem for conditional probability:

Pr(2M | 2k, t) = Pr(2k | 2M, t) * Pr(2M | t) / Pr(2k | t)

We have values for all of the expressions above. Just plug them in.

For "small" values of t (relative to the mutation rate), i.e., the time scale of 50-200 generations, the expected (average) number of mutations, 2M, given that we observe a two-step change is roughly 2.1. However, for very large values of t, say 2ut = 4, the expected value is closer to 4. For very, very large values of t, say 2ut = N >> 1, the expected value is closer to N.
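The "plug them in" step can be sketched in a few lines of Python. This is an illustration under the one-step model, not anyone's production code: the names `posterior_weights` and `expected_mutations` are made up, and the value 2ut = 0.8 is an assumed example (it would correspond to, say, u = 0.002 and t = 200, neither of which is given in the message). The sketch uses the fact that Pr(2k | 2M) * Pr(2M | t) is proportional to (ut)^(2M) / [ (M-k)! (M+k)! ], so every factor that does not depend on M cancels in Bayes' theorem.

```python
import math

def posterior_weights(k, two_ut, terms=200):
    """Unnormalised Pr(2M | 2k, t) for M = k, k+1, ... (requires two_ut > 0)."""
    x = two_ut / 2.0                              # x = u*t
    w = x ** (2 * k) / math.factorial(2 * k)      # weight at M = k
    weights = []
    for M in range(k, k + terms):
        weights.append((M, w))
        # ratio of successive weights; avoids huge powers and factorials
        w *= x * x / ((M + 1 - k) * (M + 1 + k))
    return weights

def expected_mutations(k, two_ut):
    """Posterior mean of the number of mutations 2M given a difference of 2k."""
    weights = posterior_weights(k, two_ut)
    total = sum(w for _, w in weights)
    return sum(2 * M * w for M, w in weights) / total

if __name__ == "__main__":
    for two_ut in (0.8, 4.0, 20.0):               # illustrative values only
        print(two_ut, round(expected_mutations(1, two_ut), 2))
```

For a two-step difference (k = 1) this gives about 2.1 at 2ut = 0.8 and about 4.1 at 2ut = 4, and the value tracks 2ut as 2ut grows, in line with the figures quoted above.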

It's important to close by stating that John and I agree more than we disagree. In particular:

"There is, however, more to the story. The stepwise model does NOT give a correct picture because it doesn't allow for two-step mutations"

I completely agree with this.

The good news is that the above analysis can be extended when we have good estimates of the actual mutation rates.

Likewise, John is trying to do the correct (and very smart) thing, which is to somehow count the actual number of mutations. The problem is that his metric is not appropriate, as it does not suitably count things.

Cheers

Bruce Walsh

Associate Professor and Associate Department Head

(Associate Editor, Genetics)

Department of Ecology and Evolutionary Biology

email:
