Subject: Re: [DNA] Genetic Distance calculation -- message from Bruce Walsh. He asked me to post it to the list
Date: Fri, 21 Nov 2003 21:18:51 -0500 (EST)
In-Reply-To: <3FBE77A0.email@example.com> (message from Charles on Fri, 21 Nov 2003 15:37:52 -0500)
> My point is that, to do so we need to follow a formal probability model,
> otherwise our intuition can be misleading.
Just so. Beware of hidden assumptions in the formal probability model.
> John's point is correct in that, GIVEN we know how many mutations have
> occurred, if 2 have occurred then a match or an off-by two are equally
> likely. However, the problem is not this, but rather the opposite:
> given an observed state (say an observed difference of two), how many
> actual mutations have occurred?
This is a fair statement of the problem, but this is also where we
part company. Bruce takes this statement of the problem as a license
to impose prior knowledge of the distribution of the actual distance
in time between the two test subjects. The knowledge he chooses to
impose is simply that the two test subjects are in fact closely
related. Given that knowledge, it should come as no surprise that his
calculation results in a small genetic distance. (To be fair, I must
point out that "close" in his terms means that the expected number of
mutations is less than one for each locus.) All I can say is that I
don't have the advantage of knowing in advance that any two people are
closely related, even if they happen to have the same surname. If the
genealogical research suggests that they are close, but the DNA
testing shows them to be surprisingly far apart, then I have to take
into account the possibility of a non-paternal event.
To put it another way: I have no quarrel with Bruce's formula as a
formula, but it presumes to know in advance the thing that we are all
trying to discover, namely, the closeness of relationship between two
test subjects. In fact, as he points out, by extending the summation
to larger and larger allowable numbers of generations, he can get as
large a genetic distance as he pleases. The fact that he chooses to
get 2.1 is just the result of his own arbitrary choice.
Here is a description that shows why the difference is not the
quantity to consider. Unfortunately, it involves probability
theory, and so some readers may not be prepared to wade through
it. Still, it has the advantage of putting my assumptions out
in the open. (If you want to skip over it, just look for the next
1. The outcome of each mutation opportunity is an independent, random
event with possible values of +1, 0, and -1. I assume +1 and -1 are
equally likely (because we are concerned with the difference between
2. After some generations have elapsed, the outcome is just the
sum of the individual outcomes; the difference between the two
people is, in turn, the sum of the one outcome and the negative
of the other. (I phrase it in this odd way because of the next
3. Two simple theorems from probability: the variance of the negative
of a random variable is equal to the variance of the variable itself;
and the variance of the sum of two or more random variables is equal
to the sum of the variances of the individual variables.
4. This means that the variance of the offset (considered as a random
variable) between two persons grows linearly with the time elapsed
since their MRCA. In other words, the genetic distance (counted as
the number of generations) is proportional to the expected variance.
5. The only concrete estimate we have of the expected variance is the
actually observed variance. (This is where I make my shakiest
assumption -- since we have only one measurement of the variance,
there is no guarantee that it even comes close to the expected value.
Still, it avoids the necessity of circular reasoning.)
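6. By the way, since the expected offset is zero (by the symmetry
assumed in step 1), the observed variance at a single locus is just
the square of the observed difference.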
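
For anyone who would rather check steps 1 through 4 numerically than
take the theorems on faith, here is a minimal Python sketch. It is
only an illustration under stated assumptions: one mutation
opportunity per generation on each lineage, a mutation probability
mu = 0.004 that is purely hypothetical, and function names invented
for this example. It builds the exact distribution of the offset and
shows that its variance is proportional to the number of generations.

```python
from collections import defaultdict


def step_distribution(mu):
    """One mutation opportunity: +1 or -1 with probability mu/2 each, else 0."""
    return {-1: mu / 2, 0: 1 - mu, +1: mu / 2}


def convolve(p, q):
    """Distribution of the sum of two independent integer-valued variables."""
    out = defaultdict(float)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] += pa * qb
    return dict(out)


def offset_distribution(mu, generations):
    """Offset between two people, each `generations` removed from the MRCA.

    The offset is one lineage's outcome plus the negative of the other's.
    Because each step is symmetric, negating it leaves its distribution
    unchanged, so the offset is a sum of 2 * generations identical steps.
    """
    step = step_distribution(mu)
    dist = {0: 1.0}
    for _ in range(2 * generations):
        dist = convolve(dist, step)
    return dist


def variance(dist):
    """Variance of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in dist.items())
    return sum((k - mean) ** 2 * p for k, p in dist.items())


if __name__ == "__main__":
    mu = 0.004  # hypothetical per-generation mutation probability
    for g in (10, 50, 100, 200):
        v = variance(offset_distribution(mu, g))
        # The ratio v / g should be the same on every line (2 * mu).
        print(g, v, v / g)
```

Under these assumptions the variance comes out as 2 * mu * g, so
dividing an observed variance by 2 * mu would give a rough estimate
of the number of generations back to the MRCA.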
> A simple example can make the case: Suppose very few generations have
> passed, but we still see a two-step difference. It is FAR more likely that
> only two mutations have occurred (both in the plus direction) than the
> much more unlikely event of four mutations.
Bruce is explicit here. He comes right out and asserts his assumption
that very few generations have passed. What he is glossing over is
the corollary that (by his assumption) even just TWO mutations are
very unlikely. In other words, the case he is really building is that
this is probably a LAB ERROR. (Actually, although we have excluded
multi-step mutations from consideration for the sake of argument, the
object of this exercise is to find a description that approximates
reality. Therefore, the answer of choice in this case would be a
As I have said many times before, and as I'm sure Bruce would agree,
the case of a 24/25 match with a two-step difference on the 25th
marker is special. The fact that all the other loci give a hint of a
close relationship does indeed support Bruce's assumption. And
reality does intervene to support the notion that the difference
should perhaps be viewed as a two-step mutation. On the other hand,
let's look at the case that we were talking about just yesterday: the
match was 19/25, and the differences were 2, 2, 2, 1, 1, and 1. Under
the circumstances, it would be absurd to assume the two individuals
are closely related. The sum of the squares of the differences is the
only reasonable approximation to the genetic distance in this case.
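
To make the arithmetic concrete, here is a small Python sketch that
works out the sum of the squares for the two cases just mentioned.
The function names are made up for this illustration, and the plain
step count is included only for comparison with the sum of squares.

```python
def sum_of_squares(differences):
    """Sum of the squared per-locus differences (the observed variance, summed)."""
    return sum(d * d for d in differences)


def sum_of_steps(differences):
    """Plain step count: sum of the absolute per-locus differences."""
    return sum(abs(d) for d in differences)


# The 19/25 match: six mismatched loci with differences 2, 2, 2, 1, 1, 1.
case_19_of_25 = [2, 2, 2, 1, 1, 1]
# The 24/25 match: a single two-step difference on the 25th marker.
case_24_of_25 = [2]

print(sum_of_squares(case_19_of_25))  # 15
print(sum_of_steps(case_19_of_25))    # 9
print(sum_of_squares(case_24_of_25))  # 4
print(sum_of_steps(case_24_of_25))    # 2
```

Matching loci contribute zero to either sum, so only the mismatched
markers need to be listed.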