From: Charles <>
Subject: Re: [DNA] Genetic Distance calculation -- which method is best
Date: Fri, 21 Nov 2003 21:50:44 -0500
References: <3FBE77A0.email@example.com> <REME20031121211843@alum.mit.edu>
Since Dr. Bruce Walsh is not subscribed to this list, did you also send
your reply to him via his email address posted at the end of his
message? Bruce told me he was going to contact you via private email
and he also invited anyone wishing to debate his method of calculating
Genetic Distance to contact him directly. Here is his email address.
As a genealogist and a Surname Project Coordinator with FamilyTreeDNA.com
for several DNA projects, all I want is for the estimates of Genetic
Distance for the Y chromosome test in the project coordinator pages on
the FamilyTreeDNA website to be as accurate as possible given the
knowledge we currently have regarding the mutation process and mutation
rates. Possibly your input to FamilyTreeDNA via a dialog directly with
Dr. Bruce Walsh would be helpful to that end.
> Bruce wrote:
>>My point is that, to do so we need to follow a formal probability model,
>>otherwise our intuition can be misleading.
> Just so. Beware of hidden assumptions in the formal probability model.
>>John's point is correct in that, GIVEN we know how many mutations have
>>occurred, if 2 have occurred then a match or an off-by two are equally
>>likely. However, the problem is not this, but rather the opposite: given the
>>observed state (say an observed difference of two), how many actual
>>mutations have occurred?
> This is a fair statement of the problem, but this is also where we
> part company. Bruce takes this statement of the problem as a license
> to impose prior knowledge of the distribution of the actual distance
> in time between the two test subjects. The knowledge he chooses to
> impose is simply that the two test subjects are in fact closely
> related. Given that knowledge, it should come as no surprise that his
> calculation results in a small genetic distance. (To be fair, I must
> point out that "close" in his terms means that the expected number of
> mutations is less than one for each locus.) All I can say is that I
> don't have the advantage of knowing in advance that any two people are
> closely related, even if they happen to have the same surname. If the
> genealogical research suggests that they are close, but the DNA
> testing shows them to be surprisingly far apart, then I have to take
> into account the possibility of a non-paternal event.
> To put it another way: I have no quarrel with Bruce's formula as a
> formula, but it presumes to know in advance the thing that we are all
> trying to discover, namely, the closeness of relationship between two
> test subjects. In fact, as he points out, by extending the summation
> to larger and larger allowable numbers of generations, he can get as
> large a genetic distance as he pleases. The fact that he chooses to
> get 2.1 is just the result of his own arbitrary choice.
> Here is a description that shows why the difference is not the
> quantity to consider. Unfortunately, it involves probability
> theory, and so some readers may not be prepared to wade through
> it. Still, it has the advantage of putting my assumptions out
> in the open. (If you want to skip over it, just look for the next
> quoted passage.)
> 1. The outcome of each mutation opportunity is an independent, random
> event with possible values of +1, 0, and -1. I assume +1 and -1 are
> equally likely (because we are concerned with the difference between
> two people).
> 2. After some generations have elapsed, the outcome is just the
> sum of the individual outcomes; the difference between the two
> people is, in turn, the sum of the one outcome and the negative
> of the other. (I phrase it in this odd way because of the next step.)
> 3. Two simple theorems from probability: the variance of the negative
> of a random variable is equal to the variance of the variable itself;
> and the variance of the sum of two or more random variables is equal
> to the sum of the variances of the individual variables.
> 4. This means that the expectation of the variance of the offset
> (considered as a random variable) between two persons grows linearly
> with time from the point of departure from their MRCA. In other
> words, the genetic distance (counted as the number of generations) is
> proportional to the expected variance.
> 5. The only concrete estimate we have of the expected variance is the
> actually observed variance. (This is where I make my shakiest
> assumption -- since we have only one measurement of the variance,
> there is no guarantee that it even comes close to the expected value.
> Still, it avoids the necessity of circular reasoning.)
> 6. By the way, the variance is the square of the difference.
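The claim in steps 1 through 4 can be checked numerically. The following is only a rough sketch, not anything John or Bruce posted: it simulates two lineages descending from a common ancestor at a single locus, with each generation either leaving the count alone or moving it by +1 or -1 with equal probability. The function names and the per-generation mutation rate of 0.05 are assumptions chosen for illustration (the rate is deliberately far higher than any real Y-STR rate so that the effect shows up in a short run).

```python
import random


def lineage_offset(generations, mutation_rate, rng):
    """Net change at one locus along a single lineage (steps 1 and 2)."""
    offset = 0
    for _ in range(generations):
        # Each generation is one mutation opportunity: +1, 0, or -1.
        if rng.random() < mutation_rate:
            offset += rng.choice((1, -1))
    return offset


def mean_squared_difference(generations, mutation_rate, trials=5000, seed=1):
    """Average squared difference between two lineages from one MRCA."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = lineage_offset(generations, mutation_rate, rng)
        b = lineage_offset(generations, mutation_rate, rng)
        # The difference is one outcome plus the negative of the other.
        total += (a - b) ** 2
    return total / trials


if __name__ == "__main__":
    rate = 0.05  # illustrative assumption, not a measured rate
    for g in (25, 50, 100):
        # Theory (steps 3 and 4): variance of the difference = 2 * g * rate.
        print(g, mean_squared_difference(g, rate), 2 * g * rate)
```

Under these assumptions the simulated mean squared difference should track 2 * generations * rate, which is the linear growth that step 4 describes. It says nothing about how far a single observed squared difference can stray from that expectation, which is the weakness already conceded in step 5.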
>>A simple example can make the case: Suppose very few generations have
>>passed, but we still see a two-step difference. It is FAR more likely that
>>only two mutations have occurred (both in the plus direction) than the
>>much more unlikely event of four mutations.
> Bruce is explicit here. He comes right out and asserts his assumption
> that very few generations have passed. What he is glossing over is
> the corollary that (by his assumption) even just TWO mutations are
> very unlikely. In other words, the case he is really building is that
> this is probably a LAB ERROR. (Actually, although we have excluded
> multi-step mutations from consideration for the sake of argument, the
> object of this exercise is to find a description that approximates
> reality. Therefore, the answer of choice in this case would be a
> two-step mutation.)
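To make the disagreement concrete, here is a hedged sketch of how the answer to "how many mutations actually occurred?" depends on what is assumed beforehand. It is not Bruce's formula, which is not reproduced in this thread. It assumes single-step mutations only, +1 and -1 equally likely, and a Poisson prior on the total number of mutations at the locus whose mean (`expected_mutations`) is the assumed closeness of the relationship. All names are made up for illustration.

```python
from math import comb, exp, factorial


def likelihood(observed, k):
    """P(net offset == observed | exactly k single-step mutations)."""
    if k < abs(observed) or (k - observed) % 2:
        return 0.0  # too few mutations, or wrong parity
    plus_steps = (k + observed) // 2
    return comb(k, plus_steps) / 2 ** k


def posterior(observed, expected_mutations, max_k=60):
    """Posterior over the number of mutations, given a Poisson prior.

    The Poisson prior and its mean are assumptions of this sketch.
    """
    prior = [
        exp(-expected_mutations) * expected_mutations ** k / factorial(k)
        for k in range(max_k + 1)
    ]
    joint = [p * likelihood(observed, k) for k, p in enumerate(prior)]
    total = sum(joint)
    return [j / total for j in joint]


def posterior_mean(observed, expected_mutations):
    """Expected number of actual mutations given the observed offset."""
    return sum(k * p for k, p in enumerate(posterior(observed, expected_mutations)))


if __name__ == "__main__":
    for prior_mean in (0.1, 1.0, 5.0):
        print(prior_mean, posterior_mean(2, prior_mean))
```

In this toy model the likelihood of a +2 offset is the same for two mutations as for four (0.25 either way), so the prior alone decides between them. A small prior mean pins the estimate just above 2, and a larger one pushes it up without limit. Both sides of the argument are visible here: the calculation is sound as a formula, and its result is driven by the closeness assumed going in.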
> As I have said many times before, and as I'm sure Bruce would agree,
> the case of a 24/25 match with a two-step difference on the 25th
> marker is special. The fact that all the other loci give a hint of a
> close relationship does indeed support Bruce's assumption. And
> reality does intervene to support the notion that the difference
> should perhaps be viewed as a two-step mutation. On the other hand,
> let's look at the case that we were talking about just yesterday: the
> match was 19/25, and the differences were 2, 2, 2, 1, 1, and 1. Under
> the circumstances, it would be absurd to assume the two individuals
> are closely related. The sum of the squares is the only reasonable
> approximation to the genetic distance in this case.
> John Chandler
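For the 19/25 example discussed above, the two ways of counting can be put side by side. This is just the arithmetic, written as a short sketch. The function names are hypothetical, and the list fills in the 19 matching markers as zeros.

```python
def sum_of_squares_distance(differences):
    """Sum of squared per-marker differences."""
    return sum(d * d for d in differences)


def step_count_distance(differences):
    """Sum of absolute per-marker differences (plain step counting)."""
    return sum(abs(d) for d in differences)


# 19/25 match: six markers differ by 2, 2, 2, 1, 1, 1; the rest match.
differences = [2, 2, 2, 1, 1, 1] + [0] * 19

if __name__ == "__main__":
    print(sum_of_squares_distance(differences))  # 4+4+4+1+1+1 = 15
    print(step_count_distance(differences))      # 2+2+2+1+1+1 = 9
```

The two measures agree whenever every difference is a single step, and diverge only when some marker is off by two or more, which is exactly the case under debate.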