GENEALOGY-DNA-L Archives


From: Ann Turner
Subject: [DNA] Provenance of a DNA segment (importance of phased haplotypes)
Date: Mon, 1 Nov 2010 07:30:22 -0700


In another thread, I wrote:

On Fri, Oct 29, 2010 at 6:47 AM, Ann Turner wrote:
> I suspect many if not most of the segments at the 5th cousin level come
> from the general gene pool, not the one ancestral couple that has been
> identified. For example, I have in hand one set of results for a cousin who
> matches a parent and child. Here's a list of segments found in the child,
> and whether or not they are found in the parent.
>
> Y 8.95
> Y 2.75
> Y 2.60
>
> N 5.01
> N 4.50
> N 3.82
> N 3.20
> N 2.72
> N 2.60
> N 2.29
> N 2.24
> N 1.25


For the sake of discussion, let's specify that the parent is the mother. The
cousin naturally wanted to know if he might be related on the father's side,
even if a match wasn't displayed because of the absence of a large DNA
segment. [Empirically, the minimum segment size for Family Finder seems to
be about 7.7 cM. If anyone has cases where the longest block is shorter than
that, I'd be interested in learning those numbers.]

Since I had data for the father/mother/child trio, I thought it might be
interesting to use phased data for comparing the child (Jr) to his cousin.
This is a bit
tedious to do with the tools I have at hand, so I just looked at the longest
segment where the provenance was not the mother. This 5.01 cM segment would
also be displayed at 23andMe's Relative Finder if there is another segment
at least 7 cM long.

Using genotype data (where we don't know which allele came from which
parent), this 5.01 cM segment had five fragments, broken up by four isolated
mismatches (the two parties are opposite homozygotes, e.g. AA and CC).
Family Finder tolerates occasional mismatches, which could be due to
genotyping error or microdeletions.
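
As a rough illustration of how isolated mismatches break a genotype run into fragments, here is a minimal Python sketch. The function names and the toy genotypes are my own inventions for illustration. This is not Family Finder's algorithm, which also tolerates occasional mismatches. No-calls (NN) are assumed never to count as a mismatch.

```python
def is_opposite_homozygote(g1, g2):
    """True when both genotypes are homozygous for different alleles (AA vs CC)."""
    if "N" in g1 + g2:
        return False  # assumption: no-calls never count as a mismatch
    return g1[0] == g1[1] and g2[0] == g2[1] and g1[0] != g2[0]


def split_into_fragments(genos_a, genos_b):
    """Return the lengths (in SNPs) of the runs between opposite homozygotes."""
    fragments = []
    run = 0
    for g1, g2 in zip(genos_a, genos_b):
        if is_opposite_homozygote(g1, g2):
            if run:
                fragments.append(run)
            run = 0
        else:
            run += 1
    if run:
        fragments.append(run)
    return fragments


# Toy data (invented): one opposite homozygote at the fourth SNP.
person_a = ["AA", "AG", "CC", "GG", "AT", "TT"]
person_b = ["AA", "GG", "CC", "AA", "AT", "TT"]
print(split_into_fragments(person_a, person_b))
```

With four isolated mismatches inside a segment, a function like this would report five fragments, as in the 5.01 cM example.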

When I made a pseudo-genotype using just the alleles from one parent or the
other, the 5.01 cM segment was broken up into 23 fragments on the father's
side and 40 fragments on the mother's side. It just happened that sometimes
Jr's paternal alleles didn't match the cousin, but the maternal alleles
*appeared* to fill in the gaps and created a longer segment. The cousin's
data was not phased, so the true matching segments are probably even more
fragmented than these counts suggest.
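
The pseudo-genotype comparison can be sketched in Python as follows. I am assuming here that a pseudo-genotype simply doubles the single allele inherited from one parent (A becomes AA), and that a break is an opposite homozygote against the cousin's unphased genotype. The alleles below are invented toy data, not the real results, and the function names are hypothetical.

```python
def count_fragments(genos_a, genos_b):
    """Count runs of SNPs separated by opposite-homozygote mismatches."""
    count, in_run = 0, False
    for g1, g2 in zip(genos_a, genos_b):
        clash = ("N" not in g1 + g2 and g1[0] == g1[1]
                 and g2[0] == g2[1] and g1[0] != g2[0])
        if clash:
            in_run = False
        elif not in_run:
            count += 1
            in_run = True
    return count


def pseudo_genotype(haplotype):
    """Assumption: double each phased allele, e.g. 'A' -> 'AA'."""
    return [allele + allele for allele in haplotype]


# Invented toy data: Jr's phased alleles and the cousin's unphased genotypes.
paternal = "GAACTGG"
maternal = "GGGTTAA"
cousin = ["GG", "AA", "GG", "CC", "TT", "AA", "GG"]

# Jr's ordinary (unphased) genotype combines one allele from each parent.
unphased = [p + m for p, m in zip(paternal, maternal)]

print("unphased:", count_fragments(unphased, cousin))
print("paternal:", count_fragments(pseudo_genotype(paternal), cousin))
print("maternal:", count_fragments(pseudo_genotype(maternal), cousin))
```

In this toy case the unphased genotype shows one unbroken run, while each parental side alone is broken into three fragments: wherever one parent's allele fails to match, the other parent's allele happens to cover the gap.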

Phased haplotypes are not feasible for many people, at least at this stage
of the game where father/mother/child trios are the easiest approach.
Genotype data is a decent fall-back, but it can be quite noisy. I think the
take-home lesson here is that we must allow ample opportunity for opposite
homozygotes to crop up and "spoil" a long continuous run of SNPs. If we want
to be confident about the provenance of a segment, we need to set higher
thresholds for number of SNPs and cM. Those small segments may be even
smaller and less significant than we realize. Of course, this is just one
example (the first one I looked at, though, and I suspect it is not an
isolated case).

Ann Turner

P.S. I used a phasing tool developed by Alex Bisignano:

http://www.chromosomechronicles.com/2009/09/30/use-family-snp-data-to-phase-your-own-genome/

It was not designed for the FTDNA file format, but you can tweak the raw
download OK:

1) Load the CSV file into Excel 2007.

2) Replace the no-calls (triple dashes) with NN.

3) Replace any deletions with D. -A will become DA, -C will become DC, -G
will become DG, -T will become DT, and -- (double dash) will become DD.
This is needed because Excel seems to treat values starting with a dash as
formulas.

4) Save the file as txt (tab-delimited).
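
For anyone who would rather not do the replacements by hand, the four steps above can be sketched as a small Python script. This is only a sketch under assumptions: I am assuming the genotype is the last column of each row, and the function names are hypothetical. Check your own download before relying on it.

```python
import csv
import io

# Recoding from steps 2 and 3: no-calls become NN, deletions become D.
RECODE = {"---": "NN", "--": "DD",
          "-A": "DA", "-C": "DC", "-G": "DG", "-T": "DT"}


def recode_genotype(value):
    """Return the recoded genotype, or the value unchanged if no rule applies."""
    return RECODE.get(value, value)


def convert(src, dst):
    """Read CSV rows from src and write tab-delimited rows to dst."""
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        if row:
            # Assumption: the genotype is the last column of each row.
            row[-1] = recode_genotype(row[-1].strip())
        writer.writerow(row)


# Small in-memory demonstration with invented rows.
sample = io.StringIO("rs1,1,100,AG\nrs2,1,200,---\nrs3,1,300,-A\n")
out = io.StringIO()
convert(sample, out)
print(out.getvalue())
```

To convert a real file, open the input and output with `open(..., newline="")` and pass the two file objects to `convert`.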

The output file will use NN for ambiguous haplotypes (if all three parties
are heterozygous, you can't tell which parent contributed which allele). NN
thus becomes a universal match and doesn't spoil a long continuous run.
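
The rule for ambiguous haplotypes can be illustrated with a minimal Python sketch of trio phasing at a single SNP. This is my own simplified version of the logic, not the code of the tool mentioned above. It conservatively returns NN whenever a no-call or an inconsistency leaves a heterozygous child unresolved, and it does not check homozygous children against the parents.

```python
def phase_snp(child, father, mother):
    """Return (paternal_allele, maternal_allele), or ('N', 'N') if ambiguous."""
    a, b = child[0], child[1]
    if "N" in child:
        return ("N", "N")
    if a == b:
        return (a, b)  # homozygous child: each parent gave the same allele
    # Try both assignments of the child's two alleles to the two parents.
    options = [(p, m) for p, m in ((a, b), (b, a))
               if p in father and m in mother]
    # Exactly one consistent assignment means the SNP is phased.
    return options[0] if len(options) == 1 else ("N", "N")


print(phase_snp("AG", "AA", "GG"))  # father can only have given A
print(phase_snp("AG", "AG", "AG"))  # all three heterozygous: ambiguous
```

When all three parties are heterozygous, both assignments are consistent, so the function cannot choose and reports NN.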

