FT2DNA
by Dave Hamm, Franklin, Ohio - copyright 2006
Conversion of Family Tree DNA 'repeat' data format into 'ATGC' format.
Designed to convert FTDNA data to ATGC format for analysis by other utilities.
QUICK START:
From a Windows command line window (Start/run/cmd):
<DYS_program_name> <# of repeats>
As in:
DYS439 11
or,
DYS439 11 >DYS439_output.txt
OVERVIEW:
Overview for use with PHYLIP "DNAPARS," "CONSENSE," etc.
FT2DNA is a set of programs that convert individual DYS repeat values given by FTDNA into "ATGC" format for use with the PHYLIP suite of programs, or any other program requiring the ATGC structure.
Not for the faint of heart, this procedure takes some time. Be prepared for a grueling experience. The first run through may be fairly frustrating for the average genealogist.
At this time, I do not have a GUI interface for these programs. Nor have I written a comprehensive program that would accept more than one argument. So, for 37 markers, you will need to run the 37 different routines to get the data converted, for each individual kit. If you have 300 participants in your surname project, this will take some time. (You may want to recruit some of your participants for the data conversion procedure.) Obviously, another feature lacking is that no markers beyond the first 37 FTDNA markers are translated here.
Finally, I have not included the data format from other vendors who may test different DYS values than does FTDNA.
The steps to make the output useful involve:
1) Run FT2DNA on each individual DYS value for each individual DNA
participant.
2) Run the PHYLIP package program "SEQBOOT."
3) Run another PHYLIP package program, such as "DNAPARS," for example.
4) Run the PHYLIP package program, "CONSENSE."
Once you have the "SEQBOOT" output, then you can run several other PHYLIP programs against the data, such as DNAPARS, DNADIST, KITSCH, CONTML, etc.
Once you have the CONSENSE output, then you can run more PHYLIP package programs, such as TREEDIST.
As for PHYLIP, another example (as derived from Felsenstein’s documentation) would be:
1) Run FT2DNA on each individual DYS value for each individual DNA participant.
2) Run the PHYLIP package program "SEQBOOT."
3) Run the PHYLIP package program "DNADIST."
4) Run the PHYLIP package program "KITSCH."
5) Run the PHYLIP package program "CONSENSE."
This would be an alternative to using Dean McGee's Y-DNA Utility for Kitsch. Obviously, Dean McGee's Utility would save a lot of time in deriving a Kitsch graph.
An example to compute differences, as given by Felsenstein would be:
1) Run FT2DNA on each individual DYS value for each individual DNA participant.
2) Run the PHYLIP package program "SEQBOOT."
3) Run GENDIST to compute the Nei (1972) genetic distance.
4) Run CONTML
5) Run KITSCH to make a tree.
With CONTML, the units of length in the output are amounts of expected accumulated variance (and NOT time).
----------------------------------------------------------------------------------
I have designed the FTDNA translations into ATGC format to enable me to run the programs within the PHYLIP package. Later, I plan to explore any number of other software packages; the idea here is simply to get started with the basic conversion.
The PHYLIP web site is located at:
http://evolution.genetics.washington.edu/phylip.html
If you want to move away from the PHYLIP packages, you can use SEQBOOT ("J" option for rewrite) to convert the data into NEXUS format. The MEGA program can then convert this data into FASTA format, and so on.
RUNNING THE FT2DNA PROGRAMS:
Each program is designed to convert one STR at a time. For example, the "DYS439.exe" program will only convert the repeat values represented by DYS439.
Also, the programs do NOT have a GUI interface, so you need to run them from a
command line window.
To start a "command line window" from Microsoft Windows:
- click on "START"
- click on "RUN"
- type in "cmd" (without the quotes)
- click on "OK"
A command line window should appear.
Then, you need to "cd" to change directory to where the conversion routines are located.
cd /D E:\Data\DNA
when done, you can return to your default directory with:
cd /D C:
To create the converted values for DYS439, then, run the program from a command line
interface, and provide the number of repeats as an argument:
<DYS_program_name> <# of repeats>
As in:
DYS439 11
To capture the output from the command line window, re-direct the output using the ">"
re-direction symbol, as in:
DYS439 11 >DYS439_output.txt
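Until a comprehensive program exists, the per-marker runs for one kit can be scripted. What follows is a minimal Python sketch and not part of FT2DNA. The function name "convert_kit", the output file naming, and the "run" parameter are assumptions made for illustration. It also assumes every converter is named after its marker (as DYS439.exe is) and takes the repeat count as its only argument.

```python
import subprocess
from pathlib import Path


def convert_kit(kit_id, repeats_by_marker, program_dir=".", out_dir=".",
                run=subprocess.run):
    """Run the per-marker converter once for each marker of one kit.

    repeats_by_marker maps a marker name (e.g. "DYS439") to its repeat
    count. Returns the list of output files that were written.
    """
    out_dir = Path(out_dir)
    written = []
    for marker, repeats in repeats_by_marker.items():
        # Assumed naming: the program for DYS439 is called DYS439(.exe).
        program = str(Path(program_dir) / marker)
        out_path = out_dir / f"{kit_id}_{marker}_output.txt"
        with open(out_path, "w") as out:
            # Same effect as typing: DYS439 11 >40777_DYS439_output.txt
            run([program, str(repeats)], stdout=out, check=True)
        written.append(out_path)
    return written
```

A call such as convert_kit("40777", {"DYS439": 11}) would then write 40777_DYS439_output.txt in the current directory. Each marker still produces its own file, so the results still have to be assembled by hand.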
CONVERSION:
The format of the ATGC conversion data is:
[number of taxa] [number of data characters in shortest line]
[prefix][repeats][suffix]
- where:
[prefix] is from the Sorenson web page, and looks like (DYS439, for example): TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAA
The prefix does not repeat. However, the repeat value (given as "GATA") does repeat.
- The number of repeats is given by the FTDNA data.
For example, [repeats] for DYS439 is given by: GATA
And, if the FTDNA data gives "11" for DYS439, then the [repeats] string would look like this:
GATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATA
That is, "GATA" repeated 11 times.
[suffix] is from the Sorenson web page, and looks like this (DYS439, for example):
GAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC
so, for data entry into the PHYLIP "DNAPARS" program, the format looks like this:
1 45
MEMBER GAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC....
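The [prefix][repeats][suffix] rule can be sketched in a few lines of Python. This is an illustration only, not the FT2DNA source: the function name "build_sequence" is an assumption, and the rule as written covers only markers with a single repeat unit, such as DYS439. The prefix, repeat unit, and suffix are copied from the text above.

```python
# Flanking sequences for DYS439, copied from the text above.
DYS439_PREFIX = ("TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGA"
                 "TTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAAT"
                 "AGAA")
DYS439_REPEAT = "GATA"
DYS439_SUFFIX = "GAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC"


def build_sequence(prefix, repeat_unit, repeats, suffix):
    """Return prefix + repeat_unit repeated `repeats` times + suffix."""
    if repeats < 0:
        raise ValueError("repeat count must not be negative")
    return prefix + repeat_unit * repeats + suffix


sequence = build_sequence(DYS439_PREFIX, DYS439_REPEAT, 11, DYS439_SUFFIX)
print(len(sequence))  # 124 + 11*4 + 45 = 213
```

For 12 repeats the same call gives 217 characters, four more than for 11.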
----------------------------------------------------------------------------------
A word about the PHYLIP file formats -
PHYLIP uses two file formats, one is called "sequential," and the other "interleaved."
To show an example from the LAMARC software documentation, interleaved looks like
this:
3 30
Bob ATTGTCTACG
Frank ACTGTCTACG
Mark ACTGTCAACG
TTCCGTCTGGATATTGTGTG
TTACGTCTGGATATTGTGTC
TTCCGTCTGAATATTGTGTG
That is, the first lines of the data include the name or label, then the lines to follow
include the information for each label, in the same order.
A second data format is "sequential," and looks like this:
3 30
Bob ATTGTCTACG TTCCGTCTGGATATTGTGTG
Frank ACTGTCTACG TTACGTCTGGATATTGTGTC
Mark ACTGTCAACG TTCCGTCTGAATATTGTGTG
That is, all of the data for each label is presented until a new label is encountered.
(I will be using the "sequential" format in this documentation.)
Be careful with blank characters. PHYLIP can be picky about blanks at the end of lines or at the end of sequences, and it does not want to see blank lines between the data entries. It is usually OK with blanks within the data, but a blank at the end of a line can be difficult to debug.
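One way to avoid stray blanks is to let a script write the file. The Python sketch below is an assumption-laden illustration: the name "format_sequential" and its parameters are hypothetical. It follows this document's convention of putting the shortest sequence length in the header. PHYLIP's own documentation describes names as occupying the first ten characters, padded with blanks, so the label width is left as a parameter.

```python
def format_sequential(records, name_width=10):
    """Format (label, sequence) pairs as sequential PHYLIP text.

    The header holds the number of taxa and, following this document's
    convention, the length of the shortest sequence.
    """
    cleaned = []
    for label, sequence in records:
        # Remove every blank from the sequence so none can trail a line.
        cleaned.append((label.strip(), "".join(sequence.split())))
    shortest = min(len(sequence) for _, sequence in cleaned)
    lines = [f"{len(cleaned)} {shortest}"]
    for label, sequence in cleaned:
        # Pad (or cut) the label to a fixed width before the data starts.
        lines.append(label[:name_width].ljust(name_width) + sequence)
    # No blank lines between entries; one newline ends the file.
    return "\n".join(lines) + "\n"
```

Given the Bob, Frank, and Mark records above, the first line of the result is "3 30" and no line ends in a blank.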
----------------------------------------------------------------------------------
As a completed example, a data file for DYS439 for three members might look like this:
3 213
40777 TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC
42370 TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC
41641 TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC
- where "3" is the number of taxa, and "213" is the length of the shortest line.
The above would represent 11 repeats on DYS439 for kits 40777 and 43270,
and 12 repeats for kit #41641.
Again, this data is ONLY for the STR data for a single column entry (DYS439).
Therefore, each DYS value should be entered separately.
And the data should be ready for the SEQBOOT program for replication.
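Before handing a file to SEQBOOT, it may help to check it against the rules described so far. The Python sketch below is a hypothetical helper, not a PHYLIP tool. It assumes the label is the first blank-separated field on each line, as in the example above, and that the header gives the shortest sequence length.

```python
def check_sequential(text):
    """Return a list of problems found in sequential PHYLIP text."""
    problems = []
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single final newline is fine
    if not lines:
        return ["file is empty"]
    header = lines[0].split()
    if len(header) != 2 or not all(field.isdigit() for field in header):
        return ["first line should be: <number of taxa> <number of characters>"]
    taxa, length = int(header[0]), int(header[1])
    entries = lines[1:]
    if len(entries) != taxa:
        problems.append(f"header says {taxa} taxa, found {len(entries)} lines")
    lengths = []
    for number, line in enumerate(entries, start=2):
        if not line.strip():
            problems.append(f"line {number} is blank")
            continue
        if line != line.rstrip():
            problems.append(f"line {number} ends with a blank")
        fields = line.split()
        # Everything after the label counts as sequence data.
        lengths.append(len("".join(fields[1:])))
    if lengths and min(lengths) != length:
        problems.append(
            f"header says {length}, shortest sequence is {min(lengths)}")
    return problems
```

An empty list means none of these particular problems were found; it does not guarantee that PHYLIP will accept the file.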
----------------------------------------------------------------------------------
Once you have generated (perhaps) 37 individual DYS values using FT2DNA, then you can run them through:
1) SEQBOOT and make a file with at least 100 bootstrapped data sets.
2) then run DNADIST using the output of SEQBOOT as its input
3) then run (say) KITSCH using the output of DNADIST as its input
4) then run CONSENSE using the tree file from KITSCH as its input.
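The hand-off between the four programs is mostly a matter of renaming files. By default the PHYLIP programs read a file named "infile" (CONSENSE reads "intree") and write "outfile" and, for trees, "outtree". The Python sketch below only automates that renaming. It is an assumption-heavy illustration: the function name, the lowercase program names, and the idea of sending menu answers on standard input are not taken from this document, and the answers themselves (for example, the setting for multiple data sets) depend on your PHYLIP version, so they must be supplied by you.

```python
import shutil
import subprocess
from pathlib import Path

# (program, file it reads, file handed on to the next step)
STEPS = [
    ("seqboot", "infile", "outfile"),
    ("dnadist", "infile", "outfile"),
    ("kitsch", "infile", "outtree"),
    ("consense", "intree", "outtree"),
]
RESERVED = ("infile", "intree", "outfile", "outtree")


def run_pipeline(data_file, work_dir, answers, run=subprocess.run):
    """Chain the four programs, copying each result to the next input name.

    answers maps a program name to the menu keystrokes sent on its
    standard input. Returns the path of the final consensus tree file.
    """
    work_dir = Path(work_dir)
    current = Path(data_file)
    if current.name in RESERVED:
        raise ValueError("give the data file a name other than " + current.name)
    for program, input_name, result_name in STEPS:
        # Remove leftovers so a program never prompts about an old file.
        for stale in RESERVED:
            (work_dir / stale).unlink(missing_ok=True)
        shutil.copyfile(current, work_dir / input_name)
        run([program], input=answers[program], text=True,
            cwd=work_dir, check=True)
        # Keep each step's result under a name that records its origin.
        saved = work_dir / f"{program}.{result_name}"
        (work_dir / result_name).replace(saved)
        current = saved
    return current
```

The programs must be on the search path for this to work, and it is worth running each one by hand once to learn which menu answers it needs.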
SUMMARY:
In this example for DYS439, participants 43270 and 40777 have the same data. Participant 41641 has data within a different haplotype group. (You will not want to mix haplotype groups. This is an example to show the difference in output.)
- The first part of the output is simply a display of the input file.
- Next, it gives a character display of the most parsimonious tree.
- Next is a count of the number of steps to obtain the most parsimonious tree.
- Next is a table showing the number of steps at each site given in the "ATGC" input format.
- Finally, a display of the states of the resulting tree.
(This indicates the steps that are required to obtain the most parsimonious tree for the taxa within the input file.)
To make sense of this output, you can transform the "steps in each site" from the "most parsimonious tree" back into the FTDNA format. Or, the data can be used in other packages.
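One possible way to start relating a site number back to the FTDNA format is to work out whether the site falls in the prefix, in a particular copy of the repeat, or in the suffix. The Python sketch below is a guess at a useful helper, not something the FT2DNA programs provide; the name "locate_site" and its arguments are assumptions. Sites are counted from 1.

```python
def locate_site(site, prefix_length, unit_length, repeats, suffix_length):
    """Map a 1-based site number to a (region, index) pair.

    region is "prefix", "repeat", or "suffix". For "repeat" the index is
    the repeat copy (counted from 1); otherwise it is the position
    within that region.
    """
    repeat_end = prefix_length + unit_length * repeats
    total = repeat_end + suffix_length
    if not 1 <= site <= total:
        raise ValueError(f"site must be between 1 and {total}")
    if site <= prefix_length:
        return ("prefix", site)
    if site <= repeat_end:
        # Which copy of the repeat unit the site falls in.
        return ("repeat", (site - prefix_length - 1) // unit_length + 1)
    return ("suffix", site - repeat_end)
```

For DYS439 with 11 repeats (a 124-character prefix, a 4-character repeat unit, and a 45-character suffix), site 125 falls in the first repeat, site 168 in the eleventh, and site 169 is the first character of the suffix.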
Multiple output trees could also be run against the PHYLIP "CONSENSE" program to generate a "consensus" tree.
Or, the output could be used as input to the PHYLIP "DNAPENNY" program. (You will NOT want to use the DNAPENNY package for more than 10 individuals because of the length of compute time it takes.) The DNAPENNY program uses branch-and-bound to find all most parsimonious trees.
Or, the DYS data conversion to ATGC output could be compared to that obtained by using Dean McGee’s Y-DNA Utility. For example, it might be interesting to run TREEDIST as in:
1) KITSCH against Dean McGee's Y-DNA Utility output, as compared to
2) ATGC output run through SEQBOOT, DNADIST, KITSCH, and CONSENSE
using TREEDIST to determine if the "Branch Score" has any relevance to known genealogical information.
You could use the PHYLIP program "SEQBOOT" to convert the data into another format which can then be used or recognized by other packages (e.g., NEXUS format).
For example, the data can be used with the Lamarc program, if you describe the data to be
"DNA" in "Phylip" format for the Lamarc GUI data converter. See:
http://evolution.gs.washington.edu/lamarc/index.html
Other Programs:
Lamarc does some Bayesian or Likelihood analysis with multiple genomic regions, recombination rates, and migration rates. The Lamarc web site also has the packages Migrate, Coalesce, Fluctuate, and Recombine. Recombine needs mapping information to be reliable, but you can run the ATGC conversion data as “DNA” within the Lamarc program, say using these options within the “gui_lam_conv” utility:
- PHYLIP format
- DNA data
- Genomic Region: DYS #
- Population: DNA Project Group#
FT2DNA QUALITY CHECKS:
I have tested the programs for quality, using the following checks, when readily
available:
- forward primers as given by NIST
- occasionally, complement primers are also shown on the output
In these cases, I have compared the examples given by the Sorenson web site.
- embedded nucleotides
- repeated characters match the number of repeats given
- bp sums (or sizes) correspond to the values given by NIST
- in some cases, I have run a BLAST search against the data in order to determine
if the data given matches reality.
- In all cases, I have compared the output to NIST standards. In some cases, I have
compared the output as given by Sorenson and the GDB Human Genome Database.
(Perhaps what is needed is some mapping information, besides the GUI....)
Known problems:
Problems remain getting DYS448 with correct sums or sizes.
Problems with sizes for DYS607, as that information is not readily available to me.
Problems with DYS values that can be represented in more than one way. For these
cases, I have added comments to the output in an attempt to clarify.
- Examples would be DYS447, DYS448, and perhaps DYS389ii.
- Dave Hamm
Franklin, Ohio
HAM Surname DNA Project Coordinator
email: odoniv (at) yahoo.com
URL: http://ham-country.com/HamCountry/HAMCountry.html