FT2DNA


by Dave Hamm, Franklin, Ohio - copyright 2006

 Conversion of Family Tree DNA 'repeat' data format into 'ATGC' format.

HAM Surname DNA Project

Research through Genetics





 Designed to convert FTDNA data to ATGC format for analysis by other utilities.


QUICK START:


From a Windows command line window (Start/run/cmd):


   <DYS_program_name> <# of repeats>


As in:


   DYS439 11

 or,

   DYS439 11 >DYS439_output.txt


Download FT2DNA   version 1.3 package   (compiled with GCC for Windows XP)



OVERVIEW:


  Overview for the use with PHYLIP "DNAPARS," "CONSENSE" etc.


FT2DNA is a set of programs that convert the individual DYS repeat values given by FTDNA into "ATGC" format for use with the PHYLIP suite of programs, or any other program requiring the ATGC structure.


Not for the faint of heart, this procedure takes some time. Be prepared for a grueling experience. The first run through may be fairly frustrating for the average genealogist.

 

At this time, I do not have a GUI for these programs. Nor have I written a comprehensive program that would accept more than one argument. So, for 37 markers, you will need to run 37 different routines to convert the data for each individual kit. If you have 300 participants in your surname project, this will take some time. (You may want to recruit some of your participants for the data conversion procedure.) Another missing feature is that no markers beyond the first 37 FTDNA markers are translated here.


Finally, I have not included the data format from other vendors who may test different DYS values than does FTDNA.


The steps to make the output useful involve:


1) Run FT2DNA on each individual DYS value for each individual DNA participant.

2) Run the PHYLIP package program "SEQBOOT."

3) Run another PHYLIP package program, such as "DNAPARS," for example.

4) Run the PHYLIP package program, "CONSENSE."


  Once you have the "SEQBOOT" output, then you can run several other PHYLIP programs against the data, such as DNAPARS, DNADIST, KITSCH, CONTML, etc.

  Once you have the CONSENSE output, then you can run more PHYLIP package programs, such as TREEDIST.


As for PHYLIP, another example (as derived from Felsenstein’s documentation) would be:


1) Run FT2DNA on each individual DYS value for each individual DNA participant.

2) Run the PHYLIP package program "SEQBOOT."

3) Run the PHYLIP package program "DNADIST."

4) Run the PHYLIP package program "KITSCH."

5) Run the PHYLIP package program "CONSENSE."


This would be an independent alternative to using Dean McGee's Y-DNA Utility for Kitsch. Obviously, Dean McGee's Utility would save a lot of time in deriving a Kitsch graph.


An example for computing differences, as given by Felsenstein, would be:


1) Run FT2DNA on each individual DYS value for each individual DNA participant.

2) Run the PHYLIP package program "SEQBOOT."

3) Run GENDIST to compute the Nei (1972) genetic distance.

4) Run CONTML

5) Run KITSCH to make a tree.


With CONTML, the units of length in the output are amounts of expected accumulated variance (and NOT time).


----------------------------------------------------------------------------------

I have designed the FTDNA translations into ATGC format to enable me to run the programs within the PHYLIP package. Later I plan to explore other software packages, but the basic idea here is to get started with the basic conversion.


The PHYLIP web site is located at:


   http://evolution.genetics.washington.edu/phylip.html

If you want to move away from the PHYLIP packages, you can use SEQBOOT ("J" option for rewrite) to convert the data into NEXUS format. The MEGA program can then convert this data into FASTA format, and so on. 




RUNNING THE FT2DNA PROGRAMS:





Each program is designed to convert one STR at a time. For example, the "DYS439.exe" program will only convert the repeat values represented by DYS439.

Also, the programs do NOT have a GUI, so you need to run them from a command line window.


-----------------------------------------------------
To Create an icon for the Windows Desktop:
-----------------------------------------------------

Find the MS-DOS prompt:
From your desktop find the "Start Button" on the lower left corner.
Click on:        Start > Programs > Accessories
The MS-DOS Prompt is in the Accessories Menu.
Right Click the MS-DOS Prompt Icon and drag it to your desktop and create a shortcut on your Desktop.

- Right click on the newly created Desktop icon and select "Rename."  Rename it to FT2DNA.
- Right click on the newly created Desktop icon again and select "Properties."
   Under the "Shortcut" tab, there is a box for "Start in:"  _____
   Type the full path to the FT2DNA executables in this "Start in:" box.

     example:     K:\Data\Genealogy\DNA\FT2DNA1.2

- Click on "Apply."

If you wish to change the command line options for font color or background colors, select the "Colors" tab.
(For example, I use a grey background with a blue font.)

When you are done, click on "Apply" then "OK"
 
-------------------------------------------------
If you have not created an icon:
-------------------------------------------------


To start a "command line window" from Microsoft Windows:

 - click on "START"

 - click on "RUN"

 - type in "cmd" (without the quotes)

 - click on "OK"


 a command line window should appear.

Then, you need to "cd" to change directory to where the conversion routines are located.


  cd /D E:\Data\DNA


  when done, you can return to your default directory with:


  cd /D C:


To create the converted values for DYS439, then, run the program from a command line

interface, and provide the number of repeats as an argument:


   <DYS_program_name> <# of repeats>


As in:


   DYS439 11


To capture the output from the command line window, re-direct the output using the ">"

re-direction symbol, as in:


   DYS439 11 >DYS439_output.txt
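
Because each program converts a single marker, a short script can at least assemble the command lines for one kit. The following is a minimal Python sketch and is not part of the FT2DNA package. It assumes that every executable is named after its marker (as "DYS439.exe" is) and can be found by the system; the "DYS390" program name and the repeat values are illustrative assumptions only.

```python
import subprocess


def build_commands(repeats_by_marker):
    """Return one argument list per marker, e.g. ['DYS439', '11']."""
    return [[marker, str(count)] for marker, count in repeats_by_marker.items()]


def run_kit(repeats_by_marker, out_path):
    """Run each marker program in turn and collect the output in one file.

    Assumption: the FT2DNA executables are in the current directory
    or otherwise findable by the operating system.
    """
    with open(out_path, "w") as out:
        for argv in build_commands(repeats_by_marker):
            result = subprocess.run(argv, capture_output=True, text=True, check=True)
            out.write(result.stdout)


# Hypothetical repeat values for a single kit.
kit = {"DYS439": 11, "DYS390": 24}
commands = build_commands(kit)
```

Calling run_kit(kit, "kit_output.txt") would then play the role of typing each command with the ">" re-direction by hand.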






CONVERSION:


The format of the ATGC conversion data is:


   taxa [number of data characters in shortest line]

  [prefix][repeats][suffix]


 - where:


   [prefix] is from the Sorenson web page, and looks like (DYS439, for example): TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAA


   The prefix does not repeat. However, the repeat value (given as "GATA") does repeat.


 - The number of repeats is given by the FTDNA data.

   For example, [repeats] for DYS439 is given by: GATA

   And, if the FTDNA data gives "11" for DYS439, then the [repeats] string would look like this:


     GATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATA


   That is, "GATA" repeated 11 times.

  [suffix] is from the Sorenson web page, and looks like this (DYS439, for example):


  GAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC


so, for data entry into the PHYLIP "DNAPARS" program, the format looks like this:


1 45

MEMBER GAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC....
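
The [prefix][repeats][suffix] layout described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the source of the FT2DNA programs; the prefix, repeat unit, and suffix are the DYS439 strings quoted in this section, and the function name is hypothetical.

```python
# Flanking sequences for DYS439, copied from the text above.
PREFIX = (
    "TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTG"
    "AATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTG"
    "GAGACAGATAGATGATAAATAGAA"
)
MOTIF = "GATA"
SUFFIX = "GAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC"


def to_atgc(repeats, prefix=PREFIX, motif=MOTIF, suffix=SUFFIX):
    """Build prefix + (motif repeated `repeats` times) + suffix."""
    if repeats < 0:
        raise ValueError("repeat count cannot be negative")
    return prefix + motif * repeats + suffix


sequence_11 = to_atgc(11)
sequence_12 = to_atgc(12)
```

Each additional repeat lengthens the result by the four characters of "GATA", so a kit with 12 repeats yields a line four characters longer than a kit with 11.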

----------------------------------------------------------------------------------

A word about the PHYLIP file formats -


PHYLIP uses two file formats, one is called "sequential," and the other "interleaved."


To show an example from the LAMARC software documentation, interleaved looks like

this:


3 30
Bob ATTGTCTACG
Frank ACTGTCTACG
Mark ACTGTCAACG

TTCCGTCTGGATATTGTGTG
TTACGTCTGGATATTGTGTC
TTCCGTCTGAATATTGTGTG


That is, the first lines of the data include the name or label, then the lines to follow

include the information for each label, in the same order.


A second data format is "sequential," and looks like this:


3 30
Bob ATTGTCTACG TTCCGTCTGGATATTGTGTG
Frank ACTGTCTACG TTACGTCTGGATATTGTGTC
Mark ACTGTCAACG TTCCGTCTGAATATTGTGTG


That is, all of the data for each label is presented until a new label is encountered.

(I will be using the "sequential" format in this documentation.)


Be careful with blank characters. PHYLIP format can be picky about blank characters at the end of lines or at the end of sequences, and it does not want to see blank lines between the data entries. It is usually OK with blanks within the data, but a blank at the end of a line can be difficult to debug.
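
Since stray blanks are hard to see, a small check before handing a file to PHYLIP may save some debugging. The Python sketch below is an assumption about what is worth flagging (blank lines and blanks at the end of a line), based only on the cautions above; it does not validate the PHYLIP format in general, and the function name is hypothetical.

```python
def find_blank_problems(text):
    """Return (line_number, message) pairs for suspicious whitespace."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            problems.append((number, "blank line"))
        elif line != line.rstrip():
            problems.append((number, "trailing blank"))
    return problems


# Line 2 ends with a blank and line 3 is empty.
sample = "3 30\nBob ATTGTCTACG \n\nFrank ACTGTCTACG\n"
report = find_blank_problems(sample)
```

An empty list means that neither of the two problems was found.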

----------------------------------------------------------------------------------

  As a completed example, a data file for DYS439 for three members might look like this:



3 213

40777 TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC

43270 TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC

41641 TCGAGTTGTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCTATCTTGAATTAATAGATTCAAGGTGATAGATATACAGATAGATAGATACATAGGTGGAGACAGATAGATGATAAATAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGAAAGTATAAGTAAAGAGATGATGGGTAAAAGAATTCCAAGCCAC


 - where "3" is the number of taxa, and "213" is the length of the shortest line.

   The above would represent 11 repeats on DYS439 for kits 40777 and 43270,

   and 12 repeats for kit #41641.


   Again, this data is ONLY for the STR data for a single column entry (DYS439).

   Therefore, each DYS value should be entered separately.


   And the data should be ready for the SEQBOOT program for replication.
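
A data file of this shape could be assembled from the converted sequences with a short Python sketch such as the one below. It follows the layout used in this document: a header with the number of taxa and the length of the shortest line, then one line per kit. The function name and the optional name_width parameter are assumptions; the parameter is offered because the PHYLIP documentation describes names as ten characters long, padded with blanks. The toy sequences stand in for real FT2DNA output.

```python
def phylip_sequential(records, name_width=None):
    """records: list of (name, sequence) pairs -> sequential-format text.

    The second header number is the length of the shortest sequence,
    as in the example above. If name_width is given, names are padded
    with blanks to that width instead of being followed by one blank.
    """
    if not records:
        raise ValueError("at least one record is required")
    shortest = min(len(sequence) for _, sequence in records)
    lines = [f"{len(records)} {shortest}"]
    for name, sequence in records:
        label = name.ljust(name_width) if name_width else name + " "
        lines.append(label + sequence)
    return "\n".join(lines) + "\n"


# Toy sequences only; real input would be the converted ATGC strings.
toy = [("40777", "ACGTACGT"), ("41641", "ACGTACGTACGT")]
text = phylip_sequential(toy)
padded = phylip_sequential(toy, name_width=10)
```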

----------------------------------------------------------------------------------




Once you have generated (perhaps) 37 individual DYS values using FT2DNA, then you can run them through:



1) SEQBOOT and make a file with at least 100 bootstrapped data sets.

2) then run DNADIST using the output of SEQBOOT as its input

3) then run (say) KITSCH using the output of DNADIST as its input

4) then run CONSENSE using the tree file from KITSCH as its input.
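
The four steps above mostly come down to passing one program's output file to the next program under the name it expects. The Python sketch below only records those hand-offs and copies the files; it does not drive the PHYLIP menus, which still have to be answered by hand (including any setting for multiple data sets). The file names "infile", "outfile", "outtree", and "intree" are the PHYLIP defaults as I understand them, so treat them as assumptions and check them against your PHYLIP version.

```python
import shutil
from pathlib import Path

# (program, file it reads, file handed to the next step)
PIPELINE = [
    ("seqboot", "infile", "outfile"),
    ("dnadist", "infile", "outfile"),
    ("kitsch", "infile", "outtree"),
    ("consense", "intree", "outtree"),
]


def handoffs(pipeline=PIPELINE):
    """List (finished_program, produced_file, next_input_file) triples."""
    return [
        (program, produced, pipeline[index + 1][1])
        for index, (program, _reads, produced) in enumerate(pipeline[:-1])
    ]


def hand_off(workdir, produced, next_input):
    """Copy one step's result to the name the next program reads."""
    workdir = Path(workdir)
    shutil.copyfile(workdir / produced, workdir / next_input)


plan = handoffs()
```

After SEQBOOT finishes, for example, hand_off(".", "outfile", "infile") would prepare the input for DNADIST.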




SUMMARY:


In this example for DYS439, participants 43270 and 40777 have the same data. Participant 41641 has data within a different haplotype group. (You will not want to mix haplotype groups. This is an example to show the difference in output.)


- The first part of the output is simply a display of the input file.

- Next, it gives a character display of the most parsimonious tree.

- Next is a count of the number of steps to obtain the most parsimonious tree.

- Next is a table showing the number of steps at each site given in the "ATGC" input format.

- Finally, a display of the states of the resulting tree.


   (This indicates the steps that are required to obtain the most parsimonious tree for the taxa within the input file.)


To make sense of this output, you can transform the "steps in each site" from the "most parsimonious tree" back into the FTDNA format. Or, the data can be used in other packages.


Multiple output trees could also be run against the PHYLIP "CONSENSE" program to generate a "consensus" tree.


Or, the output could be used as input to the PHYLIP "DNAPENNY" program, which uses branch-and-bound to find all most parsimonious trees. (You will NOT want to use DNAPENNY for more than 10 individuals because of the compute time it takes.)


Or, the ATGC output of the DYS data conversion could be compared to the output obtained by using Dean McGee’s Y-DNA Utility. For example, it might be interesting to run TREEDIST as in:


1) KITSCH against Dean McGee's Y-DNA Utility output, as compared to

2) ATGC output run through SEQBOOT, DNADIST, KITSCH, and CONSENSE


 using TREEDIST to determine if the "Branch Score" has any relevance to known genealogical information.


You could use the PHYLIP program "SEQBOOT" to convert the data into another format that can be recognized by other packages (e.g., NEXUS format).

For example, the data can be used with the Lamarc program, if you describe the data to be

"DNA" in "Phylip" format for the Lamarc GUI data converter. See:


   http://evolution.gs.washington.edu/lamarc/index.html



Other Programs:


Lamarc does some Bayesian or Likelihood analysis with multiple genomic regions, recombination rates, and migration rates. The Lamarc web site also has the packages Migrate, Coalesce, Fluctuate, and Recombine. Recombine needs mapping information to be reliable, but you can run the ATGC conversion data as “DNA” within the Lamarc program, say using these options within the “gui_lam_conv” utility:


 - PHYLIP format

 - DNA data

 - Genomic Region: DYS #

 - Population: DNA Project Group#


The program PHYML takes a PHYLIP-like ATGC data format, and claims to be quicker than similar programs.
You can find PHYML at:

       http://atgc.lirmm.fr/phyml/



FT2DNA QUALITY CHECKS:


I have tested the programs for quality, using the following checks, when readily

available:


 - forward primers as given by NIST

 - occasionally, complement primers are also shown on the output

    In these cases, I have compared the examples given by the Sorenson web site.

 - embedded nucleotides

 - repeated characters match the number of repeats given

 - bp sums (or sizes) correspond to the values given by NIST

 - in some cases, I have run a BLAST search against the data in order to determine

   if the data given matches reality.

 - In all cases, I have compared the output to NIST standards. In some cases, I have

   compared the output as given by Sorenson and the GDB Human Genome Database.
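
Two of the checks listed above (the repeated characters match the number of repeats, and the total size adds up) are mechanical enough to sketch in Python. This is an illustration, not the procedure actually used for FT2DNA: the expected size is computed from the parts themselves, since the NIST sizes are not reproduced here, and the short prefix and suffix in the example are made-up stand-ins.

```python
def check_conversion(sequence, prefix, motif, suffix, repeats):
    """Return a list of problems found in one converted sequence."""
    problems = []
    expected_length = len(prefix) + len(motif) * repeats + len(suffix)
    if len(sequence) != expected_length:
        problems.append(f"length {len(sequence)}, expected {expected_length}")
    if not sequence.startswith(prefix):
        problems.append("prefix does not match")
    if not sequence.endswith(suffix):
        problems.append("suffix does not match")
    # Whatever lies between the flanking regions should be the repeats.
    core = sequence[len(prefix):len(sequence) - len(suffix)]
    if core != motif * repeats:
        problems.append("repeat block does not match the repeat count")
    return problems


# Made-up flanking strings, for illustration only.
good = "TTAC" + "GATA" * 3 + "CCG"
ok = check_conversion(good, "TTAC", "GATA", "CCG", 3)
bad = check_conversion(good, "TTAC", "GATA", "CCG", 4)
```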


  (Perhaps what is needed is some mapping information, besides the GUI....)           




Known problems:


  Problems remain getting DYS448 with correct sums or sizes.

  Problems with sizes for DYS607, as that information is not readily available to me.

  Problems with DYS values that can be represented in more than one way. For these

  cases, I have added comments to the output in an attempt to clarify.

  - Examples would be DYS447, DYS448, and perhaps DYS389ii.




 - Dave Hamm   


Franklin, Ohio

HAM Surname DNA Project Coordinator

email: odoniv (at) yahoo.com

URL: http://ham-country.com/HamCountry/HAMCountry.html

