FT2PHY
by Dave Hamm, Franklin, Ohio - copyright 2009
Conversion of Family Tree DNA 'repeat' data format files into 'ATGC' format files.
Designed to convert FTDNA data to ATGC format for use in other genetic software utilities.
QUICK START:
From a Windows command line window (Start/Run/cmd):
ft2phy <name of input file>
As in:
ft2phy HAM_to_paste_into_McGee.txt
The ft2phy program does NOT have a GUI interface, so you need to run it from a command line window.
OVERVIEW:
Overview for the use with PHYLIP "DNAPARS," "CONSENSE," PHYML, Tree Puzzle, LAMARC, etc.
FT2PHY is a program that converts the individual DYS repeat values given by FTDNA into "ATGC" format for use with genetics programs, or with any other program requiring the ATGC structure in what is known as the "PHYLIP" format.
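As a rough illustration of what converting a repeat value into ATGC letters can look like, here is a minimal Python sketch. It is not the ft2phy source: the function name, the "GATA" motif, the empty flanking sequences, and the idea of padding shorter results with "?" are all assumptions made for this example.

```python
def repeats_to_atgc(repeats, motif="GATA", left_flank="", right_flank="",
                    width=None, pad="?"):
    """Expand a repeat count into a letter sequence.

    All defaults are hypothetical; the motifs and flanking sequences
    that ft2phy really uses for each marker are not documented here.
    """
    if repeats < 0:
        raise ValueError("repeat count must be non-negative")
    seq = left_flank + motif * repeats + right_flank
    if width is not None:
        if len(seq) > width:
            raise ValueError("sequence is longer than the requested width")
        # Pad so that kits with fewer repeats still line up column by column.
        seq = seq.ljust(width, pad)
    return seq


# Two kits that differ by one repeat at the same (hypothetical) marker.
print(repeats_to_atgc(3, width=16))  # GATAGATAGATA????
print(repeats_to_atgc(4, width=16))  # GATAGATAGATAGATA
```

Padding to a fixed width keeps every kit the same length, which PHYLIP-style programs generally expect.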
FT2PHY was primarily written for use with LAMARC, but with
some editing, the output can be used with any number of genetics
programs that require the "PHYLIP" format.
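That basic format is usually a header line giving the number of kits and the number of characters per kit, followed by each kit's name and its ATGC sequence (an example appears in the Output section below).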
The PHYLIP web site is located at:
http://evolution.genetics.washington.edu/phylip.html
Lamarc does some Bayesian or Likelihood analysis with multiple genomic regions,
recombination rates, and migration rates.
The LAMARC web site:
http://evolution.gs.washington.edu/lamarc/index.html
MEGA software is used for generating Phylogenetic and Network Tree (diagrams) from the
PHYLIP data.
At this time, I do not have a GUI interface for this program.
Installing FT2PHY:
The ft2phy program does NOT have a GUI interface, so you need to run it from a command line window.
RUNNING THE FT2PHY PROGRAM:
To start a "command line window" from Microsoft Windows:
- click on "START"
- click on "RUN"
- type in "cmd" (without the quotes)
- click on "OK"
A command line window should appear.
Then, you need to "cd" to change directory to where the conversion routines are located.
cd /D E:\Data\DNA
When done, you can return to your default directory with:
cd /D C:
To create the converted values, run the program from a command line
interface, and provide the name of your data file as an argument:
ft2phy <name of your data file>
As in:
ft2phy HAM_to_paste_into_McGee.txt
Output
ft2phy will produce output into a directory called "ft2phy_files." That
directory should contain 37 ATGC data files, one file for each DYS
translated (example: ft2phy_CDYa.phy, ft2phy_CDYb.phy,
ft2phy_DYS19.phy, etc.)
It was designed this way because LAMARC is currently my favorite
program, and LAMARC's GUI data converter can import these files.
However, most genetic software programs that use PHYLIP compatible
input data require the data to be placed in one file. So, for most
programs, you will need to create one input file by combining the data
from each of the 37 files with a text editor (e.g., Notepad).
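The combining step can also be scripted instead of done by hand. The Python sketch below is one plausible reading of that step, not part of ft2phy: it concatenates each kit's sequence across the per-marker files and writes a single header. It assumes kit names are at most 10 characters, that a sequence either follows the name on the same line or starts on the next line, and that only kits present in every file are kept. The names parse_phylip, merge_markers, merge_directory, and combined.phy are hypothetical.

```python
from pathlib import Path


def parse_phylip(text):
    """Parse one per-marker file into (names, sequences_by_name).

    Assumed layout: a header "<kits> <sites>", then for each kit a name
    of at most 10 characters, with the sequence either on the same line
    or on the lines that follow.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_kits, n_sites = (int(x) for x in lines[0].split()[:2])
    names, seqs, i = [], {}, 1
    for _ in range(n_kits):
        if i >= len(lines):
            raise ValueError("fewer kits than the header claims")
        line = lines[i]
        i += 1
        name, seq = line[:10].strip(), line[10:].replace(" ", "")
        # Keep reading lines until this kit has all of its sites.
        while len(seq) < n_sites and i < len(lines):
            seq += lines[i].replace(" ", "")
            i += 1
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        names.append(name)
        seqs[name] = seq
    return names, seqs


def merge_markers(parsed_files):
    """Concatenate each kit's sequences across markers, in file order."""
    if not parsed_files:
        raise ValueError("no files to merge")
    # Keep only kits found in every file so all rows have equal length.
    common = set(parsed_files[0][1])
    for _, seqs in parsed_files[1:]:
        common &= set(seqs)
    order = [n for n in parsed_files[0][0] if n in common]
    merged = {n: "".join(seqs[n] for _, seqs in parsed_files) for n in order}
    total = len(merged[order[0]]) if order else 0
    out = [f"{len(order)} {total}"]
    out.extend(f"{n:<10} {merged[n]}" for n in order)
    return "\n".join(out) + "\n"


def merge_directory(directory="ft2phy_files", out_name="combined.phy"):
    """Merge every ft2phy_*.phy file in the directory into one file."""
    files = sorted(Path(directory).glob("ft2phy_*.phy"))
    parsed = [parse_phylip(f.read_text()) for f in files]
    Path(out_name).write_text(merge_markers(parsed))
```

Calling merge_directory() from the folder that contains "ft2phy_files" would write combined.phy. Note that sorted() orders the files alphabetically, which is not the FTDNA marker order; that matters only if the program you feed cares about column order.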
These individual DYS files will be in ATGC PHYLIP format.
For example:
29 253
40777_WmVA
ACTACTGAGTTTCTGTTATAGTGTTTTTTAATATATATATAGTATTATATATAGTGTTATATATATATA
GTGTTTTAGATAGATAGATAGGTAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGTGACACTCT
CCTTAACCCAGATGGACTCCTTGTCCTCACTACATGGCCATGGCCCGAAGTATTACTCCTGGTGCCCCAGCCACTATTTC
CAGGTGCAGAGATTGACCAT????
68140_WmVA
ACTACTGAGTTTCTGTTATAGTGTTTTTTAATATATATATAGTATTATATATAGTGTTATATATATATA
GTGTTTTAGATAGATAGATAGGTAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGTGACACTCT
CCTTAACCCAGATGGACTCCTTGTCCTCACTACATGGCCATGGCCCGAAGTATTACTCCTGGTGCCCCAGCCACTATTTC
CAGGTGCAGAGATTGACCAT????
(...etc.)
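In the example above, the first line follows the PHYLIP convention: the number of sequences (kits), then the number of characters in each sequence. The short Python sketch below shows how such a header can be derived from the data. It mirrors the layout as displayed in this document (name line, then sequence lines); the function name format_phylip and the sample records are hypothetical, and this is not how ft2phy itself is implemented.

```python
def format_phylip(records, line_width=80):
    """Build PHYLIP-style text from a list of (name, sequence) pairs."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # Header: number of kits, then number of characters per kit.
    out = [f"{len(records)} {n_sites}"]
    for name, seq in records:
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        out.append(name.ljust(10))
        # Wrap the sequence across lines of at most line_width characters.
        out.extend(seq[i:i + line_width] for i in range(0, n_sites, line_width))
    return "\n".join(out) + "\n"


# Hypothetical two-kit example with 8 characters per kit.
print(format_phylip([("kitA", "GATAGATA"), ("kitB", "GATA????")], line_width=4))
```

The equal-length check is the important part: a header such as "29 253" is only valid if all 29 kits really have 253 characters.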
ft2phy will produce quite a bit of verbose output,
which I have used for a few quality checks. Please ignore the verbose
output, as the data is stored in files. I will remove the verbose
output at a later date.
ft2phy will create a Genetic Distance table in the style of Dean McGee's
Y-DNA Comparison Utility. However, the calculated Genetic Distance is not
quite correct. No attempt has been made to improve this. Please use Dean
McGee's Utility for a more accurate Genetic Distance table.
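For readers unfamiliar with the idea, a very simple form of genetic distance just adds up the differences in repeat counts, marker by marker. The Python sketch below shows only that simple stepwise count. It is neither the calculation inside ft2phy nor the one in Dean McGee's Utility, which may treat some markers differently; the function names and the sample values are made up for illustration.

```python
def genetic_distance(kit_a, kit_b):
    """Sum of absolute repeat differences across markers (simple stepwise count)."""
    if len(kit_a) != len(kit_b):
        raise ValueError("both kits must have the same number of markers")
    return sum(abs(a - b) for a, b in zip(kit_a, kit_b))


def distance_table(kits):
    """Return a nested dict: table[name_a][name_b] -> distance."""
    names = list(kits)
    return {a: {b: genetic_distance(kits[a], kits[b]) for b in names}
            for a in names}


# Hypothetical repeat values for three kits at three markers.
kits = {
    "kitA": [13, 24, 14],
    "kitB": [13, 25, 14],
    "kitC": [12, 24, 16],
}
table = distance_table(kits)
print(table["kitA"]["kitB"])  # 1
print(table["kitA"]["kitC"])  # 3
```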
ft2phy will try to produce output for each kit that it finds in the
data file. However, if a Project member has only tested for 12 markers,
that particular kit will ONLY be found in the data files for the first
12 markers. (For example, it will be absent from the files with 25 or
37 markers tested.)
Most genetic programs will have a problem trying to interpret data
that does not have a consistent number of kits. That is, if you have 12
marker kits mixed in with your 37 marker data file, then most genetic
software programs will complain about it.
Therefore, it is important to remember to edit your data for a
consistent number of kits.
Which is to say, for best results, only use kits that have been tested
for 37 markers. Or, only use kits that have been tested for 25 markers,
etc. All of the kits in your input file to the "ft2phy" program should
contain the same number of markers tested. The exception to this is
that you can use 37 marker data with 67 marker data, because "ft2phy"
will ignore the input data beyond 37 markers.
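One way to spot kits that would cause this problem is to compare which kits appear in which output files. The Python sketch below assumes you have already collected the kit names found in each file into a dictionary; the function name split_by_coverage and the kit "12345_demo" are hypothetical.

```python
def split_by_coverage(kits_by_file):
    """Return (kits found in every file, kits missing from at least one file)."""
    all_kits = set().union(*kits_by_file.values())
    complete = set(all_kits)
    for kits in kits_by_file.values():
        complete &= set(kits)
    return sorted(complete), sorted(all_kits - complete)


# Hypothetical example: one kit was only tested for the first 12 markers,
# so it appears in the DYS19 file but not in the CDYa file.
kits_by_file = {
    "ft2phy_DYS19.phy": {"40777_WmVA", "68140_WmVA", "12345_demo"},
    "ft2phy_CDYa.phy": {"40777_WmVA", "68140_WmVA"},
}
complete, partial = split_by_coverage(kits_by_file)
print(complete)  # ['40777_WmVA', '68140_WmVA']
print(partial)   # ['12345_demo']
```

Kits listed as partial are the ones to remove from the input file before running ft2phy again.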
For data entry into the PHYLIP "DNAPARS" program, the format should look like this:
----------------------------------------------------------------------------------
GENERAL Procedure:
Convert the FTDNA data into the format for use with Dean McGee's
Y-DNA Utility, and place this into a text file.
The FTDNA data for a surname project should be given in this format:
where:
- the "repeats" part of the FTDNA data would be the
familiar DYS repeat values.
The program should accept 67 marker data, but will only process up to
37 markers.
----------------------------------------------------------------------------------
LAMARC
LAMARC can accept files for each of the 37 markers. That is to say, it
can accept 37 different files in order to get the information for
each individual kit. The LAMARC GUI data converter will accept PHYLIP
format files for this. So, if you are using LAMARC, this program should
save an enormous amount of time just in data entry.
If you have a large Project, note that ft2phy currently has a limit of
600 lines of input data. Ideally, the data file should contain the same
number of markers for each kit listed.
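A quick check of the line count before running ft2phy can save a failed run. The Python sketch below counts the non-blank lines in the input text and compares the total with the 600-line limit. Whether ft2phy itself counts blank lines toward the limit is not known, so skipping them here is an assumption; the function names are hypothetical.

```python
MAX_LINES = 600  # limit stated in this document


def count_data_lines(text):
    """Count non-blank lines (assumes blank lines do not count as data)."""
    return sum(1 for line in text.splitlines() if line.strip())


def check_line_limit(text, limit=MAX_LINES):
    """Return (number of data lines, True if within the limit)."""
    n = count_data_lines(text)
    return n, n <= limit


# Usage with a data file such as HAM_to_paste_into_McGee.txt:
#     with open("HAM_to_paste_into_McGee.txt") as handle:
#         print(check_line_limit(handle.read()))
```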
Finally, I have not included the data format from other vendors who may test different DYS values than does Family Tree DNA.
The steps to make the output useful involve:
- If you have stored your data for use with Dean McGee's Y-DNA
Comparison Utility, then run ft2phy on that text file.
Example:
ft2phy HAM_to_paste_into_McGee.txt
where the text file "HAM_to_paste_into_McGee.txt"
is the data that you would cut-and-paste into Dean McGee's Utility.
This file should only include kits that have been tested for 37 markers
or more.
The data should then be ready for input to LAMARC's GUI data converter. Within the converter, you will need to edit the region names and the data type (DNA), and you will most likely want to merge the data as one population. Then save your data from within the converter, and use that saved file to run LAMARC.
(The "ft2phy" program will not generate an XML file that is
compatible with LAMARC.)
For example, the data can be used with the Lamarc program, if you describe the data to be
"DNA" in "Phylip" format for the Lamarc GUI data converter. See:
http://evolution.gs.washington.edu/lamarc/index.html
Lamarc does some Bayesian or Likelihood analysis with multiple genomic regions, recombination rates, and migration rates. The Lamarc web site also has the packages Migrate, Coalesce, Fluctuate, and Recombine. Recombine needs mapping information to be reliable, but you can run the ATGC conversion data as “DNA” within the Lamarc program, say using these options within the “gui_lam_conv” utility:
- PHYLIP format
- DNA data
- Genomic Region: the DYS #
- Population: Your DNA Project Group#
Franklin, Ohio
HAM Surname DNA Project Coordinator
email: odoniv (at) yahoo.com
URL: http://ham-country.com/HamCountry/HAMCountry.html