This markdown note describes how I used several software programs to process and eventually analyze SNPs from ddRAD tag loci in SNAPP (Bryant et al. 2012). The data were generated using Next-Generation Sequencing on ddRADseq genomic libraries prepped for several species/lineages of Neotropical freshwater fishes from the Brazilian Cerrado (Central Brazil).
My account is given in first person and represents merely one way to analyze data in SNAPP; there are other approaches, and other documents (e.g. the Leaché et al. BFD* tutorial) also present a general approach. However, all of the brief SNAPP guides and tutorials currently available require the user to consult the manual, *A Rough Guide to SNAPP*, written by Bouckaert and Bryant. Since SNAPP is amply covered by Bryant et al. (2012), Leaché et al. (2014), and other papers, I'll skip the introduction to SNAPP and assume the reader is acquainted with the details of the method, its input, and its output.
1. I converted raw NGS data in `.bam` format to `.fastq` format using `samtools`. This was done on Ion Torrent data, and may not be applicable to other platforms or to non-second-generation sequencers.
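
   As a rough sketch (the filenames are hypothetical; `samtools fastq` is the current subcommand, while older samtools releases call it `bam2fq`):

   ```bash
   # Convert every BAM file in the current directory to FASTQ.
   for f in *.bam; do
       samtools fastq "$f" > "${f%.bam}.fastq"
   done
   ```
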
2. I conducted quality checks on the fastq files using the software program `fastqc`.
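
   For example (the report directory name is arbitrary):

   ```bash
   # Write a FastQC report for each fastq file into fastqc_out/.
   mkdir -p fastqc_out
   fastqc -o fastqc_out *.fastq
   ```
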
3. `pyRAD` assembly and processing of `pyRAD` output:
   - I ran the fastq files in `pyRAD` under different parameter settings, then chose a final `pyRAD` run, and its corresponding output assembly, stats, and output files in various phylogenetic and population genetic formats (e.g. `.phy`, `.nex`, `.str`, `.geno`, `.snps`).
   - I specified for `pyRAD` to output all single nucleotide polymorphisms (SNPs) in phylip format, and also an unlinked `.snps` file to work from.
   - Before going further, I removed the outgroup from each datafile so that only ingroup samples were included in subsequent analyses (so I had the original SNPs file and a new file with no outgroups).
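
   In outline, the command-line side of a `pyRAD` run looks something like the sketch below; the actual settings (datatype, clustering threshold, output formats, etc.) live in the params file, so check `pyrad -h` for the options in your version:

   ```bash
   # Write a template params file, edit it by hand, then run steps 1-7.
   pyrad -n                        # creates params.txt
   # ...edit params.txt (ddRAD settings, output formats, etc.)...
   pyrad -p params.txt -s 1234567
   ```
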
4. I downloaded the `phrynomics` R package and loaded the full set of SNP loci (with variable and non-variable sites), which had the `.snps` file extension, into `phrynomics`, where I then 1) removed nonbinary sites (`RemoveNonBinary`), 2) identified and pulled out a single SNP from each locus to use in the analyses (`TakeSingleSNPfromEachLocus`), 3) translated bases to the appropriate format/coding (`TranslateBases`), and 4) wrote the modified SNPs to a NEXUS file with missing data coded as ?'s.
   - I also took advantage of this opportunity to convert my original SNPs data to `MrBayes` and `RAxML` formats within `phrynomics`.
   - All of my R code for step #4 here is in a file named "Hh_and_Hyp_Phrynomics_R_code... txt" in the bin folder I usually use as my R working directory.
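
   A rough, untested sketch of this step as a small R script run from the shell; the function names are the ones used above, but the one-argument calls, the `ReadSNP`/`WriteSNP` arguments, and the file names are assumptions that should be checked against the `phrynomics` help pages:

   ```bash
   # Run the phrynomics filtering/translation steps non-interactively.
   cat > phrynomics_steps.R <<'EOF'
   library(phrynomics)
   snps <- ReadSNP("mydata.snps")        # hypothetical input file name
   snps <- RemoveNonBinary(snps)
   snps <- TakeSingleSNPfromEachLocus(snps)
   snps <- TranslateBases(snps)
   WriteSNP(snps, file = "mydata_binary.nex", format = "nexus")
   EOF
   Rscript phrynomics_steps.R
   ```
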
5. I then modified the NEXUS file so that the datatype was specified as integer, i.e. by setting `FORMAT DATATYPE=integerdata symbols="012" gap=- MISSING=?;` in the nexus header. This way the data was set to be discrete.
6. I modified each individual taxon name so that it was preceded by a code with the format `Species_Taxon`, where Species = the hypothetical species delimitation of the taxon. I did this because `SNAPP` expects this format, and it will automatically pull the Species part from the taxon names and use it to help set up the species tree and population parameters in the analysis, e.g. branch lengths and thetas. The final datafile was thus a nexus file with extension `.nex`.
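
   A minimal illustration of what the header and a few rows of the final SNAPP-ready NEXUS file look like after these two steps (the taxa, species codes, and 0/1/2 values are made up):

   ```
   #NEXUS
   BEGIN DATA;
       DIMENSIONS NTAX=3 NCHAR=10;
       FORMAT DATATYPE=integerdata symbols="012" gap=- MISSING=?;
       MATRIX
       SpeciesA_sample1  0120210012
       SpeciesA_sample2  0110200112
       SpeciesB_sample3  ?120210010
       ;
   END;
   ```
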
7. I opened a new SNAPP template in `BEAUti` 2.4.2 and added the final SNAPP-formatted data file as an alignment, then set up different runs with different priors/models and saved them as different XML files (input files for `BEAST`), each under a different name.
8. I used my shell script `BEASTRunner_newRand.sh` to automate creating run folders and queue submission shell scripts for five runs of each XML file, each run starting from a different random seed selected by calling on python to generate a random number.
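
   I won't reproduce `BEASTRunner_newRand.sh` here, but a stripped-down sketch of the same idea looks roughly like this (the PBS directives, paths, and the `beast` invocation are placeholders for a real cluster setup):

   ```bash
   # For each BEAUti-generated XML, create five run folders, each with its own
   # random seed and a minimal PBS submission script.
   for xml in *.xml; do
       for run in 1 2 3 4 5; do
           seed=$(python -c "import random; print(random.randint(1, 10**8))")
           dir="${xml%.xml}_run${run}"
           mkdir -p "$dir" && cp "$xml" "$dir/"
           cat > "$dir/beast_run.sh" <<EOF
   #!/bin/bash
   #PBS -N ${dir}
   cd \$PBS_O_WORKDIR
   beast -seed ${seed} ${xml}
   EOF
       done
   done
   ```
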
9. I submitted all of the jobs to the supercomputer simultaneously by logging in with passwordless ssh access and pasting in, at once, the `qsub` submission commands for all of the jobs.
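
   One way to do the same thing with a loop rather than pasted commands, continuing the hypothetical layout from the sketch above:

   ```bash
   # Submit every run from inside its own folder so each job starts there.
   for script in */beast_run.sh; do
       ( cd "$(dirname "$script")" && qsub "$(basename "$script")" )
   done
   ```
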
One of the tricks to getting SNAPP to work for you is to experiment with different models and prior settings, in order to find the most suitable priors for your given dataset. For example, for some datasets the default SNAPP settings will work fine and result in nicely converged runs with parameter traces that have reached stationarity. However, I'm pretty sure that for many other datasets the SNAPP defaults will not produce good results. For some of my analyses, fixing *u* and/or *v*, or setting uninformative uniform priors within reasonable bounds, was necessary to achieve good sampling properties (nice, "fuzzy caterpillar" parameter traces and good ESS scores).
- Bouckaert et al. (2014)
- Bouckaert and Bryant (REF)
- Bryant et al. (2012)
- Demos et al. (2015)
- Drummond et al. (2012)
- Leaché et al. (REF)
- Stange et al. (2017). bioRxiv.