Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save justincbagley/86b8c28280abc2a71e90bbd652dc9eb1 to your computer and use it in GitHub Desktop.
Save justincbagley/86b8c28280abc2a71e90bbd652dc9eb1 to your computer and use it in GitHub Desktop.
Notes on Analyzing ddRADseq SNP data in SNAPP (BEAST module)

Notes on Analyzing ddRADseq SNP Data in SNAPP (BEAST v2.4++)

This markdown note describes how I used several software programs to process and eventually analyze SNPs from ddRAD tag loci in SNAPP (Bryant et al. 2012). The data were generated using Next-Generation Sequencing on ddRADseq genomic libraries prepped for several species/lineages of Neotropical freshwater fishes from the Brazilian Cerrado (Central Brazil).

My account is given in first person and represents merely one way to analyze data in SNAPP; there are other approaches, and other documents (e.g. the Leaché et al. BDF* tutorial doc) also present a general approach. However, all the brief SNAPP guides and tutorials that are currently available require the user to consult the manual, A Rought Guide to SNAPP, written by Bouckaert and Bryant. Since SNAPP is amply covered by Bryant et al. (2012), Leaché et al. (2014), and other papers, I'll skip the introduction to SNAPP and assume the reader is acquainted with the details of the method, input, and output.

Data Preparation

  1. I converted raw NGS data in .bam format to .fastq format, using samtools. This was done on Ion Torrent data, and may not be applicable to other platforms or non-second-generation sequencers.

  2. I conducted quality checks on fastq files using the software program, fastqc.

  3. pyRAD assembly and processing pyRAD output.

    • I ran fastq files in pyRAD under different parameter settings, then chose a final pyRAD run, and its corresponding output assembly, stats, and output files in various phylogenetic and population genetic formats (e.g. ".phy", ".nex", ".str", ".geno", ".snps").
    • I specified for pyRAD to output all single nucleotide polymorphisms (SNPs) in phylip format, and also an unlinked .snps file to work from.
    • Before going further, I removed the outgroup from each datafile so that only ingroup samples were included in subsequent analyses (so I had the original SNPs file and the new file with no outgroups).
  4. I downloaded the phrynomics R package and loaded the full set of SNP loci (with variable and non-variable sites), which had the ".snps" file extension, into phyrnomics, where I then

      1. removed nonbinary sites ("RemoveNonBinary"),
      1. identified and pulled out a single SNP from each locus to use in the analyses ("TakeSingleSNPfromEachLocus"),
      1. translated bases to appropriate format/coding ("TranslateBases"), and 4) wrote modified SNPs to a NEXUS file with missing data coded as ?'s.
    • I also took advantage of this opportunity to convert my original SNPs data to MrBayes and RAxML format within phrynomics.
    • All of my R code for step #4 here is in a file named "Hh_and_Hyp_Phrynomics_R_code... txt" in the bin folder I usually use as my R working directory...

Input File Generation

  1. I then modified the NEXUS file so that the datatype was specified as integer, i.e. by setting "FORMAT DATATYPE=integerdata symbols="012" gap=- MISSING=?;" in the nexus header. This way the data was set to be discrete.

  2. I modified each individual taxon name so that it was preceded by a code with the format "Species_Taxon", where Species = the hypothetical species delimitation of the taxon. I did this because SNAPP expects this format, and it will automatically pull the Species data from the filenames and use it to help set up the species tree and population parameters in the analysis, e.g. branch lengths and thetas. The final datafile was obviously a nexus file with extension ".nex".

  3. I opened a new SNAPP template in BEAUti 2.4.2 and added the final SNAPP-formatted data file as an alignment, then set up different runs with different priors/models and saved them as different XML files (input files for BEAST), with different names.

Running SNAPP

  1. I used my shell script "BEASTRunner_newRand.sh" to automate creating run folders and queue submission shell scripts for five runs of each XML file, each starting from a different random starting seed selected by calling on python to generate a random number.

  2. I manually simultaneously submitted all jobs to the supercomputer by logging in with passwordless ssh access and pasting, at once, commands to submit all of the jobs using qsub.

Tricky Stuff

One of the tricks to getting SNAPP to work for you is to experiment with different models and prior settings, in order to find the most suitable priors for your given dataset. For example, for some datasets, the default SNAPP settings will work fine and result in nicely converged runs with parameter traces that reached stationarity. However, I'm pretty sure that for many other datasets, the SNAPP defaults will not produce good results. For some of my analyes, fixing u and/or v, or setting uninformative uniform priors within reasonable bounds, was necessary to achieve good sampling properties (nice, "fuzzy caterpillar" parameter traces and good ESS scores).

References

  • Bouckaert et al. (2014)
  • Bouckaert and Bryant (REF)
  • Bryant et al. (2012)
  • Demos et al. (2015)
  • Drummond et al. (2012)
  • Leaché et al. (REF)
  • Stange et al. (2017). bioarxiv.
@k-sanchez
Copy link

Hi Justin, I'm having this problem when I try to translate my phylip alignment to the nexus binary format:

#Prepare data for `SNAPP`

snps <- ReadSNP("alignment.snps.phy")
snps <- RemoveNonBinary(snps$data)

> In matrix(value, n, p) :
  data length [5046] is not a sub-multiple or multiple of the number of columns [4]

Do you know what could be the problem? Looks like a problem with the matrices but I dont know how to solve it.
Thanks !!

@npsonis
Copy link

npsonis commented Apr 10, 2020

Dear kis3990,
I had the same problem and I found that in the .phy input file you have to comment out (use a hashtag #) the first line where there is info about the data. So in your case, if you have for example 4 samples, it will be:
#4 5046.
Otherwise the first line is considered part of the data with the name of the first sample being "4".
Best

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment