Brant Faircloth brantfaircloth

Assembly

download files from s3 using 3hub
unzip respective files (started w/ Bin001.zip)
run process_reads.py (now part of https://github.com/faircloth-lab/illumiprocessor/):
```
python ~/git/brant/seqcap/Assembly/process_reads.py
```
run velvetoptimiser:

This is primarily directed towards preparing large amounts of UCE data for GenBank. However, parts of the following should work with most NGS data sets and other types of sequence data. Programs within phyluce are available from:

https://github.com/faircloth-lab/phyluce

Sequin will trim vector contamination, but it will not handle huge files (nor do you want to have it try). So, the vector screening portions below attempt to be equivalent to that process.

Calling SNPs with GATK in non-model taxa

These notes build from several excellent sources:

and assume you're working with GATK 2.2-16. These notes also assume

Install dependencies and Casava

The following assumes you are converting BCL files containing PE100 reads with a 10 nt index read. You can allow Casava to demultiplex for you or do it on your own, later. You can adjust values below if you are doing something different (e.g. shorter reads, longer indexes) but be careful.
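
The read layout described above (PE100 with a 10 nt index) can be written as a bases mask of the kind Casava's `--use-bases-mask` option accepts. The sketch below is illustrative only: `build_bases_mask` is a hypothetical helper, not part of Casava, and it assumes the run has exactly the stated number of cycles per read. Check the actual cycle counts in your run folder (e.g. RunInfo.xml) before relying on the result.

```python
def build_bases_mask(read_len, index_len, paired=True):
    """Return a bases-mask string such as 'Y100,I10,Y100'.

    Hypothetical helper for illustration: Y<n> marks n sequence cycles,
    I<n> marks n index cycles. Adjust if your run has extra cycles.
    """
    if read_len <= 0 or index_len < 0:
        raise ValueError("read_len must be > 0 and index_len must be >= 0")
    parts = ["Y{}".format(read_len)]          # read 1
    if index_len:
        parts.append("I{}".format(index_len)) # index read
    if paired:
        parts.append("Y{}".format(read_len))  # read 2
    return ",".join(parts)


if __name__ == "__main__":
    # PE100 with a 10 nt index read, as assumed in the text above
    print(build_bases_mask(100, 10))
```

Shorter reads or longer indexes only change the two numbers passed in, which is where the "be careful" above applies: the mask has to match the cycles that were actually sequenced.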

You need a pretty beefy machine. Illumina recommends something with multiple cores and 48 GB RAM, running CentOS 5. CentOS 6 also works just fine. See their recommendations here:

# start the instance:

ec2-run-instances --key /path/to/my/ec2-keypair ami-74f0061d --instance-type=c1.xlarge --block-device-mapping '/dev/sda2=ephemeral0' --block-device-mapping '/dev/sda3=ephemeral1'

# mount the ephemeral storage:

sudo su
mkdir /mnt/data
mount /dev/sda2 /mnt/data

start up ARDAgent (on remote machine via ssh):

sudo /System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources/kickstart -activate \
    -configure -users bcf -access -on -restart -agent -privs -all -allowAccessFor -specifiedUsers

start tunnel (from local to remote):

ssh -i keyfile -NfL 1202:127.0.0.1:5900 [email protected]

connect w/ (on local machine):

	import numpy

	s = '40 40 40 40 40'
	sl = s.rstrip().split(' ')
	si = [int(elem) for elem in sl]
	sa = numpy.array(si)

	from Bio.Nexus import Nexus

	aln = Nexus.Nexus()
	aln.read('my-properly-formatted-nexus-file.nex')

	# assuming your partitions are defined in a charset block like:
	#
	# begin sets;
	# charset bag2 = 1-186;
	# charset bag3 = 187-483;
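
For reference, the charset lines shown in the comment above can also be pulled out with nothing but the standard library. This is a minimal sketch, not the Bio.Nexus API: `parse_charsets` is a hypothetical helper, and it only handles simple `start-end` ranges like the two in the example (no step values, no comma-separated lists).

```python
import re

# Matches lines such as "charset bag2 = 1-186;" (simple ranges only)
CHARSET_RE = re.compile(
    r"charset\s+(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE
)


def parse_charsets(text):
    """Return {name: (start, end)} using the 1-based, inclusive NEXUS coordinates."""
    charsets = {}
    for name, start, end in CHARSET_RE.findall(text):
        charsets[name] = (int(start), int(end))
    return charsets


example = """
begin sets;
charset bag2 = 1-186;
charset bag3 = 187-483;
end;
"""

if __name__ == "__main__":
    for name, (start, end) in parse_charsets(example).items():
        # Partition length in sites; both ends are inclusive
        print(name, start, end, end - start + 1)
```

Keeping the coordinates 1-based and inclusive mirrors the NEXUS file itself; convert to 0-based indices only at the point where you slice an alignment.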

	#!/usr/bin/env python
	# encoding: utf-8
	"""
	File: mpi_sate.py
	Author: Brant Faircloth

	Created by Brant Faircloth on 04 May 2012 15:05 PDT (-0700)
	Copyright (c) 2012 Brant C. Faircloth. All rights reserved.

	Description:

	import os
	import tempfile
	from mpi4py import MPI

	comm = MPI.COMM_WORLD
	size = comm.Get_size()
	rank = comm.Get_rank()

	mode = MPI.MODE_RDONLY