First make sure you have a sufficiently modern Java (11 or higher). You can find out with:

```bash
java -version
```

If your Java is too old you can try installing a newer Java using sdkman.
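If you go that route, the steps look roughly like this (a sketch based on sdkman's standard install procedure; the exact Java version to install depends on what `sdk list java` shows on your system):

```bash
# Install sdkman (official one-liner from sdkman.io), then open a new shell
curl -s "https://get.sdkman.io" | bash

# List available Java versions and install one that is 11 or newer
sdk list java
sdk install java
```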
The following commands install bats-core under `~/.local`:

```bash
cd
mkdir -p ~/.local/bin
mkdir -p ~/.local/lib
git clone https://github.com/bats-core/bats-core.git
cd bats-core
./install.sh ~/.local/
cd ~/.local/lib
```
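To check that the install worked, you can ask `bats` for its version (this assumes `~/.local/bin` is on your `PATH`; otherwise call it by its full path):

```bash
~/.local/bin/bats --version
```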
This method relies on bioawk. First make sure you have bioawk installed. Then download the file `split_fasta.awk` from this repository. Instructions below assume you have this file available in your working directory.
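For context, this is not the repository's `split_fasta.awk`, just a rough illustration of the general idea using bioawk's fastx parser to write each FASTA record to its own file (the input file name is a placeholder):

```bash
# Illustrative only -- not the actual split_fasta.awk from the repository.
# bioawk -c fastx exposes each record's $name and $seq.
bioawk -c fastx '{ print ">"$name"\n"$seq > ($name ".fasta") }' input.fasta
```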
Create a `bin` directory if you haven't already:

```bash
cd ~
mkdir bin
```
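If this directory is not already on your `PATH`, you will typically want to add it (this assumes bash and `~/.bashrc`; adjust for your own shell's startup file):

```bash
# Add ~/bin to the PATH for future shells
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```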
The first port of call for info on the JCU HPC system is the official wiki. This gist is a supplement to the main wiki that provides some quick answers to common questions and links to this wiki as well as other useful resources.
This gist assumes that your local machine (i.e. your personal computer, not the HPC) is running a Unix-like OS (macOS or Linux). Windows users should consider setting up Windows Subsystem for Linux so that they can also have a Unix-like operating system to work with.
It is a fairly substantial collection of high performance computers. At the time of writing this consisted of 15 nodes, each of which has 80 CPUs and just under 400 GB of memory. All the nodes are networked together so that large jobs can be distributed across multiple nodes. A range of high capacity data storage is also networked to HPC accounts as [detailed here](ht
The `ms` command usage looks like this:

```
usage: ms nsam howmany
```

So it is necessary to provide `nsam` (the number of haplotypes to be sampled) and `howmany`, which is the number of replicate sets of data to generate. For PSMC data we always choose `nsam` to be 2 because the method is designed for diploid genomes. For convenience, `howmany` should just be set to 1 because we will rerun `ms` to generate separate random replicate datasets.
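As an illustration only (the mutation and recombination parameters below are placeholders, not values from this tutorial), a single PSMC-style replicate could be simulated with something like:

```bash
# 2 haplotypes (one diploid genome), 1 replicate;
# -t sets theta, -r sets rho and the number of sites (placeholder values)
ms 2 1 -t 30000 -r 6000 10000000 > ms_rep1.txt
```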
Run these commands from within the tutorial 2 directory:

```bash
mkdir -p bin
touch bin/greet.sh
```

Copy the content from `greet.sh` above into the file `bin/greet.sh` that you just created. Then change the file's permissions so that it is executable.
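The exact command is not shown here, but the usual way to make the script executable is:

```bash
chmod u+x bin/greet.sh
```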
````awk
# Convert fenced "question_" blocks in a markdown file into bash functions.
BEGIN { in_q=0 }
# A code fence whose info string contains question_<letters><number> opens a block;
# emit the question id as a bash function header.
/^```.*question_/ {
    in_q=1;
    match($0,"question_[a-z]*[0-9]+")
    printf("%s(){\n",substr($0,RSTART,RLENGTH));
}
/#Your answer here/
````
```r
library(tidyverse)

# Detail plot for one scaffold: filter the gff annotation to the focal scaffold
# and extract a gene id from the attributes column for labelling.
sweep_detail_plot <- function(scaffold, xl, rl, gff, sf2_data){
  anno_data <- gff %>%
    filter(seqid == scaffold) %>%
    mutate(geneid = str_extract(attributes, "m[^\\;]+"))
  anno_data$type <- factor(anno_data$type)
```
Assuming we have a FASTA file of proteins with ids generated from Trinity and Transdecoder called `transdecoder.pep`, truncate the names as follows:

```bash
cat transdecoder.pep | sed -r 's/[^:]*::/>/' > transdecoder_truncated.pep
```

Note that on a Mac you should use `-E` instead of `-r`.
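A quick way to confirm the result (just an illustrative check, not part of the original instructions) is to compare a few headers before and after:

```bash
grep '^>' transdecoder.pep | head -n 3
grep '^>' transdecoder_truncated.pep | head -n 3
```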
This data consists of VCF output from Stacks. See this post for some info about this output.

The general filtering strategy is as follows (a sketch of one way to apply these filters is given after the list):

- Remove sites where the minor allele frequency is too low, as these might also be the result of sequencing or alignment errors in a handful of individuals.
- Remove individuals where the depth is too low. Ideally we would use a likelihood based scoring measure here instead (e.g. the GQ field) but this is not provided by Stacks.
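Below is a minimal sketch of how such filters might be applied with vcftools, assuming the Stacks VCF is named `populations.snps.vcf`; the file name and thresholds are placeholders, not values from the original.

```bash
# --maf drops sites whose minor allele frequency is below the threshold;
# --minDP sets genotype calls with read depth below the threshold to missing,
# which is one way to handle low-depth individuals at a site.
vcftools --vcf populations.snps.vcf --maf 0.05 --minDP 5 --recode --out filtered
```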