First make sure you have a sufficiently modern Java (11 or higher). You can find out with:

```bash
java -version
```

If your Java is too old you can try installing a newer Java using sdkman.
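If you go that route, the steps look roughly like this (a sketch based on sdkman's standard install procedure; the exact Java version to install depends on what `sdk list java` shows on your system):

```bash
# Install sdkman (official one-liner from sdkman.io), then open a new shell
curl -s "https://get.sdkman.io" | bash

# List available Java versions and install one that is 11 or newer
sdk list java
sdk install java
```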
The following commands install bats-core under `~/.local`:

```bash
cd
mkdir -p ~/.local/bin
mkdir -p ~/.local/lib
git clone https://github.com/bats-core/bats-core.git
cd bats-core
./install.sh ~/.local/
cd ~/.local/lib
```
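To check that the install worked, you can ask `bats` for its version (this assumes `~/.local/bin` is on your `PATH`; otherwise call it by its full path):

```bash
~/.local/bin/bats --version
```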
This method relies on bioawk. First make sure you have bioawk installed. Then download the file `split_fasta.awk` from this repository. Instructions below assume you have this file available in your working directory.
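For context, this is not the repository's `split_fasta.awk`, just a rough illustration of the general idea using bioawk's fastx parser to write each FASTA record to its own file (the input file name is a placeholder):

```bash
# Illustrative only -- not the actual split_fasta.awk from the repository.
# bioawk -c fastx exposes each record's $name and $seq.
bioawk -c fastx '{ print ">"$name"\n"$seq > ($name ".fasta") }' input.fasta
```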
Create a `bin` directory if you haven't already:

```bash
cd ~
mkdir bin
```
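If this directory is not already on your `PATH`, you will typically want to add it (this assumes bash and `~/.bashrc`; adjust for your own shell's startup file):

```bash
# Add ~/bin to the PATH for future shells
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```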
The first port of call for info on the JCU HPC system is the official wiki. This gist is a supplement to the main wiki that provides some quick answers to common questions and links to this wiki as well as other useful resources.
This gist assumes that your local machine (i.e. your personal computer, not the HPC) is running a Unix-like OS (macOS or Linux). Windows users should consider setting up Windows Subsystem for Linux so that they can also have a Unix-like operating system to work with.
It is a fairly substantial collection of high performance computers. At the time of writing this consisted of 15 nodes, each of which has 80 CPUs and just under 400 GB of memory. All the nodes are networked together so that large jobs can be distributed across multiple nodes. A range of high capacity data storage is also networked to HPC accounts as [detailed here](ht
The `ms` command usage looks like this:

```
usage: ms nsam howmany
```

So it is necessary to provide `nsam` (the number of haplotypes to be sampled) and `howmany`, which is the number of replicate sets of data to generate. For PSMC data we always choose `nsam` to be 2 because the method is designed for diploid genomes. For convenience, `howmany` should just be set to 1 because we will rerun `ms` to generate separate random replicate datasets.
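As an illustration only (the mutation and recombination parameters below are placeholders, not values from this tutorial), a single PSMC-style replicate could be simulated with something like:

```bash
# 2 haplotypes (one diploid genome), 1 replicate;
# -t sets theta, -r sets rho and the number of sites (placeholder values)
ms 2 1 -t 30000 -r 6000 10000000 > ms_rep1.txt
```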
Run these commands from within the tutorial 2 directory:

```bash
mkdir -p bin
touch bin/greet.sh
```

Copy the content from `greet.sh` above into the file `bin/greet.sh` that you just created. Then change the file's permissions so that it is executable.
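The exact command is not shown here, but the usual way to make the script executable is:

```bash
chmod u+x bin/greet.sh
```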
````awk
# Convert fenced "question_" blocks in a markdown file into bash functions.
BEGIN { in_q=0 }
# A code fence whose info string contains question_<letters><number> opens a block;
# emit the question id as a bash function header.
/^```.*question_/ {
    in_q=1;
    match($0,"question_[a-z]*[0-9]+")
    printf("%s(){\n",substr($0,RSTART,RLENGTH));
}
/#Your answer here/
````
```r
library(tidyverse)

# Detail plot for one scaffold: filter the gff annotation to the focal scaffold
# and extract a gene id from the attributes column for labelling.
sweep_detail_plot <- function(scaffold, xl, rl, gff, sf2_data){
  anno_data <- gff %>%
    filter(seqid == scaffold) %>%
    mutate(geneid = str_extract(attributes, "m[^\\;]+"))
  anno_data$type <- factor(anno_data$type)
```
Assuming we have a FASTA file of proteins with ids generated from Trinity and Transdecoder called `transdecoder.pep`, truncate the names as follows:

```bash
cat transdecoder.pep | sed -r 's/[^:]*::/>/' > transdecoder_truncated.pep
```

Note that on a Mac you should use `-E` instead of `-r`.
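A quick way to confirm the result (just an illustrative check, not part of the original instructions) is to compare a few headers before and after:

```bash
grep '^>' transdecoder.pep | head -n 3
grep '^>' transdecoder_truncated.pep | head -n 3
```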
This data consists of VCF output from Stacks. See this post for some info about this output.

The general filtering strategy is as follows (a sketch of one way to apply these filters is given after the list):

- Remove sites where the minor allele frequency is too low, as these might also be the result of sequencing or alignment errors in a handful of individuals.
- Remove individuals where the depth is too low. Ideally we would use a likelihood based scoring measure here instead (e.g. the GQ field) but this is not provided by Stacks.
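Below is a minimal sketch of how such filters might be applied with vcftools, assuming the Stacks VCF is named `populations.snps.vcf`; the file name and thresholds are placeholders, not values from the original.

```bash
# --maf drops sites whose minor allele frequency is below the threshold;
# --minDP sets genotype calls with read depth below the threshold to missing,
# which is one way to handle low-depth individuals at a site.
vcftools --vcf populations.snps.vcf --maf 0.05 --minDP 5 --recode --out filtered
```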