Skip to content

Instantly share code, notes, and snippets.

@iracooke
Last active May 7, 2021 07:32
Show Gist options
  • Save iracooke/0bc1f3d710ed36fa05f4a063ff692ca9 to your computer and use it in GitHub Desktop.
Save iracooke/0bc1f3d710ed36fa05f4a063ff692ca9 to your computer and use it in GitHub Desktop.
Stacks vcf filtering

Basic filtering for AGRF RADseq data

This data consists of vcf file output from Stacks. See this post for some info about this output.

The general filtering strategy is as follows;

  1. Remove sites where the minor allele frequency is too low as these might also be the result of sequencing or alignment errors in a handful of individuals.

  2. Remove individuals where the depth is too low. Ideally we would use a likelihood based scoring measure here instead (eg GQ field) but this is not provided by Stacks.

  3. Only retain biallelic sites. This makes our MAF filtering more sensible and is also a requirement of many downstream programs.

  4. Remove sites where few markers have been genotyped

We can do this with vcftools as follows

	vcftools --vcf my.vcf --stdout --recode --recode-INFO-all \
	--maf 0.01 \
	--minDP 5 \
	--min-alleles 2 --max-alleles 2 \
	--max-missing 0.5 \
	> my_filtered.vcf

As a final step we would like to retain only one site per marker. A reasonable criterion for choosing the best site would be to keep that with the maximum number of genotyped individuals. We do this with an awk script. Since our previous filtering step filtered individuals based on depth we would have invalidated the NS field in the vcf (which counts the number of genotyped individuals). We can still filter this way but need to manually count the genotyped individuals in our awk script.

	cat my_filtered.vcf | awk -f pick_best.awk > my_filtered_best.vcf
---
title: "DAPC Analysis"
output: html_notebook
---
```{r}
library(adegenet)
# pegas 0.9 on CRAN has a bug. You need to install the github version to make the loci2genind function work.
devtools::install_github("emmanuelparadis/pegas",subdir = "pegas")
library(pegas)
# Use the read.vcf function from pegas
vcf <- read.vcf("my_filtered_best.vcf")
# Convert to genind object
genind <- loci2genind(vcf)
# If you get errors about NA values you may need to use the tab function to fill them with appropriate values
grp <- find.clusters(genind, max.n.clust=40)
```
Continue from here to do a full DAPC analysis. Good tutorials are available [here](https://github.com/thibautjombart/adegenet/wiki/Tutorials)
BEGIN {
best_ns=0
best_ns_record="start"
previous_marker=-1
}
/^#/ {
print $0
}
!/#/ {
this_marker = $3
if ( best_ns_record == "start" ){
best_ns_record = $0
previous_marker = this_marker
}
# Check to see if we have transitioned to a new marker
# if we have then print the best record from the previous marker
# and reset best_ns
if ( this_marker != previous_marker ){
print best_ns_record
best_ns = 0
best_ns_record = ""
previous_marker = this_marker
}
gt_index=-1
ln = split($9,format_data,":")
for(i=1;i<ln;i++){
if(format_data[i] == "GT"){
gt_index=i
}
}
if( gt_index == -1){
print "Error: unable to find GT index on line",NR;
exit 1
}
# Extract the NS information from the current record by counting ./. genotypes
#
this_ns = 0
for(i=10;i<=NF;i++){
split($i,site_data,":")
if( site_data[gt_index] != "./."){
this_ns += 1
}
}
# Then check to see if this is better than any other records for this marker
# and update best_ns accordingly
# if ( match($8,"NS=[0-9]+")) {
#this_ns = substr($8,RSTART+3,RLENGTH-3) + 0
if ( this_ns > best_ns ){
best_ns_record=$0
best_ns=this_ns
}
# } else {
# print "Error: No NS field found on line",NR;
# exit 1
# }
}
END {
print best_ns_record
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment