This data consists of vcf file output from Stacks. See this post for some info about this output.
The general filtering strategy is as follows;
-
Remove sites where the minor allele frequency is too low as these might also be the result of sequencing or alignment errors in a handful of individuals.
-
Remove individuals where the depth is too low. Ideally we would use a likelihood based scoring measure here instead (eg GQ field) but this is not provided by Stacks.
-
Only retain biallelic sites. This makes our MAF filtering more sensible and is also a requirement of many downstream programs.
-
Remove sites where few markers have been genotyped
We can do this with vcftools as follows
vcftools --vcf my.vcf --stdout --recode --recode-INFO-all \
--maf 0.01 \
--minDP 5 \
--min-alleles 2 --max-alleles 2 \
--max-missing 0.5 \
> my_filtered.vcf
As a final step we would like to retain only one site per marker. A reasonable criterion for choosing the best site would be to keep that with the maximum number of genotyped individuals. We do this with an awk script. Since our previous filtering step filtered individuals based on depth we would have invalidated the NS field in the vcf (which counts the number of genotyped individuals). We can still filter this way but need to manually count the genotyped individuals in our awk script.
cat my_filtered.vcf | awk -f pick_best.awk > my_filtered_best.vcf