Skip to content

Instantly share code, notes, and snippets.

@srynobio
Created February 10, 2020 19:28
Show Gist options
  • Save srynobio/c43e8689d2eb7c929dc7168a2e4e532f to your computer and use it in GitHub Desktop.
Save srynobio/c43e8689d2eb7c929dc7168a2e4e532f to your computer and use it in GitHub Desktop.

Run below as script (after configuring the bash variables)

First step subsets VCF to trio for Slivar (discard sites where family is all homref)

Second step subsets the trio VCF to individuals for Fabric upload (keep homref sites)

Notes:

  1. Reheader VCF to correct bug in Sentieon where AD format in header is "." instead of "R" This is important for both decomposing multi-allelic sites and subsetting by sample. This step will no longer be necessary if Sentieon fixes bug

  2. For GRCh38, filtering by AF tag to remove variants common in the joint-called backgrounds is necessary to remove common variants that are missed by a gnomAD liftover from GRCh37-called variants due to allele or strand swaps Set frequency threshold slightly higher than usual (e.g. 2% in below example) since your family's alleles are included in the AF calculation. Will no longer be necessary when using gnomAD v3 (natively aligned and called to GRCh38)

  3. Fabric's implementation of VAAST does not consider variants without a "PASS" in the FILTER field. Step 2 below will erase the FILTER field and replace with ".", in which case VAAST will consider all variants for maximum sensitivity, at the expense of more false positives. UCGD will change VQSR sensitivity for NeoSeq from 90.0 to 99.9%; will mostly eliminate the need to do this.

  4. To reduce false positive de novo calls, consider running the GATK tool PhaseByTransmission on the trio VCF, according to the separate workflow.

#!/bin/bash
#
PROJ=<19-08-07_RPN-Tristani-Coarctation>
ANL=<A637>
YEAR=<2019>
STAMP=<1566616666>
WD=</your/working/directory>
#
VCF=/scratch/ucgd/lustre/UCGD_Datahub/Repository/AnalysisData/"$YEAR"/"$ANL"/"$PROJ"/UCGD/GRCh38/VCF/Complete/"$PROJ".Final_"$STAMP".vcf.gz
PED=/scratch/ucgd/lustre/UCGD_Datahub/Repository/AnalysisData/"$YEAR"/"$ANL"/"$PROJ"/UCGD/GRCh38/Reports/"$PROJ".ped
FASTA=/scratch/ucgd/lustre/common/data/Reference/GRCh38/human_g1k_v38_decoy_phix.fasta
#
cd $WD
grep -iv 'Kindred_ID' $PED | cut -f 2 | sort > "$PROJ"_ped-sample-list.txt
#
# 1) Reheader, decompose & normalize, filter by AF (for GRCh38), and split out family VCF
# Remove variants hom ref in all family members, i.e. background-only variants
# https://github.com/brentp/slivar/wiki/decomposing-and-subsetting-vcfs
#
bcftools view -h $VCF | sed 's/##FORMAT=<ID=AD,Number=\./##FORMAT=<ID=AD,Number=R/' > header.vcf
bcftools reheader -h header.vcf $VCF | \
bcftools norm -Ou -m - -f $FASTA -w 10000 - | \
bcftools view -Ou --include 'AF<0.02' - | \
bcftools view -Oz -S "$PROJ"_ped-sample-list.txt --no-update -a --min-ac=1:nref - -o "$PROJ"_rehead-norm-rare-fam.vcf.gz
#
VCFR="$PROJ"_rehead-norm-rare-fam.vcf.gz # Use this for Slivar or other filtering
#
# 2) Split out per-sample VCFs for Fabric upload (keep homrefs for Fabric filtering) and remove VQSR FILTER (see note)
#
mkdir ./Fabric_vcfs
cat "$PROJ"_ped-sample-list.txt | parallel 'bcftools view -Ou -s {} '$VCFR' | \
bcftools annotate -Ov -x FILTER - | gzip -c > ./Fabric_vcfs/{}_rare-allpass.vcf.gz'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment