In this exercise we are going to download human genomic data generated by the New York Genome Center (NYGC). Information about the data can be found here and details about their pipeline can be found here.
We will start by setting up a git repository on the cluster to hold our project data.
$ mkdir brca1-project
$ cd brca1-project
$ git init
$ mkdir scripts
$ mkdir data
$ mkdir results
$ nano .gitignore
$ git add .gitignore
$ cat .gitignore
*.cram
*.crai
*.bcf
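Before committing, it can be worth confirming that the patterns in .gitignore actually match the files you want excluded. The snippet below is a sketch in a throwaway repository (the filename sample.final.cram is made up); `git check-ignore` prints a path and exits 0 when an ignore rule matches it.

```shell
# Build a throwaway repo to test the ignore rules against a hypothetical file.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
printf '*.cram\n*.crai\n*.bcf\n' > .gitignore
# Prints "sample.final.cram" because it matches the *.cram pattern;
# the file does not need to exist for check-ignore to test the pathname.
git check-ignore sample.final.cram
```

In your actual project repository you would run `git check-ignore` directly on the paths under data/.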
Next we will download metadata for the project.
$ cd data
$ curl -OLJ http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_698_related_high_coverage.sequence.index
$ git add 1000G_698_related_high_coverage.sequence.index
$ cd ..
We will not download all of the data. Instead, we will download only the portions of the CRAM
files that align to the BRCA1 locus, which lies on chromosome 17 between positions 43,044,295 and 43,125,364. To download the data we will use a script that generates another script, which in turn downloads the data for us as a cluster job.
$ nano scripts/pbatch_header.txt
$ cat scripts/pbatch_header.txt
#!/bin/bash
#SBATCH -N 1 # number of nodes
#SBATCH -n 1 # number of "tasks" (default: allocates 1 core per task)
#SBATCH -t 0-10:00:00 # time in d-hh:mm:ss
#SBATCH -p serial # partition
#SBATCH -q normal # QOS
#SBATCH -o slurm.%j.out # file to save job's STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err # file to save job's STDERR (%j = JobId)
#SBATCH --mail-type=ALL # Send an e-mail when a job starts, stops, or fails
#SBATCH [email protected] # Mail-to address
#SBATCH --export=NONE # Purge the job-submitting shell environment
$ nano scripts/make_download.bash
$ cat scripts/make_download.bash
cat scripts/pbatch_header.txt
echo "module purge"
echo "module add samtools/1.12.0"
echo ""
echo "mkdir -p data/cram"
echo "cd data/cram"
cat data/1000G_698_related_high_coverage.sequence.index |
grep -v '^#' | head -n 30 |
awk -F'\t' '{print "samtools view --write-index -o " $10 ".brca1.cram " $1 " chr17:43044295-43125364"}'
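To see what the generated commands will look like, you can feed the awk one-liner a fabricated index line (the URL and sample ID below are made up; in the real index, column 1 is the CRAM location and column 10 is the sample ID):

```shell
# A fabricated tab-separated index line: field 1 is the CRAM URL,
# fields 2-9 are placeholders, field 10 is the sample ID.
printf 'ftp://example.org/data/HG00405.final.cram\tx\tx\tx\tx\tx\tx\tx\tx\tHG00405\n' |
awk -F'\t' '{print "samtools view --write-index -o " $10 ".brca1.cram " $1 " chr17:43044295-43125364"}'
# prints: samtools view --write-index -o HG00405.brca1.cram ftp://example.org/data/HG00405.final.cram chr17:43044295-43125364
```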
Now let's run our script to create our download script and submit it to the cluster.
$ bash scripts/make_download.bash > scripts/download.bash
$ sbatch scripts/download.bash
After the job has finished we can clean up the indexes we don't need anymore.
$ rm data/cram/*.final.cram.crai
$ git add scripts/pbatch_header.txt scripts/make_download.bash scripts/download.bash
If you are having trouble downloading the genomic data because of network issues, you can try running the following command instead.
curl -sL 'https://www.dropbox.com/s/ifg2s6b0c34lz2w/brca1-project-data.tar.gz?dl=1' | tar xvf -
After acquiring our data, we will start an interactive session and merge our CRAM files into a single file. This will make variant calling a bit easier.
$ interactive
$ nano scripts/merge_crams.bash
$ cat scripts/merge_crams.bash
module purge
module add samtools/1.12.0
mkdir -p data/cram_joined
samtools merge --write-index data/cram_joined/brca1_data.cram data/cram/*.cram
$ bash scripts/merge_crams.bash
$ git add scripts/merge_crams.bash
We will call variants using two different methods; the first method we will use is bcftools.
$ nano scripts/bcftools_call.bash
$ cat scripts/bcftools_call.bash
ref=/data/datasets/community/genome/human/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
module purge
module add bcftools/1.12.0
mkdir -p results/vcf
bcftools mpileup -O b -a AD,DP -o results/vcf/bcftools.raw.bcf \
-d 10000 -f $ref data/cram_joined/brca1_data.cram
bcftools call --ploidy 2 -m -v -o results/vcf/bcftools.calls.vcf \
results/vcf/bcftools.raw.bcf
$ bash scripts/bcftools_call.bash
$ git add scripts/bcftools_call.bash
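A quick sanity check on any VCF is to count its variant records, since header lines all begin with `#`. The snippet below demonstrates this on a fabricated minimal VCF; on the real output you would run the same grep against results/vcf/bcftools.calls.vcf.

```shell
# Fabricated minimal VCF (real VCFs are tab-separated; spaces are used here
# for readability since grep only inspects the leading character).
vcf=$(mktemp)
printf '%s\n' '##fileformat=VCFv4.2' '#CHROM POS ID REF ALT QUAL' \
  'chr17 43051000 . A G 99' 'chr17 43052000 . T TA 72' > "$vcf"
# Count variant records, i.e. lines that are not headers:
grep -vc '^#' "$vcf"    # prints 2
```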
The second method that we will use to call variants is GATK's HaplotypeCaller.
$ nano scripts/gatk_call.bash
$ cat scripts/gatk_call.bash
ref=/data/datasets/community/genome/human/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
module purge
module add gatk/4.2.5.0
mkdir -p results/vcf
gatk HaplotypeCaller \
-A DepthPerSampleHC \
-A DepthPerAlleleBySample \
-A Coverage \
-A ChromosomeCounts \
-R $ref \
-L chr17 \
-I data/cram_joined/brca1_data.cram \
-O results/vcf/gatk.calls.vcf
$ bash scripts/gatk_call.bash
$ git add scripts/gatk_call.bash
In order to compare the two variant call sets we will normalize them. Normalization left-aligns indels and splits multi-allelic records into bi-allelic ones, so that equivalent calls are represented identically.
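The multi-allelic split can be pictured with a toy awk one-liner. This is only a sketch of the ALT-splitting idea behind `bcftools norm -m -both`, demonstrated on a fabricated record; the real tool also left-aligns indels and updates genotypes and INFO fields.

```shell
# Toy illustration: a record with ALT "G,T" becomes two bi-allelic records.
# Input fields (fabricated): CHROM, POS, REF, ALT.
printf 'chr17\t43057051\tA\tG,T\n' |
awk -F'\t' '{ n = split($4, alts, ","); for (i = 1; i <= n; i++) print $1, $2, $3, alts[i] }'
# prints:
#   chr17 43057051 A G
#   chr17 43057051 A T
```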
$ nano scripts/filter_and_norm.bash
$ cat scripts/filter_and_norm.bash
input="$1"
ref=/data/datasets/community/genome/human/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
base=$(basename "$input" .calls.vcf)
dir=$(dirname "$input")
module purge
module add bcftools/1.12.0
bcftools view -o "${dir}/${base}.goodcalls.vcf" -e 'QUAL<50' "$input"
bcftools norm -o "${dir}/${base}.norm.vcf" -m -both -f $ref "${dir}/${base}.goodcalls.vcf"
$ bash scripts/filter_and_norm.bash results/vcf/bcftools.calls.vcf
$ bash scripts/filter_and_norm.bash results/vcf/gatk.calls.vcf
$ git add scripts/filter_and_norm.bash
$ git add results/vcf/*.vcf results/vcf/*.idx
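Once both normalized VCFs exist, you can compare them with ordinary text tools. The sketch below extracts CHROM:POS:REF:ALT keys and intersects them with `comm` (it uses bash process substitution). The two input files here are fabricated record sets; on the real data you would point `key` at results/vcf/bcftools.norm.vcf and results/vcf/gatk.norm.vcf.

```shell
# Two fabricated call sets sharing one variant (columns: CHROM POS ID REF ALT).
a=$(mktemp); b=$(mktemp)
printf 'chr17\t43051000\t.\tA\tG\nchr17\t43052000\t.\tT\tC\n' > "$a"
printf 'chr17\t43051000\t.\tA\tG\nchr17\t43053000\t.\tG\tA\n' > "$b"
# Reduce each VCF to sorted CHROM:POS:REF:ALT keys, skipping header lines.
key() { grep -v '^#' "$1" | awk -F'\t' '{print $1":"$2":"$4":"$5}' | sort; }
# Records present in both call sets:
comm -12 <(key "$a") <(key "$b")
# prints: chr17:43051000:A:G
```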