@mmterpstra, created November 23, 2016
Lexogen work J Hof

Alignment with HISAT 0.1.5-beta [cite] against the 1000 Genomes GRCh37 reference build and duplicate marking with Picard tools 1.130 [cite] were done by the GCC (co-author? van der Vries, G) using MOLGENIS [cite]. The data were then analysed further by sorting the reads by name (queryname; see sortbyname.sh below) with Picard tools, followed by htseq-count against the Ensembl 75 [cite] annotation, using only the last 500 bp of each transcript annotation to count expression. Since htseq-count [cite again?] does not honour the PCR duplicate flag, hard filtering on that flag was done with SAMtools [cite].

Description of the GCC pipeline: https://github.com/molgenis/NGS_RNA/blob/NGS_RNA-3.2.4/protocols/QC_Report.sh#L229

Extraction of the last 500 bp of transcript annotations (exons before the transcription end site): https://github.com/mmterpstra/pipeline-util/blob/master/bin/GTfGet1000bpExonsBeforeTES.pl
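For illustration only, a minimal awk sketch of the truncation idea (clipping per exon record; the Perl script above implements the real logic, taking the last 500 bp per transcript across its exons):

# keep only exon records and clip each one to its last 500 bp, strand-aware
# (GTF coordinates are 1-based and inclusive; output filename is a placeholder)
awk 'BEGIN{FS=OFS="\t"} $3=="exon" {
  if ($7=="+") { s=$5-499; if (s>$4) $4=s } else { e=$4+499; if (e<$5) $5=e }
  print
}' Homo_sapiens.GRCh37.75.gtf > example.tes500.gff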

Running the shell scripts:

mkdir -p logs/
# run the sort by name
for i in alignment/*.bam; do sbatch sortbyname.sh "$i" sortbyname/"$(basename "$i")"; done
# wait until the jobs are done; restart/cancel failed ones as necessary (see the check below)
for i in sortbyname/*.bam; do sbatch htseq-count.sh "$i" Homo_sapiens.GRCh37.75.tes500.gff tsv/"$(basename "$i" .bam)".tes500.tsv; done
# now with hard filtering for duplicates (nodup variant of the htseq script; filename assumed)
for i in sortbyname/*.bam; do sbatch htseq-count-nodup.sh "$i" Homo_sapiens.GRCh37.75.tes1000.gff tsv/"$(basename "$i" .bam)".tes1000.nodups.tsv; done
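Before moving on to the next loop, the job states can be checked (standard SLURM tooling assumed); failed jobs can be cleaned up via the generated cleanup.sh and resubmitted:

# show state and runtime of today's jobs
sacct --format=JobID,JobName%20,State,Elapsed --starttime today
# the scripts below append rm commands for stale outputs here; review, then run
cat cleanup.sh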

htseq-count-nodup.sh (filename assumed; counting with hard filtering of PCR duplicates):

#!/bin/bash
#SBATCH --job-name=Htseq_nodup
#SBATCH --output=logs/Htseq_%j.out
#SBATCH --error=logs/Htseq_%j.err
#SBATCH --partition=duo-pro
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task 2
#SBATCH --mem 8gb
#SBATCH --nodes 1
#SBATCH --open-mode=append
#SBATCH --export=NONE
#SBATCH --get-user-env=L
set -e;
set -x;
set -o pipefail;
#in
inBam=$1
inGFF=$2
outTsv=$3
#out
outDir=$(dirname $outTsv)
doneBase=${outDir}/$(basename ${outTsv} .tsv)
doneFile=$doneBase.$SLURM_JOB_ID.done
if [ "${outTsv}" != "" ] && [ "$doneBase" != "" ]; then
  if ls ${outTsv} 1> /dev/null 2>&1 && ls $doneBase.*.done 1> /dev/null 2>&1 ; then
    echo "## "$(date)" ## $0 Finished job exiting."
    exit 0
  else
    echo "## "$(date)" ## $0 Unfinished job restarting."
    if ls ${outTsv}* 1> /dev/null 2>&1 ; then
      echo rm -rv ${outTsv}* >> cleanup.sh
    fi
    if ls ${doneBase}.*.done 1> /dev/null 2>&1 ; then
      echo rm -rv ${doneBase}.*.done >> cleanup.sh
    fi
  fi
else
  echo "usage: $0 inBam inGFF outTsv" >&2
  exit 1
fi
echo "## "$(date)" ## $0 Started "
mkdir -p $outDir;
ml SAMtools/1.3.1-foss-2015b
ml HTSeq/0.6.1p1-foss-2015b-Python-2.7.11
# -F 1024 excludes reads flagged as PCR/optical duplicates (hard filtering)
( samtools view -h -F 1024 ${inBam} | htseq-count -m union -s yes -t exon -i gene_id - ${inGFF} > ${outTsv} )
touch $doneFile
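To check that the hard filter really removes the marked duplicates, a quick SAMtools sanity check (sample.bam is a placeholder) is:

# total reads and duplicates as reported by the flags
samtools flagstat sortbyname/sample.bam
# number of reads left after excluding the duplicate flag (what htseq-count sees here)
samtools view -c -F 1024 sortbyname/sample.bam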

htseq-count.sh (counting without duplicate filtering):

#!/bin/bash
#SBATCH --job-name=Htseq
#SBATCH --output=logs/Htseq_%j.out
#SBATCH --error=logs/Htseq_%j.err
#SBATCH --partition=duo-pro
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task 2
#SBATCH --mem 8gb
#SBATCH --nodes 1
#SBATCH --open-mode=append
#SBATCH --export=NONE
#SBATCH --get-user-env=L
set -e;
set -x;
set -o pipefail;
#in
inBam=$1
inGFF=$2
outTsv=$3
#out
outDir=$(dirname $outTsv)
doneBase=${outDir}/$(basename ${outTsv} .tsv)
doneFile=$doneBase.$SLURM_JOB_ID.done
# Guard against empty arguments: without this check the cleanup logic below
# could expand to patterns that remove far more data than intended.
if [ "${outTsv}" != "" ] && [ "$doneBase" != "" ]; then
  # Check the output location(s) and exit if the results are already present.
  # (Then you can simply restart all of your jobs instead of looking up the
  # specific job that needs to be rerun when a single job has failed.)
  if ls ${outTsv} 1> /dev/null 2>&1 && ls $doneBase.*.done 1> /dev/null 2>&1 ; then
    echo "## "$(date)" ## $0 Finished job exiting."
    exit 0
  else
    echo "## "$(date)" ## $0 Unfinished job restarting."
    if ls ${outTsv}* 1> /dev/null 2>&1 ; then
      echo rm -rv ${outTsv}* >> cleanup.sh
    fi
    if ls ${doneBase}.*.done 1> /dev/null 2>&1 ; then
      echo rm -rv ${doneBase}.*.done >> cleanup.sh
    fi
  fi
else
  echo "usage: $0 inBam inGFF outTsv" >&2
  exit 1
fi
echo "## "$(date)" ## $0 Started "
mkdir -p $outDir;
ml SAMtools/1.3.1-foss-2015b
ml HTSeq/0.6.1p1-foss-2015b-Python-2.7.11
# no -F 1024 here: reads flagged as duplicates are included in the counts
( samtools view -h ${inBam} | htseq-count -m union -s yes -t exon -i gene_id - ${inGFF} > ${outTsv} )
touch $doneFile
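htseq-count (0.6.x) appends special counters (__no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique) below the per-gene counts; when merging the TSVs into a count matrix later they are easy to strip first (sample name is a placeholder):

grep -v '^__' tsv/sample.tes500.tsv > tsv/sample.tes500.counts.tsv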

sortbyname.sh (Picard queryname sort):

#!/bin/bash
#SBATCH --job-name=SortSam
#SBATCH --output=logs/SortSam_%j.out
#SBATCH --error=logs/SortSam_%j.err
#SBATCH --partition=duo-pro
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task 2
#SBATCH --mem 8gb
#SBATCH --nodes 1
#SBATCH --open-mode=append
#SBATCH --export=NONE
#SBATCH --get-user-env=L
set -e;
set -x;
set -o pipefail;
#in
inBam=$1
outBam=$2
#out
outDir=$(dirname $outBam)
doneBase=${outDir}/$(basename ${outBam} .bam)
doneFile=$doneBase.$SLURM_JOB_ID.done
if [ "${outBam}" != "" ] && [ "$doneBase" != ""]; then
if ls ${outBam}/ 1> /dev/null 2>&1 && ls $doneBase.*.done 1> /dev/null 2>&1 ; then
echo "## "$(date)" ## $0 Finished job exiting."
exit 0
else
echo "## "$(date)" ## $0 Unfinished job restarting."
if ls ${outBam}* 1> /dev/null 2>&1 ; then
echo rm -rv ${outBam}* >> cleanup.sh
fi
if ls ${doneBase}.*.done 1> /dev/null 2>&1 ; then
echo rm -rv ${doneBase}.*.done >> cleanup.sh
fi
fi
else
echo "define input"
fi
echo "## "$(date)" ## $0 Started "
mkdir -p $outDir;
ml picard/1.140-foss-2015b-Java-1.8.0_45
# Picard SortSam: sort by read name (queryname) so htseq-count sees mate pairs together
java -XX:ParallelGCThreads=2 -Xmx8g -jar ${EBROOTPICARD}/picard.jar SortSam SORT_ORDER=queryname I=${inBam} O=${outBam}
touch $doneFile
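Picard writes the resulting sort order into the BAM header, so a name-sorted file can be verified before counting (sample.bam is a placeholder):

# the @HD line should read SO:queryname after SortSam
samtools view -H sortbyname/sample.bam | grep '^@HD'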