Reads were aligned with HISAT 0.1.5-beta [cite] to the 1000 Genomes (1kg) GRCh37 reference build, and duplicates were marked with Picard tools 1.130 [cite]; both steps were done by the GCC (coauthor? van der Vries, G) using MOLGENIS [cite]. The data were then analysed further by running a coordinate sort with Picard tools, followed by htseq-count on the reads using the Ensembl 75 [cite] annotation, with only the last 500 bp of each transcript annotation used for counting expression. Because htseq-count [cite again?] does not take the PCR duplicate flag into account, duplicates were hard filtered with SAMtools [cite].
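The duplicate hard filtering could be sketched as follows. This is a minimal illustration, not the exact command that was run: the helper names `is_duplicate` and `remove_duplicates` and their argument layout are assumptions. It relies on the SAM FLAG bit 0x400 (decimal 1024), which duplicate-marking tools set on PCR or optical duplicates, and on the `-F` option of `samtools view`, which drops reads that have any of the given bits set.

```shell
# SAM FLAG bit 0x400 (decimal 1024) marks a PCR or optical duplicate.
DUP_FLAG=1024

# Succeed (exit status 0) if the given SAM FLAG value has the duplicate bit set.
is_duplicate() {
    [ $(( $1 & DUP_FLAG )) -ne 0 ]
}

# Write a copy of a BAM file without duplicate-flagged reads.
# Hypothetical helper: $1 is the input BAM, $2 the output BAM.
remove_duplicates() {
    # -b: write BAM, -F: exclude reads with any of these FLAG bits set
    samtools view -b -F "$DUP_FLAG" -o "$2" "$1"
}
```

For example, `is_duplicate 1040` succeeds (1040 = 1024 + 16, a reverse-strand duplicate), whereas `is_duplicate 16` fails.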
Description of the GCC pipeline: https://github.com/molgenis/NGS_RNA/blob/NGS_RNA-3.2.4/protocols/QC_Report.sh#L229
Extraction of the last 500 bp of transcript annotations: https://github.com/mmterpstra/pipeline-util/blob/master/bin/GTfGet1000bpExonsBeforeTES.pl
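The idea behind restricting the annotation to the last 500 bp of each transcript can be sketched as below. This is not a port of the linked Perl script, only an illustration under assumptions: the function name `last_bp_of_transcripts` is hypothetical, the input is assumed to be a tab-separated GTF with `exon` features that carry a `transcript_id` attribute, and the output is ordered per transcript from the 3' end rather than by genomic position.

```shell
# Print the exon records covering the last N bp (default 500) of every
# transcript, trimming the exon in which the N bp budget runs out.
# Hypothetical helper: $1 is a GTF file, $2 the number of bp to keep.
last_bp_of_transcripts() {
    gtf=$1
    n=${2:-500}
    awk -F '\t' 'BEGIN { OFS = "\t" }
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            tid = substr($9, RSTART + 15, RLENGTH - 16)
            # Sort key that orders exons from the 3-prime end inward:
            # plus strand by descending end, minus strand by ascending start.
            key = ($7 == "+") ? -$5 : $4
            # Prefix the chromosome so IDs shared between contigs stay apart.
            print $1 ":" tid, key, $0
        }' "$gtf" |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F '\t' -v n="$n" 'BEGIN { OFS = "\t" }
        {
            if ($1 != prev) { left = n; prev = $1 }   # new transcript
            if (left <= 0) next                       # budget used up
            # Original GTF columns are shifted right by two.
            start = $6; end = $7; strand = $9
            len = end - start + 1
            if (len > left) {
                # Keep only the part of the exon nearest the transcript end.
                if (strand == "+") start = end - left + 1
                else end = start + left - 1
                len = left
            }
            left -= len
            print $3, $4, $5, start, end, $8, $9, $10, $11
        }'
}

# Example call; the input file name is an assumption:
# last_bp_of_transcripts Homo_sapiens.GRCh37.75.gtf 500 > Homo_sapiens.GRCh37.75.tes500.gff
```

Sorting before the second `awk` pass keeps the per-transcript bookkeeping down to a single counter, at the cost of losing the original record order.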
Running the shell scripts:
mkdir logs/
# run the sort by name
for i in alignment/*.bam; do sbatch sortbyname.sh "$i" "sortbyname/$(basename "$i")"; done
# wait until the jobs are done; restart or cancel failed jobs as necessary
for i in sortbyname/*.bam; do sbatch htseq-count.sh "$i" Homo_sapiens.GRCh37.75.tes500.gff "tsv/$(basename "$i" .bam).tes500.tsv"; done
# now with hard filtering for duplicates
for i in sortbyname/*.bam; do sbatch htseq-count.sh "$i" Homo_sapiens.GRCh37.75.tes1000.gff "tsv/$(basename "$i" .bam).tes1000.nodups.tsv"; done
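Waiting for the jobs and restarting failed ones, as noted in the comments above, can be made less error-prone with a small check. The sketch below lists the input BAM files that have no non-empty counterpart in the output directory. The helper name `missing_outputs` is hypothetical, and a non-empty file is only a rough proxy for success, so the Slurm logs or job states are still worth inspecting.

```shell
# List input BAM files that lack a non-empty file of the same name
# in the output directory.
# Hypothetical helper: $1 is the input directory, $2 the output directory.
missing_outputs() {
    in_dir=$1
    out_dir=$2
    for bam in "$in_dir"/*.bam; do
        [ -e "$bam" ] || continue             # the glob matched nothing
        out="$out_dir/$(basename "$bam")"
        # -s is true only if the file exists and is not empty
        [ -s "$out" ] || printf '%s\n' "$bam"
    done
}

# Example: resubmit only the name sorts that produced no output.
# missing_outputs alignment sortbyname | while read -r bam; do
#     sbatch sortbyname.sh "$bam" "sortbyname/$(basename "$bam")"
# done
```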