params.fasta = "/kbod2/ReferenceGenomes/GRCh37/human_g1k_v37_decoy_phix.fasta"
params.fai = "/kbod2/ReferenceGenomes/GRCh37/human_g1k_v37_decoy_phix.fai"
params.dbsnp = "/ybod2/cavery/Resources/dbsnp_146.hg38.vcf"
params.reference_mills_1000G = "/ybod2/cavery/Resources/Mills_and_1000G_gold_standard.indels.hg38.vcf"
params.reference_1000G_HighConfSNP = "/ybod2/cavery/Resources/1000G_phase1.snps.high_confidence.hg38.vcf"
params.results = "/ybod2/cavery/gatk_nextflow_files/"
Channel.fromFilePairs("/ybod2/cavery/Emory_Prahal_JIA_Fam/SL88821/*_R{1,2}_001.fastq.gz", flat: true)
.set { reads }
fasta_file = file(params.fasta)
fai_file = file(params.fai)
dbsnp_file = file(params.dbsnp)
reference_mills_1000G_file = file(params.reference_mills_1000G)
reference_1000G_HighConfSNP_file = file(params.reference_1000G_HighConfSNP)
Here you take values in as params and then set them as files, unless there is a reason for the second step it's not necessary as they are paths to "background" data files. Also keep in mind that when params are set they can be referenced anywhere in your code using the following syntax:
params.fasta = "/kbod2/ReferenceGenomes/GRCh37/human_g1k_v37_decoy_phix.fasta"
${fasta} or !{fasta}
depending on what process type you use. If you're using script
the '$' version is what you want.
process alignment {
Input:
file ref_fasta from fasta_file, set val(id), file(read1), file(read2) from reads
input:
file ref_fasta from fasta_file <--- not needed as you created a variable (above)
set val(id), file(read1), file(read2) from reads
## set is only used for values all coming from the same input channel.
output:
## for the flagstat below example:
set id, file("${id}_sorted_Dedup.bam") into bams_out
Script:
"""
bwa mem -t 16 -k 24 -R "@RG\tID:{id}\tPL:ILLUMINA\tPU:ILLUMINA-1\tSM:${id}" ${reference_fasta} ${read1}.fastq.gz ${read2}.fastq.gz | /ybod/csteely/bin/samblaster/samblaster --addMateTags |
sambamba view -f bam -l 0 -S /dev/stdin | sambamba sort --tmpdir=temp_storage/ -m 50G -o ${id_sorted_Dedup.bam /dev/stdin
"""
}
process flagstat {
Input:
file sorted_dedup_bams, set val(id)
## Here an issue occurs because you are trying to set a variable 'id' but there is no incoming channel that has it.
## something more like the following (if values were coming through the channel).
set val(id), file(sorted_dedup_bams) from bams_out
Output:
file ("${id}.sorted_Dedup.flagstat") into flagstats_channel
Script:
"""
samtools flagstat ${sorted_dedup_bams} > ${id}.sorted_Dedup.flagstat 2> flagstat.log-1
## this is correct.
"""
}
### so now that you have the content of bam_out going to two different places you need to split the channel using into: https://www.nextflow.io/docs/latest/operator.html#into
process base_recal {
conda '/ybod2/cavery/Bin/gatk-4.0.11.0/gatkcondaenv.yml' (best practice)
Input (updated):
Yours: set file ref_fasta from fasta_file, val(id), file(sorted_dedup_bams) from bams_out
set val(id), file(sorted_dedup_bams) from new_channel_name_from_the_split
These last three do not need to come in through channels as they have been set as variables, and you are using them correctly below.
file ref_dbsnp from dbsnp_file
file ref_mills from reference_mills_1000G_file,
file ref_highconf from reference_1000G_HighConfSNP_file
Output:
file ("${id}.recal_data.table") into bqsr_apply_table
Script:
"""
gatk BaseRecalibrator -R ${ref_fasta} -I ${sorted_dedup_bams} --known-sites ${ref_dbsnp} --known sites ${ref_mills} --known-sites ${ref_highconf} -O ${id}.recal_data.table
"""
}
Here is some of my public codebase using Nextflow and some links to specific ideas from above.
Here I use a variable to populate throughout the main nf script (search to see) splitting channels and here
Using set