Skip to content

Instantly share code, notes, and snippets.

@lindenb
Created August 7, 2019 12:32
Show Gist options
  • Save lindenb/e5e7025edddbff0fbb64bfa7ec24590a to your computer and use it in GitHub Desktop.
Save lindenb/e5e7025edddbff0fbb64bfa7ec24590a to your computer and use it in GitHub Desktop.
ideas for some new nextflow operators (would require htsjdk lib) . was : https://twitter.com/PaoloDiTommaso/status/1159073940448432129

sampleAndBam

returns an array (sample,bam) from a bam or a file.

it gets the SAM header from a BAM or a CRAM and extracts the identifier from the sam read groups https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMReadGroupRecord.html

Would be the equivalent of:

find "${params.bamdir}" -type f -name "*.bam" | while read F
do
	samtools view -H \$F | grep '^@RG' | tr "\\t" "\\n" | grep ^SM | cut -d ':' -f 2 | head -n 1 | tr "\\n" ","  && echo \$F
done > inputs.csv
Channel
    .fromPath("/path/to/input.bam")
    .sampleAndBam()
    .map { T->  T[0]+" "+T[1] }
    .println { it }


S1 /path/to/input.bam

you can also specify wich part of the read group you want' to extract

Channel
    .fromPath("/path/to/input.bam")
    .sampleAndBam(by:"LB")
    .map { T->  T[0]+" "+T[1] }
    .println { it }

L2 /path/to/input.bam
L1 /path/to/input.bam

samDictionary

extracts the SAM sequence dictionary from a file using https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/utils/SAMSequenceDictionaryExtractor.html

returns an array (name,length,file)

Channel
    .fromPath("/path/to/input.vcf")
    .samDictionary()
    .map { T->  T[0]+",length="+T[1]+","+T[2] }
    .println { it }


chr1,length=249250621,/path/to/input.vcf
chr2,length=243199373,/path/to/input.vcf

sampleInVcf

returns an array (sampleName,file)

extract the Samples from a VCF using : https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFHeader.html#getSampleNamesInOrder--

would be the equivalent of

bcftools view --header-only ${vcf} | grep "^#CHROM" -m1 | cut -f 10- | tr "\\t" "\\n" | awk '{printf("%s,${vcf}\\n",\$1);}
Channel
    .fromPath("/path/to/input.vcf")
    .sampleInVcf()
    .map { T->  T[0]+","+T[2] }
    .println { it }


sample1,/path/to/input.vcf
sample2,/path/to/input.vcf
sample3,/path/to/input.vcf
sample4,/path/to/input.vcf
@lindenb
Copy link
Author

lindenb commented Aug 7, 2019

A timeline is absolutely NOT expected.

of course :-)

@pditommaso
Copy link

😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment