Skip to content

Instantly share code, notes, and snippets.

@lindenb
Created August 7, 2019 12:32
Show Gist options
  • Save lindenb/e5e7025edddbff0fbb64bfa7ec24590a to your computer and use it in GitHub Desktop.
Save lindenb/e5e7025edddbff0fbb64bfa7ec24590a to your computer and use it in GitHub Desktop.
ideas for some new nextflow operators (would require htsjdk lib) . was : https://twitter.com/PaoloDiTommaso/status/1159073940448432129

sampleAndBam

returns an array (sample,bam) from a bam or a file.

it gets the SAM header from a BAM or a CRAM and extracts the identifier from the sam read groups https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMReadGroupRecord.html

Would be the equivalent of:

find "${params.bamdir}" -type f -name "*.bam" | while read F
do
	samtools view -H \$F | grep '^@RG' | tr "\\t" "\\n" | grep ^SM | cut -d ':' -f 2 | head -n 1 | tr "\\n" ","  && echo \$F
done > inputs.csv
Channel
    .fromPath("/path/to/input.bam")
    .sampleAndBam()
    .map { T->  T[0]+" "+T[1] }
    .println { it }


S1 /path/to/input.bam

you can also specify wich part of the read group you want' to extract

Channel
    .fromPath("/path/to/input.bam")
    .sampleAndBam(by:"LB")
    .map { T->  T[0]+" "+T[1] }
    .println { it }

L2 /path/to/input.bam
L1 /path/to/input.bam

samDictionary

extracts the SAM sequence dictionary from a file using https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/utils/SAMSequenceDictionaryExtractor.html

returns an array (name,length,file)

Channel
    .fromPath("/path/to/input.vcf")
    .samDictionary()
    .map { T->  T[0]+",length="+T[1]+","+T[2] }
    .println { it }


chr1,length=249250621,/path/to/input.vcf
chr2,length=243199373,/path/to/input.vcf

sampleInVcf

returns an array (sampleName,file)

extract the Samples from a VCF using : https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFHeader.html#getSampleNamesInOrder--

would be the equivalent of

bcftools view --header-only ${vcf} | grep "^#CHROM" -m1 | cut -f 10- | tr "\\t" "\\n" | awk '{printf("%s,${vcf}\\n",\$1);}
Channel
    .fromPath("/path/to/input.vcf")
    .sampleInVcf()
    .map { T->  T[0]+","+T[2] }
    .println { it }


sample1,/path/to/input.vcf
sample2,/path/to/input.vcf
sample3,/path/to/input.vcf
sample4,/path/to/input.vcf
@pditommaso
Copy link

Looks cool. I would only suggest maintaining the same naming pattern, something like: splitBam, splitSamDictionary, splitVcf .. or something along these lines.

@lindenb
Copy link
Author

lindenb commented Aug 7, 2019

@pditommaso I could implement this in my dead times but i need a few hints: how do you add one or more libraries in nf ? which already extisting base class should I copy + paste to implements those operators (+test ) ? where should I add the documentation ?

@pditommaso
Copy link

pditommaso commented Aug 7, 2019

A timeline is absolutely NOT expected. All splitters extend AbstractSplitter.

See TextSplitter and TextSplitterTest for an example.

The class is expected to be named SomethingSplitter, then it will be automatically exposed as .splitSomething.

Once you have the slitter class it can be exposed as an operator adding an extension method. For example splitText.

Since it expected to depend on some base class in the main NF module, also this should go in the same page as the current splitter, even tho it would be a nice to put in a separate module at some point.

The docs is here.

@lindenb
Copy link
Author

lindenb commented Aug 7, 2019

A timeline is absolutely NOT expected.

of course :-)

@pditommaso
Copy link

😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment