ballgowncreates a ballgown object from tablemaker outputballgownrsemcreates a ballgown object from RSEM output. (not yet well-tested).gffReadandgffReadGRread GTF (annotation) files into RgffReadgives you a data framegffReadGRgives you a GRanges objectgffReadGRwithsplitByTranscript=TRUEgives you a GRangesList of transcripts in the GTF file
The stattest function:
- can give it either a full ballgown object, or a feature-by-sample matrix (see argument
gowntable) - BE CAREFUL WITH PHENOTYPE DATA ORDER
- Returns data frame: feature ID, p-values, and q-values. No nonsense. Optionally returns coefficients (fold changes) for a 2-group comparison.
- can do timecourse tests automatically
- you can specify your own model matrices
- automatically library-adjusts and log transforms, but you can turn this off.
- functions that end in
exprextract expression matrices. There are default units for each of them:texpr(bg)gives you an FPKM transcript matrixeexpr(bg)gives you an exonrcountmatrix (read counts)iexpr(bg)gives you an intronrcountmatrix (# reads supporting the intron)
- adding the
alltag to a call to anexprfunction gives you genomic location data too.texpr(bg, 'all')is particularly informative (contains gene IDs, transcript lengths, etc)
structure(bg)$trans,structure(bg)$exon, andstructure(bg)$introngive assembly structures in GRangesList (transcripts) or GRanges (exon/intron) formtGenefinds the gene to which a transcript belongs. (such grammar wow)transcriptsIDs(bg),transcriptNames(bg)give sets of transcript IDs for bg object.geneIDs(bg),geneNames(bg)give sets of gene IDs for bg object.seqnames(bg)gives the chromosome names included in the object
indexes(bg)is list, so use$indexes(bg)$t2g,indexes(bg)$i2t,indexes(bg)$e2tconnect exons, introns, transcripts, genes.indexes(bg)$bamfilescan contain paths to bam files. Probably wise to have this in same order as pData, but it's not as crucial since no existing functions depend on this.
exprfiltersubsets a ballgown object based mean transcript expression (FPKM or cov)subsetsubsets a ballgown object based either on genomic location or on phenotype
pData(bg) is the phenotype component of a ballgown object.
SUPER IMPORTANT BUSINESS: make SURE the columns of pData are in the same order as the columns of texpr(bg). I've tried to implement some automatic checks, but I'm sure they are not perfect.
getGeneswas designed to label assembled transcripts with annotated gene namesannotate_assemblymatches two sets of transcripts based on overlap & calculates percents.checkAssembledTxplots assembled & annotated transcripts togethercontainschecks whether one set of GRanges is fully contained within another. (I found this useful for determining whether a set of assembled transcripts contained coding sequences).
clusterTranscriptsdoes k-means or hierarchical clustering of the transcripts within a genepctOverlapcalculates the percent overlap between two GRanges objectscollapseTranscriptsactually calculates cluster-level expression
plotTranscriptsplots all the isoforms of a specific geneplotMeansplots the isoform structure for a gene but makes a multipanel plot, one for each phenotype group, and colors the isoforms by mean abundance.plotLatentTranscriptsdoes transcript clustering and plots the results
- One of:
- fasta file containing annotated transcripts
- GTF file containing transcript structures + chromosome sequences. (check out the igenomes indexes).
- specify groups / fold changes.
That's all you need - Polyester can automate the rest.
Can do simple simulations with one function: simulate_experiment. It's fairly well-documented (in my completely biased opinion -- please let me know if it's confusing).
You can play around with different parts of the simulation using the internal functions. They're all exported.
generate_fragmentsreverse_complementget_readsadd_errorandadd_platform_error
You can also add a bunch of biases to your simulation:
- GC bias (not on github yet, coming ASAP)
- positional bias (in
generate_fragments) - non-normal fragment length distribution (in
generate_fragments)' - non-uniform error model (using
add_platform_errorinstead ofadd_error). In my experience this is a bit slow.
I don't have all the features fully wrapped into simulate_experiment yet so usability isn't quite guaranteed for the new stuff yet. I promise I'll put them in soon (definitely before the spring Bioconductor release, and hopefully well before then).
The other major feature of Polyester is that you can just define a read count matrix (number of reads to simulate per transcript per sample) and simulate that exact number of reads per transcript. The function is simulate_experiment_countmat.
I have some code for ROC curves hidden in the ballgown package ballgown:::assessSim. (It's too specific to export, but code for calculating sensitivity/specificity for an annotated/assembled simulation is a huge pain to write, so I stuck it in Ballgown. The sim_tx_info.txt file output by Polyester is basically the exact input you need for this).