S. mansoni, F. hepatica, H. contortus
This was a large release, and it turned out to be quite disruptive. In response:
- S. mansoni mapping made available as cross-references
- Blog post telling people how they can cope
As of a week ago there were three comments on the live site, by:
- Tuan (a test comment, since deleted)
- me (later deleted)
- Alan (our best comment)
I have little data on usage. The people I demonstrated the feature to in our lab seem to interact with it via the Ensembl genome browser. I designed it for the JBrowse browser; support elsewhere was added later, as a bit of a bolt-on.
- Acrobeloides nanus, a clade IV nematode
- Ancylostoma ceylanicum: new annotation
- Hymenolepis microstoma
- Meloidogyne arenaria, alternative assembly
- Meloidogyne graminicola, draft assembly
- Taenia multiceps
- S. mansoni annotation update
- WormBase core species up to version WS267.
I spent about three weeks on the xref pipeline to incorporate a C. elegans protein mapping from WormBase.
Result: more accurate and complete UniProt, RefSeq mRNA, and RefSeq protein entries.
I’ve adapted Ensembl’s ID mapping pipeline, which is based on exon-on-exon matching with exonerate; the exon matches get propagated to transcript and gene level.
Genome | Genes in current assembly | WBPS release with previous assembly | Genes successfully mapped | Genes in previous assembly | Fraction successfully mapped |
---|---|---|---|---|---|
ancylostoma_ceylanicum_prjna72583 | 11783 | WBPS11 | 7564 | 15892 | 0.476 |
ascaris_suum_prjna62057 | 17974 | WBPS9 | 9468 | 15260 | 0.620 |
fasciola_hepatica_prjeb25283 | 16806 | WBPS10 | 7564 | 22676 | 0.334 |
haemonchus_contortus_prjeb506 | 19430 | WBPS10 | 11439 | 21869 | 0.523 |
meloidogyne_incognita_prjeb8714 | 45351 | WBPS10 | 11977 | 19212 | 0.623 |
This is already 5 to 20% more genes mapped than when running the pipeline with default parameter values.
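The last column of the table is the number of genes successfully mapped divided by the number of genes in the previous assembly. A minimal sketch that recomputes it from the figures above (the variable and function names are illustrative, not part of the pipeline):

```python
# (genes successfully mapped, genes in previous assembly), copied from the table
mapping_counts = {
    "ancylostoma_ceylanicum_prjna72583": (7564, 15892),
    "ascaris_suum_prjna62057": (9468, 15260),
    "fasciola_hepatica_prjeb25283": (7564, 22676),
    "haemonchus_contortus_prjeb506": (11439, 21869),
    "meloidogyne_incognita_prjeb8714": (11977, 19212),
}


def fraction_mapped(mapped, previous):
    """Fraction of the previous assembly's genes that received a mapping."""
    return mapped / previous


# Round to three decimals, as in the table
fractions = {
    genome: round(fraction_mapped(mapped, previous), 3)
    for genome, (mapped, previous) in mapping_counts.items()
}

for genome, fraction in fractions.items():
    print(f"{genome}\t{fraction:.3f}")
```

Note that the denominator is the previous assembly, so the fraction says how many old gene IDs survive, not how many current genes have a predecessor.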
Unmapped genes get killed, and:
- user can search for them, and see what happened (i.e. they got killed)
- user can get the protein sequence
Lots of S. mansoni data from our lab published and available in ENA
Comprehensive treatment of all RNASeq data sets in ENA for our species: currently over 10k runs with metadata
No quantitative results - we show alignments in the genome browser
- profiling baseline expression across sexes, life stages, and organism parts
- differential expression across contrasts
- basic presence/absence data - whether a gene has expression evidence or not - when this is all we can tell
- aggregated information on gene pages
- query interface
- shortlist valuable experiments
- schema for metadata
- curate data in that format
- develop analyses (wrappers around existing tools like DESeq2)
- pipelines to run above
- design database schema
- store analysis results in the database
- query code for gene pages
- UI design / implementation for gene pages
- integrating queries - the hard part!
We could use BioMart to integrate the data: it’s limited and hard to work with, but it’s our best bulk query tool.
Attributes: TPM and standard deviation for each group of samples, or fold change and p-value per contrast for differential expression.
Filters are hard - I would have to split the above values into ranges and add a filter for each group of samples or contrast, which doesn’t scale well.
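To illustrate the scaling problem, here is a rough sketch that enumerates the range filters such a scheme would need. The group names, contrast names, range boundaries, and filter naming are all made up for illustration; none of them come from BioMart or from our data.

```python
from itertools import product


def range_filters(subjects, boundaries, prefix):
    """Return one filter name per (subject, value range) pair.

    Boundaries [a, b, c] yield the ranges a-b and b-c.
    """
    ranges = list(zip(boundaries[:-1], boundaries[1:]))
    return [
        f"{prefix}_{subject}_{low}_to_{high}"
        for subject, (low, high) in product(subjects, ranges)
    ]


# Hypothetical sample groups and contrasts
groups = ["adult_male", "adult_female", "larva"]
contrasts = ["male_vs_female", "adult_vs_larva"]

# Hypothetical range boundaries for TPM and fold change
tpm_filters = range_filters(groups, [0, 1, 10, 100, 1000], "tpm")
fold_change_filters = range_filters(contrasts, [0, 1, 2, 4, 8], "fold_change")

all_filters = tpm_filters + fold_change_filters
print(len(all_filters))
```

With three groups, two contrasts, and four ranges each, this already comes to 20 filters; the count grows as (groups + contrasts) × ranges.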