here is a picture of a "RNA-seq BAM file" in JBrowse 2
you can see that there are all these little grey boxes and the black lines connecting them. These are the so-called "reads" in the RNA-seq BAM file! there is also a little "histogram" along the top. This is showing how many "reads" mapped to a given position in the genome
(also note: the orange and green top track is the so called "reference gene annotation". the reference gene annotations are from GFF downloaded from NCBI, like Parshawn found)
here is what the reads look like when zoomed in
each little grey box is a "read" that can from a DNA sequencer, in this case, an Illumina machine. but the important part: each "read" corresponds directly with a fragment RNA. this is why we call it RNA-seq.
The "central dogma of molecular biology" (https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology) tells us that ""DNA makes RNA, and RNA makes protein"". When we do RNA-seq, we get to measure the "middle part" of this dogma, the RNA part! This gives us an insight into which parts of your DNA turn into RNA
RNA-seq is not just a "qualitative" view of which parts of your genome are transcribed, it is also a quantitative measure of transcription! If you observe a bunch of "reads" from a particular area of your genome, that means it is "highly expressed"
In the screenshot below, I turned on the "compact" visualization to show the entire set of reads that stack up. Tools like HTseq (https://htseq.readthedocs.io/en/latest/) can automatically count how many reads align to each gene (represented by the orange boxes in the genome browser) which tells you, genome wide, how much each gene is being expressed
RNA are transcribed from the DNA, and then undergo splicing which removes intronic sequences. When you do RNA-seq on these fragments, and map these back to the genome, there are these big gaps from where the introns were removed. We need to use tools (like STAR https://github.com/alexdobin/STAR) that perform a "spliced alignment". This means that, potentially, part of the RNA-seq reads will get "split mapped" where part of it aligns to e.g. exon 1, and part of it aligns to exon 2.
The tools like STAR, mentioned above, output BAM files (or SAM files, same idea but a plaintext version of a BAM) which contain information about those spliced alignments using CIGAR strings.
In the above example, if we have a RNA-seq read mapping partially to exon1 and partially to exon2, the CIGAR string might say
10M 500N 10M (note: CIGAR strings normally have no spaces, just inserted here for readability though)
that means, 10 base pairs (M means match) of the read (e.g. part of a grey box) mapped to exon 1, then there is a big intron gap in the middle of 500 base pairs (N means skip, generally implying a skip over intronic region), and another 10 bases (again, M means match) part aligned to exon 2
JBrowse counts, on the fly, all the reads that have a CIGAR string with a skip (e.g. the 500N mentioned above) and then displays that as the blue arc (which, for additional reasons which are sort of subtle: we can also determine the splicing signal remember, the GT/AG surrounding the introns, and determine, in the blue arc case, that these reads are on the reverse strand of the DNA. Remember that DNA is a double helix with a "forward" and "reverse" strand (sometimes called Watson and Crick strand, or 5'->3' is forward and 3' to 5' is reverse)
If there were red arcs, it would indicate a forward strand gene
so if we look at JBrowse again, I made it non-compact again and I can mouseover a specific read
you can see the tooltip is showing a single read where, like the scenario described above, part of the read aligns to the left exon (grey) and part aligns to the right (grey). the black bar in the middle is the part that is "skipped"
my notes in Tiberius lecture sort of go into some stuff related to this also, since Tiberius is concerned with accurately and qualitatively determining gene structure as well, but notably, Tiberius does NOT use RNA-seq. RNA-seq is very helpful for determining genes, but it costs money!
RNA-seq is extremely useful and detailed biological measurement of both quantitative measurement of gene expression, and qualitative measurement of gene structure.
In eukaryotes, genes often have complex gene structures! Prokaryotes do not really have introns, but eukaryotes do, and this leads to an immense amount of complexity in the way that DNA is transcribed. There are about ~20,000 protein coding genes in the human genome (depends on how you count sometimes for the exact measure), but each gene can be "alternatively spliced" which means that different introns will be spliced out, leading to different combinations of exons being included in the final gene product. Therefore, each gene can encode a complex set of different RNA molecule instructions that go on to become the protein
Since RNA-seq is sort of like a "direct" observation of that activity in action, it's very valuable. However, RNA-seq does cost money! and as mentioned above, your cells transcribe things differently depending on different conditions, so any given RNA-seq measurement is just a measurement of the transcriptional activity under a specific biological condition. Oftentimes, scientists will need to perform several RNA-seq experiments to understand what they are working on. For example, they might perform RNA-seq under drought condition and healthy condition in a plant to determine which genes are activated in the drought condition!
Yes, I know, it's crazy. All this stuff is quite low level, but we just add lots of layers of analysis tools to crunch it down and make it meaningful! But it is still good to have a handle on the basic concepts and i think visually seeing the reads in the genome browser is a great way to do that.
Sequencing is just a magical technology that has been able to scale almost faster than we can make algorithms to deal with, with there being petabytes of data on the NCBI "sequence read archive" and the ability for people to do new sequencing projects e.g. RNA-seq experiments is from just the low-hundreds of dollars, so any scientific question that CAN be interrogated via sequncing data, WILL be :)
(as mentioned above, RNA-seq is very versatile at comparing biological conditions, etc)
Many RNA-seq data files that you see are so-called "short reads". This is often Illumina sequencing, and it reads ~150 letters of the RNA (or the DNA version of the RNA).
There are also long read versions of RNA-sequencing (from PacBio and Nanopore)
The long read sequencing can tell you all the letters of the whole gene transcript at once, whereas short read sequencing is more like putting together a puzzle.
Here is what long read RNA-sequencing looks like in JBrowse 2. The below data is called "IsoSeq"
Interestingly, like I referred to above with a read getting partially aligned to exon1 and exon2 above, for the long reads, it often aligns partially to ALL the exons
So if you had a gene with 5 exons, each 10 bp long, and introns between them 500 bp long, then a CIGAR string of a read from a long read RNA-seq experiment might look like
10M 500N 10M 500N 10M 500N 10M 500N 10M (note: CIGAR strings normally have no spaces, just inserted here for readability though)
The long read data is very "clean" and nice, compared to the short fragmented illumina sequencing, but it is more expensive! I don't know the exact numbers, but as mentioned above, scientists sometimes want to do a bunch of RNA-seq experiments at once, on different biological conditions, so costs can add up :)
This was just a miscellaneous brainstorm just to try to visually expose some people to what RNA-seq looks like in the genome browser (JBrowse 2) and also discuss just some random background information, but I also wanted to just make a final note to tie it back to the featurizer and ML project too
Unfortunately Ian's screenshot from this slide isn't actually showing the splicing, but my screenshots above do show splicing! But, the math from Ian's slides, and the feature's such as "# of reads that go from unspliced (previous position) to spliced" all depends a lot on carefully examining the CIGAR strings for all the reads overlapping a particular position in the genome
There is the massive repository of reads on NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) Reads on SRA need to be downloaded with a special tool, and you generally get FASTQ. This means the reads are not aligned to any reference genome yet (so there are no BAM files). Therefore, you have to align it yourself
Alternatively, there are some data repositories like ENCODE (https://www.encodeproject.org/) that offer the pre-aligned BAM files. Note that human data is often restricted access, and only special types of highly consented datasets such as ENCODE and 1000 genomes will have BAM files available for humans. This is because you can determine a lot about a person if you have BAM data, it is basically their whole genome, so there are privacy concerns!
I look forward to all your guys work on the project and let me know if I can help :) I have a fair amount of knowledge in the CIGAR string side of things from BAM files but less so on the practicalities of the actual machine learning workflows
If anyone wants help setting up JBrowse 2 to see their RNA-seq data of bigwig files for example, let me know!