Last active
March 22, 2017 16:38
-
-
Save smrgit/6b222a20a411212773ffa3980d240908 to your computer and use it in GitHub Desktop.
BRCA CPTAC-RNAseq correlation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| WITH | |
| -- first we get the 77 samples that passed the QC tests | |
| qcSet AS ( | |
| SELECT | |
| TCGA_case_ID AS case_barcode | |
| FROM | |
| `isb-cgc.hg19_data_previews.TCGA_Breast_SuppTable01` | |
| WHERE | |
| QC_Status="pass" ), | |
| -- | |
| -- next we extract the sample_barcode, gene, and logRatio from the | |
| -- CPTAC BRCA proteome iTRAQ table, while at the same time doing a | |
| -- JOIN with the qcSet in order to filter out samples that did not | |
| -- pass the QC tets | |
| -- this query returns 720196 avgLogRatio values for 77 samples and | |
| -- 10599 genes -- on average 9353 genes are detected for each sample | |
| p AS ( | |
| SELECT | |
| a.sample_barcode, | |
| gene, | |
| -- the AVG() function is used here for the 3 samples which | |
| -- were assayed twice; to make sure that we have just one | |
| -- value per sample per gene | |
| AVG(unshared_logRatio) AS avgLogRatio | |
| FROM | |
| `isb-cgc.hg19_data_previews.TCGA_Breast_BI_Proteom_CDAP_r2_itraq` a | |
| JOIN | |
| qcSet b | |
| ON | |
| SUBSTR(a.sample_barcode,1,12)=b.case_barcode | |
| GROUP BY | |
| sample_barcode, | |
| gene ), | |
| -- | |
| -- now we get the expression data from the UNC/RNAseq dataset | |
| -- while filtering, again using a JOIN with the table created above | |
| -- the RNAseq | |
| -- expression table has ~228M rows but we want less than 1M | |
| -- of those for this analysis | |
| -- we also filter out any samples/genes where the RNAseq | |
| -- read-count is 0 | |
| -- from this stage, we will wind up with 710493 rows, with data | |
| -- for 77 samples and 10419 genes | |
| j AS ( | |
| SELECT | |
| g.sample_barcode AS sample, | |
| g.HGNC_gene_symbol AS gene, | |
| LOG10(normalized_count+1) AS RNAseq_logCount, | |
| p.avgLogRatio AS iTRAQ_logRatio | |
| FROM | |
| `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM` g | |
| JOIN | |
| p | |
| ON | |
| SUBSTR(g.sample_barcode,1,15)=SUBSTR(p.sample_barcode,1,15) | |
| AND g.HGNC_gene_symbol=p.gene | |
| WHERE | |
| normalized_count>0), | |
| c AS ( | |
| SELECT | |
| CORR(iTRAQ_logRatio, | |
| RNAseq_logCount) AS corrByGene, | |
| COUNT(*) AS n, | |
| gene | |
| FROM | |
| j | |
| GROUP BY | |
| gene | |
| HAVING | |
| n >=20 ) | |
| SELECT | |
| * | |
| FROM | |
| c | |
| ORDER BY | |
| corrByGene DESC |
Author
Author
revision #4 modifies the JOIN condition so that only the first 15-char of the sample_barcodes are required to match -- for 3 samples, the -01A sample was used for MS/MS while the -01B sample was used for RNAseq and RPPA
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
revision #3 includes a step to get the 77 samples that passed the CPTAC QC process, and this does indeed seem to clean up / magnify the observed correlations