
@cannin
Last active June 26, 2024 20:01
rcellminer_chatgpt.ipynb

Getting Started

  1. Go to https://chat.openai.com/ and sign in (make an account if you do not have one)
  2. Copy and paste the following text into the text bar at the bottom of the page (after accepting the various terms and disclosures):
Imagine you are helping me interact with several datasets of cancer genomics data. At any given point in time, you have the following abilities. You are also required to output code for some of the requests.

Question: Ask me a clarification question 
Reason: Explain why you did something the way you did it. 
Code: Output R code that achieves the desired goal.

The database contains expression data for any human gene. Within the R code, we have the following commands available to us. You are not to use any other hypothetical functions.

# The following code is needed to access our data
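# rcellminerUtilsCDB provides cellLineMatchTab, which is used further below
library(rcellminer); library(rcellminerData); library(rcellminerUtilsCDB); library(ccleData)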

# The following code is needed to access the NCI60 expression dataset that returns an R matrix with cell lines as columns and genes as rows 
nci60_exp <- rcellminer::getAllFeatureData(rcellminerData::molData)[["exp"]] 

# The following code is needed to access the CCLE expression dataset that returns an R matrix with cell lines as columns and genes as rows 
ccle_exp <- rcellminer::getAllFeatureData(ccleData::molData)[["exp"]] 

# The following code is needed to create an R data frame tmp_cell_line_match_tab with the columns "nci60" and "ccle"; this contains overlapping cell lines in the datasets
data(cellLineMatchTab)
tmp_cell_line_match_tab <- cellLineMatchTab[, c("nci60", "ccle")]
tmp_cell_line_match_tab <- tmp_cell_line_match_tab[complete.cases(tmp_cell_line_match_tab),]
tmp_cell_line_match_tab <- tmp_cell_line_match_tab[tmp_cell_line_match_tab$ccle %in% colnames(ccle_exp),]

# The following code is needed to subset the data to cell lines available in both datasets
ccle_subset <- ccle_exp[, tmp_cell_line_match_tab$ccle]; nci60_subset <- nci60_exp[, tmp_cell_line_match_tab$nci60]

Are you ready?
  3. Ask your question by typing it in the text bar; example question:
Write R code to examine the relationship between TP53 and MDM2 in the NCI60 that can do the following: 

1. Calculate and print the correlation of the two genes; round the correlation value to two digits. 
2. Make a plot of the relationship between the two genes; TP53 should be on the x-axis and MDM2 on the y-axis. Make the plot points black circles, the title "TP53 versus MDM2 (NCI60)", the axis labels should be the appropriate gene name. 

Include steps related to data loading. 

Test Code

  1. Go to this page: https://colab.research.google.com/gist/cannin/c0a8a74f7cc4ae61a4a3f12879c55831/rcellminer_chatgpt.ipynb and follow the instructions there.
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyMuVtkpC6rVXJdEVVXY/L0j",
"include_colab_link": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/cannin/c0a8a74f7cc4ae61a4a3f12879c55831/rcellminer_chatgpt.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Install Necessary Dependencies (e.g., rcellminer)"
],
"metadata": {
"id": "aKxw8vdVdXPT"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tWYsABDkbb3O"
},
"outputs": [],
"source": [
"if (!require(\"BiocManager\", quietly = TRUE))\n",
" install.packages(\"BiocManager\")\n",
"\n",
"BiocManager::install(\"rcellminer\")"
]
},
{
"cell_type": "markdown",
"source": [
"# Test Code from ChatGPT\n",
"\n",
"**NOTE: This currently only works with NCI60 data.**\n",
"\n",
"1. Click the play button for the first code block above to install the dependencies. \n",
"2. Replace the code below with the code from ChatGPT, then click the play button. "
],
"metadata": {
"id": "5AK0S2_hdVsm"
}
},
{
"cell_type": "code",
"source": [
"# Load the NCI60 expression data\n",
"library(rcellminer)\n",
"library(rcellminerData)\n",
"nci60_exp <- rcellminer::getAllFeatureData(rcellminerData::molData)[[\"exp\"]] \n",
"\n",
"# Subset to TP53 and MDM2 genes\n",
"tp53 <- nci60_exp[\"TP53\",]\n",
"mdm2 <- nci60_exp[\"MDM2\",]\n",
"\n",
"# Calculate and print correlation\n",
"correlation <- round(cor(tp53, mdm2), 2)\n",
"print(paste(\"Correlation between TP53 and MDM2:\", correlation))\n",
"\n",
"# Create scatterplot\n",
"plot(tp53, mdm2, pch=16, col=\"black\", xlab=\"TP53\", ylab=\"MDM2\", main=\"TP53 versus MDM2 (NCI60)\")"
],
"metadata": {
"id": "7tBOAGJJdSsP"
},
"execution_count": null,
"outputs": []
}
]
}
Imagine you are helping me interact with several datasets of cancer genomics data. At any given point in time, you have the following abilities. You are also required to output code for some of the requests.
Question: Ask me a clarification question
Reason: Explain why you did something the way you did it.
Code: Output R code that achieves the desired goal.
The database contains expression data for any human gene. Within the R code, we have the following commands available to us. You are not to use any other hypothetical functions. Make sure your R code includes library loading instructions for any R packages containing necessary data.
# DEPENDENCY INSTALLATION ----
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("rcellminer")
BiocManager::install("rcellminerData")
if (!"devtools" %in% installed.packages()) install.packages("devtools")
devtools::install_github('https://github.com/CBIIT/rcellminerUtilsCDB', dependencies = TRUE, upgrade ="never")
# LOADING ----
# The following code is needed to access our data
library(rcellminer); library(rcellminerData); library(rcellminerUtilsCDB); library(ccleData)
# rcellminerData (for the NCI-60, default), ccleData, gdscDec15, and ctrp are examples of rcellminer-compatible data packages that exist; returned code should only use data outside the default package if a user requests examples using multiple data packages.
# rcellminer compatible data packages have both molData (molecular data) and drugData (drug response data); both molData and drugData have annotation information
## The following data types may be present in rcellminer compatible data packages
# Data type|Abbreviation|Identifier ID
# Drug activity|act|drug NSC or name
# DNA Copy Number|cop|gene name
# DNA Mutation|mut|gene name
# DNA Methylation 450K|met|gene name
# DNA Methylation 850K|mth|gene name
# Body DNA Methylation 850K|bmt|gene name
# DNA Methylation RRBS|rrb|gene name
# Microarray RNA Expression using z-score|exp|gene name
# Microarray RNA Expression using Avg.log2|xai|gene name
# RPLA Protein|pro|protein/gene name
# MicroRNA|mir|microRNA ID
# Miscellaneous phenotypic|mda|Miscellaneous phenotypic name
# SWATH-MS Protein|swa|protein/gene name
# RNA-seq Expression using log2.FPKM+1|xsq|gene name
# Crispr|cri|gene name
# Metabolomics|mtb|metabolite
# Histone H3K27ac|his|gene name
# Histone H3K4me3|hs4|gene name
# Surface Receptor Protein|sur|protein/gene name
# MOLECULAR DATA ----
## Not all molData will have every type of assay data; to find the data available, use the following code:
names(rcellminer::getAllFeatureData(rcellminerData::molData))
# Example output: [1] "exp" "xai" "cop" "met" "mir" "mut" "exo" "pro" "swa" "mda" "xsq" "his" "hs4" "sur"
## This code is needed to access the NCI60 expression dataset that returns an R matrix with cell lines as columns and genes as rows
nci60_exp <- rcellminer::getAllFeatureData(rcellminerData::molData)[["exp"]]
# The matrix can be accessed by standard indices and/or gene and cell line names.
nci60_exp[c("TOP1", "TOP1MT", "SLFN11"), c("BR:MCF7", "BR:MDA-MB-231", "BR:HS 578T")]
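The name-based indexing above can be tried without the data packages. The following is a minimal sketch on a toy matrix with the same layout (genes as rows, cell lines as columns); `toy_exp` and its random values are placeholders assumed for illustration, not NCI-60 data.

```r
# Toy stand-in for nci60_exp: genes as rows, cell lines as columns.
# Values are random placeholders, NOT real NCI-60 measurements.
set.seed(1)
genes <- c("TOP1", "TOP1MT", "SLFN11", "TP53")
cell_lines <- c("BR:MCF7", "BR:MDA-MB-231", "BR:HS 578T", "CNS:SF-268")
toy_exp <- matrix(rnorm(length(genes) * length(cell_lines)),
                  nrow = length(genes),
                  dimnames = list(genes, cell_lines))

# Index by gene and cell line names; drop = FALSE keeps the result a matrix
sub_by_name <- toy_exp[c("TOP1", "SLFN11"), c("BR:MCF7", "BR:HS 578T"), drop = FALSE]

# Indexing by a missing name raises a "subscript out of bounds" error,
# so keep only the requested genes that are actually present
wanted <- c("TOP1", "NOT_A_GENE")
present <- intersect(wanted, rownames(toy_exp))
sub_present <- toy_exp[present, , drop = FALSE]

print(dim(sub_by_name))
print(rownames(sub_present))
```

Filtering with `intersect()` before indexing is one way to avoid errors when a requested gene symbol is absent from a dataset.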
## This code is needed to access the CCLE expression dataset that returns an R matrix with cell lines as columns and genes as rows
ccle_exp <- rcellminer::getAllFeatureData(ccleData::molData)[["exp"]]
# DRUG DATA ----
# This code is needed to access the NCI-60 drug response data; this returns an R matrix with drugs as rows and cell lines as columns
nci60_act <- exprs(getAct(rcellminerData::drugData))
# Data for particular drugs and cell lines can be accessed by name; in the case of the NCI-60 these are NSC numbers
nci60_act[c("740", "609699"), c("BR:MCF7", "BR:MDA-MB-231", "BR:HS 578T")]
# The following code is needed to match drug identifiers from different projects or to otherwise retrieve identifiers for a drug; each column of drugSynonymTab is a list of vectors; drugSynonymTab is found in the rcellminerUtilsCDB package
data(drugSynonymTab)
tmp_drug_synonym_tab <- drugSynonymTab[, c("nci60", "ctrp")]
# If you use searchListOfVectors() in your answer, include its R code in your answer
#' Search List of Vectors
#'
#' @param q query vector
#' @param lst list of vectors to search
#' @param useNames return names rather than indices; an error is raised if lst is not named. Default: FALSE
#' @param queryAll if TRUE, return only the entries of lst that contain every element of q. Default: FALSE
#'
#' @return a list of vectors with the same length as the query vector, each list
#' entry will have indices for lst where there was a match with the query
#' vector. Return NA if there were no matches.
#'
#' @examples
#' lst <- list(1:3, 3:5, 3:7)
#' q <- c(3, 5)
#' results <- searchListOfVectors(q, lst)
#' names(results) <- q
#'
#' lst <- list(LETTERS[1:3], LETTERS[3:5], LETTERS[3:7])
#' q <- c("C", "E")
#' searchListOfVectors(q, lst)
searchListOfVectors <- function(q, lst, useNames=FALSE, queryAll=FALSE) {
  if(useNames && is.null(names(lst))) {
    stop("ERROR: List is not named")
  }

  # Label every element of the flattened list with the name or index of its source vector
  if(useNames) {
    tmp <- rep(names(lst), sapply(lst, length))
  } else {
    tmp <- rep(seq_along(lst), sapply(lst, length))
  }

  if(queryAll) {
    # Keep only the list entries that contain every element of q
    resultsSe <- tmp[which(unlist(lst) %in% q)]
    idx <- which(table(resultsSe) == length(q))

    if(useNames) {
      return(names(idx))
    } else {
      # The table names carry the original list indices; positions within the table do not
      return(as.integer(names(idx)))
    }
  } else {
    resultsSe <- sapply(q, function(x) tmp[which(unlist(lst) %in% x)], simplify=FALSE)

    if(!is.list(resultsSe)) {
      return(NA)
    }

    return(resultsSe)
  }
}
# This code allows searching drugSynonymTab for synonyms of a particular identifier (e.g., NCI-60 identifier 740)
tmp <- searchListOfVectors("740", drugSynonymTab$nci60)
drugSynonymTab[tmp[["740"]],]
# Result of searching drugSynonymTab; methotrexate is a synonym for NCI-60 identifier 740
# NAME_SET nci60 ctrp gdscDec15 ccle nciSclc almanac sarcoma lantern lantnci
#436 59-05-2,.... 740, 757113 methotrexate Methotrexate NA 740 NA 740 NA 740
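For a single identifier, the same kind of lookup can be sketched in base R alone. The example below assumes a toy table shaped like drugSynonymTab (each column is a list of character vectors); `toy_synonym_tab` and its second row are hypothetical placeholders, and only the 740/methotrexate pairing is taken from the result shown above.

```r
# Toy stand-in for drugSynonymTab: each column is a list of character vectors.
# The second row is a made-up placeholder, not a real drug record.
toy_synonym_tab <- data.frame(
  nci60 = I(list(c("740", "757113"), "999999")),
  ctrp  = I(list("methotrexate", "placeholder_drug"))
)

# Flag the rows whose nci60 entry contains the query identifier
query <- "740"
hit <- vapply(toy_synonym_tab$nci60, function(ids) query %in% ids, logical(1))
hit_rows <- which(hit)

# Pull the matching names from another project's column
ctrp_names <- unlist(toy_synonym_tab$ctrp[hit_rows])
print(ctrp_names)
```

This sketch handles one query at a time; searchListOfVectors() remains the option for several queries at once.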
# Annotation data for the drugs can be obtained.
nci60_drug_annot <- getFeatureAnnot(rcellminerData::drugData)[["drug"]]
nci60_drug_annot[1, ]
# NSC NAME FDA_STATUS MOA PUBCHEM_ID SMILES TOTAL_EXPS TOTAL_EXPS_AFTER_QC
#1 1 tolylquinone <NA> <NA> NA CC1=CC(=O)C=CC1=O 3 2
# SAMPLE ANNOTATION ----
# Get the annotation data (tissue type, etc.) for the available cell lines.
nci60_line_annot <- getSampleData(rcellminerData::molData)
nci60_line_annot[1:4, ]
# Note that cell line order in the rows of the annotation data frame precisely
# matches cell line order along the columns of the expression data matrix.
identical(nci60_line_annot$Name, colnames(nci60_exp))
# We can thus extract submatrices with expression data for particular tissue types.
nci60_exp_breast <- nci60_exp[, nci60_line_annot$OncoTree1 == "Breast"]
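Tissue-based subsetting relies on the annotation rows and the matrix columns sharing the same order. A minimal sketch with toy data follows; `toy_exp`, `toy_annot`, the cell line names, and the tissue labels are all hypothetical placeholders.

```r
# Toy expression matrix (genes x cell lines) and a matching annotation table.
# All names and values are placeholders, not real annotation data.
toy_exp <- matrix(1:12, nrow = 3,
                  dimnames = list(paste0("GENE", 1:3),
                                  c("BR:A", "BR:B", "CNS:C", "ME:D")))
toy_annot <- data.frame(Name = colnames(toy_exp),
                        OncoTree1 = c("Breast", "Breast", "CNS/Brain", "Skin"),
                        stringsAsFactors = FALSE)

# Confirm annotation order matches column order before using a logical mask
stopifnot(identical(toy_annot$Name, colnames(toy_exp)))

# Subset columns to one tissue type
is_breast <- toy_annot$OncoTree1 == "Breast"
toy_exp_breast <- toy_exp[, is_breast, drop = FALSE]

# Mean value per gene within each tissue type
tissue_means <- sapply(split(seq_len(ncol(toy_exp)), toy_annot$OncoTree1),
                       function(idx) rowMeans(toy_exp[, idx, drop = FALSE]))
print(tissue_means)
```

Checking the order first matters because a logical mask built from a differently ordered annotation table would select the wrong columns without raising an error.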
# RELATING MOLECULAR AND DRUG DATA ----
# Drug activity and gene expression profiles can thus be directly correlated.
cor.test(nci60_act["609699", ], nci60_exp["SLFN11", ])
# A more exhaustive pattern comparison (i.e., using correlations) can be done using the rcellminer patternComparison() function.
pattern_comparison <- patternComparison(nci60_act["609699", ], nci60_exp)
pattern_comparison[1:10, ]
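To illustrate the idea of comparing one profile against every row of a matrix, here is a base R sketch on toy data. It assumes a Pearson correlation per row and is not the rcellminer implementation; `toy_exp`, `toy_act`, and the output column names `COR` and `PVAL` are chosen for illustration only.

```r
# Toy expression matrix: 3 "genes" across 6 "cell lines" (placeholder values)
toy_exp <- rbind(GENE1 = c(1, 2, 3, 4, 5, 6),
                 GENE2 = c(6, 5, 4, 3, 2, 1),
                 GENE3 = c(2, 1, 4, 3, 6, 5))
colnames(toy_exp) <- paste0("CL", 1:6)

# Toy activity profile constructed to track GENE3 closely
toy_act <- c(2.1, 0.9, 4.2, 2.8, 6.1, 5.0)

# Correlate the profile with each row, keeping the estimate and p-value
res <- t(apply(toy_exp, 1, function(gene_profile) {
  ct <- cor.test(toy_act, gene_profile)
  c(COR = unname(ct$estimate), PVAL = ct$p.value)
}))
res <- as.data.frame(res)

# Rank rows from the strongest positive correlation downward
res <- res[order(res$COR, decreasing = TRUE), ]
print(res)
```

The p-values here are unadjusted; when every gene in a real dataset is tested, a multiple-testing correction such as `p.adjust()` would normally be applied.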
# MATCHING ACROSS MULTIPLE DATASETS ----
# The following code is needed to create an R data frame tmp_cell_line_match_tab with the columns "nci60" and "ccle"; this contains overlapping cell lines in the datasets; cellLineMatchTab is found in the rcellminerUtilsCDB package
data(cellLineMatchTab)
tmp_cell_line_match_tab <- cellLineMatchTab[, c("nci60", "ccle")]
tmp_cell_line_match_tab <- tmp_cell_line_match_tab[complete.cases(tmp_cell_line_match_tab),]
tmp_cell_line_match_tab <- tmp_cell_line_match_tab[tmp_cell_line_match_tab$ccle %in% colnames(ccle_exp),]
# The following code is needed to subset the data to cell lines available in both datasets
ccle_subset <- ccle_exp[, tmp_cell_line_match_tab$ccle]; nci60_subset <- nci60_exp[, tmp_cell_line_match_tab$nci60]
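The matching steps can be checked end to end on toy data. The sketch below uses two small matrices with different naming schemes and a hypothetical match table; all cell line names and values are placeholders. Unlike the code above, it also filters on the nci60 column names, which is an added assumption intended to guard against names missing from either matrix.

```r
# Toy matrices with different cell line naming schemes (placeholder values)
toy_nci60 <- matrix(c(1, 2, 3,
                      4, 5, 6), nrow = 2, byrow = TRUE,
                    dimnames = list(c("TP53", "MDM2"),
                                    c("BR:MCF7", "CNS:SF-268", "ME:LOXIMVI")))
toy_ccle <- matrix(c(1.1, 2.9,
                     4.2, 6.1), nrow = 2, byrow = TRUE,
                   dimnames = list(c("TP53", "MDM2"),
                                   c("MCF7_BREAST", "LOXIMVI_SKIN")))

# Toy match table; NA marks a cell line with no counterpart in the other dataset
toy_match <- data.frame(nci60 = c("BR:MCF7", "CNS:SF-268", "ME:LOXIMVI"),
                        ccle = c("MCF7_BREAST", NA, "LOXIMVI_SKIN"),
                        stringsAsFactors = FALSE)

# Drop unmatched rows, then keep only names present in both matrices
toy_match <- toy_match[complete.cases(toy_match), ]
toy_match <- toy_match[toy_match$ccle %in% colnames(toy_ccle) &
                         toy_match$nci60 %in% colnames(toy_nci60), ]

# Subset both matrices so that column i refers to the same cell line in each
nci60_sub <- toy_nci60[, toy_match$nci60, drop = FALSE]
ccle_sub <- toy_ccle[, toy_match$ccle, drop = FALSE]

# Compare one gene across the two datasets on the shared cell lines
diff_tp53 <- ccle_sub["TP53", ] - nci60_sub["TP53", ]
print(diff_tp53)
```

Because both subsets are built from the same rows of the match table, their columns line up one-to-one and can be compared or correlated directly.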
Are you ready?