VP Nagraj vpnagraj

Overview

This Gist includes an example script that performs basic exploratory analysis of open text data in R. The script includes steps to read in a dataset, tokenize text, summarize counts of tokens per document, perform sentiment analysis, create a document term matrix, and run a topic modeling procedure.
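
As a rough illustration of the first few steps listed above, here is a minimal sketch in base R only, so that it runs without extra packages. It is not the script from the Gist, which may well use dedicated text-mining packages. The toy notes, the column names (encounter_id, provider_notes), the tokenize() helper, and the tiny sentiment lexicon are all assumptions made up for the example. Topic modeling is left out because it needs a modeling package.

```r
## toy stand-in for the free-text column (assumption: the real script reads a csv instead)
notes <- data.frame(
  encounter_id = c("E1", "E2", "E3"),
  provider_notes = c(
    "Patient reports severe headache and nausea.",
    "Follow-up visit. Patient feels better, pain improved.",
    "Severe pain in lower back; patient anxious."
  ),
  stringsAsFactors = FALSE
)

## hypothetical helper: lowercase, then split on anything that is not a letter or apostrophe
tokenize <- function(x) {
  pieces <- strsplit(tolower(x), "[^a-z']+")
  ## drop empty strings that can appear when text starts with a delimiter
  lapply(pieces, function(p) p[nzchar(p)])
}

token_list <- tokenize(notes$provider_notes)

## long format: one row per token per document
tokens <- data.frame(
  encounter_id = rep(notes$encounter_id, lengths(token_list)),
  token = unlist(token_list),
  stringsAsFactors = FALSE
)

## count of tokens per document
tokens_per_doc <- table(tokens$encounter_id)

## document term matrix: documents in rows, terms in columns, counts in cells
dtm <- unclass(table(tokens$encounter_id, tokens$token))

## hypothetical mini lexicon (assumption: a real analysis would use a published lexicon)
lexicon <- c(severe = -1, pain = -1, anxious = -1, nausea = -1,
             better = 1, improved = 1)

## sentiment per document: sum of lexicon values, with unmatched tokens scored as 0
score <- unname(lexicon[tokens$token])
score[is.na(score)] <- 0
sentiment <- tapply(score, tokens$encounter_id, sum)
```

Printing tokens_per_doc, dtm, and sentiment shows one summary per encounter. The long one-row-per-token layout is used here because counts, sentiment joins, and the document term matrix can all be derived from it.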

Data

The script references a file called simulated_emr_data.csv. This data was created by ChatGPT 4o using the following prompt:

Simulate EMR data. Include 6 columns: Encounter ID, Patient ID, Name, Age, Visit Date, Chief Complaint, Provider Notes (free text). Include some patients with repeat visits and ensure that repeated patients have matching IDs, names, and ages. The provider notes should be at least 25 words and should mix tone and style of notes. Create a csv file with 1000 rows.
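
Because the prompt asks that repeated patients keep matching IDs, names, and ages, it may be worth checking that the generated file actually honors this. The sketch below shows one way to do so in base R. The toy data frame and its column names (patient_id, name, age) are assumptions standing in for the real file, and one deliberate mismatch is included so that the check has something to find.

```r
## toy stand-in for the simulated file (assumption: real column names may differ)
emr <- data.frame(
  patient_id = c("P001", "P001", "P002", "P003", "P003"),
  name = c("Ann Lee", "Ann Lee", "Raj Rao", "Sam Roe", "Sam Row"),
  age = c(54, 54, 31, 47, 47),
  stringsAsFactors = FALSE
)

## patients with more than one encounter
visits_per_patient <- table(emr$patient_id)
repeat_ids <- names(visits_per_patient)[visits_per_patient > 1]

## distinct name/age combinations recorded for each patient ID
distinct_rows <- unique(emr[, c("patient_id", "name", "age")])
combos_per_patient <- table(distinct_rows$patient_id)

## patient IDs whose name or age differs across visits
inconsistent_ids <- names(combos_per_patient)[combos_per_patient > 1]
```

With the real data, emr would instead come from reading simulated_emr_data.csv, and an empty inconsistent_ids would suggest the repeat visits are internally consistent.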

	## script to demonstrate running sets of workflows
	## adapted from the workflowsets package vignette
	## https://workflowsets.tidymodels.org/articles/evaluating-different-predictor-sets.html

	## load packages
	## NOTE: both tidymodels and tidyverse are "meta" packages ...
	## ... so they will load lots of other packages under the hood
	library(tidymodels)
	library(tidyverse)

	## -------------------------------------------------------------------------------------------------------------------
	library(tidyverse)
	#remotes::install_github("hrbrmstr/cdcfluview")
	library(cdcfluview)
	library(MMWRweek)
	library(plotly)
	library(gganimate)

	reprex::reprex({

	library(tidyverse)

	## define the example tibble
	dat <-
	  tribble(
	    ~individual, ~Q1_A, ~Q1_B, ~Q1_C, ~Q1_D, ~Q2_A, ~Q2_B, ~Q2_C, ~Q2_D,
	    "alice", NA, "cat", NA, NA, "tacos", NA, NA, NA,
	    "bob", "dog", NA, NA, NA, NA, NA, NA, "pizza"
	  )

	})

	###############################################################################
	## brief demo of anomaly detection in R using the timetk anomalize() function
	## current as of 2024-05-13
	###############################################################################
	## set up
	## load dplyr for data manipulation
	library(dplyr)
	## load timetk for anomaly detection functionality
	library(timetk)
	## load jsonlite to read in example data
	library(jsonlite)

	###############################################################################
	## brief demo of exploratory data analysis (EDA) tools for data frames in R
	## NOTE: the code below is intended to preview the EDA tools ...
	## ... it does not exhaustively demonstrate functionality for these tools ...
	## ... and it is current as of 2024-04-09 ...
	## ... for more information refer to the documentation for each package
	###############################################################################

	###############################################################################
	## set up

	library(microbenchmark)
	library(redux)
	library(svSocket)

	# clear workspace
	rm(list = ls())

	# set up svSocket
	startSocketServer()
	con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

	library(ggplot2)
	library(tidyr)

	## simulate one x variable and two y variables on very different scales
	dat <- data.frame(x = rnorm(n = 1000, mean = 2.8, sd = 0.05),
	                  y1 = sample(64503:73034, size = 1000, replace = TRUE),
	                  y2 = sample(18738:19602, size = 1000, replace = TRUE))

	## scatterplot of y1 against x
	dat %>%
	  ggplot() +
	  geom_point(aes(x, y1))

	# script to demonstrate outbreak animation with a subset of mers_korea_2015 data

	# must install github release of threejs package

	# devtools::install_github("bwlewis/rthreejs")
	library(threejs)
	library(outbreaks)
	library(dplyr)

	# use dplyr to subset results to only include hospital visit exposure

	# install.packages("babynames")
	library(babynames)
	# install.packages("tidyverse")
	library(tidyverse)
	# install.packages("ggplot2")
	# install.packages("dplyr")

	# let's take a look at the data
	babynames %>%
	View()