This Gist includes an example script for basic exploratory analysis of open-text data in R. The script reads in a dataset, tokenizes the text, summarizes token counts per document, performs sentiment analysis, creates a document-term matrix, and runs a topic modeling procedure.
The script references a file called simulated_emr_data.csv. This data was created by ChatGPT 4o using the following prompt:
Simulate EMR data. Include 6 columns: Encounter ID, Patient ID, Name, Age, Visit Date, Chief Complaint, Provider Notes (free text). Include some patients with repeat visits and ensure that repeated patients have matching IDs, names, and ages. The provider notes should be at least 25 words and should mix tone and style of notes. Create a csv file with 1000 rows.
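
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the Gist's actual script: it assumes the `dplyr`, `tidytext`, and `topicmodels` packages are installed, uses a small inline data frame in place of simulated_emr_data.csv, and the column names `encounter_id` and `provider_notes` are hypothetical stand-ins for whatever the generated file contains.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy stand-in for simulated_emr_data.csv (column names are assumptions).
# With the real file, something like read.csv("simulated_emr_data.csv") would go here.
notes <- data.frame(
  encounter_id = c("E1", "E2", "E3", "E4"),
  provider_notes = c(
    "Patient reports severe chest pain and anxiety since last night.",
    "Follow up visit. Patient is happy with progress and pain improved.",
    "Complains of persistent cough and fever. Worried about worsening symptoms.",
    "Routine check. Patient feels great and has no pain or complaints."
  ),
  stringsAsFactors = FALSE
)

# Tokenize: one row per word per encounter (lowercased, punctuation stripped).
tokens <- notes %>%
  unnest_tokens(word, provider_notes)

# Summarize the number of tokens per document.
token_counts <- tokens %>%
  count(encounter_id, name = "n_tokens")

# Sentiment: match tokens against the Bing lexicon bundled with tidytext
# and count positive/negative words per encounter.
sentiment_counts <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(encounter_id, sentiment)

# Document-term matrix: drop stop words, count terms, cast to a DTM.
dtm <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(encounter_id, word) %>%
  cast_dtm(encounter_id, word, n)

# Topic model: LDA with an arbitrary choice of k = 2 topics.
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# Highest-probability terms for each topic.
top_terms <- tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3) %>%
  ungroup()
```

Four short notes are far too few for meaningful topics; the toy data only shows how each step's output feeds the next. On the full 1000-row file, the number of topics `k` would need to be chosen by inspection or a fit criterion.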