Skip to content

Instantly share code, notes, and snippets.

@antriv
Last active April 26, 2018 19:51
Show Gist options
  • Save antriv/d98fb541101e17dd2320703473760dcd to your computer and use it in GitHub Desktop.
Save antriv/d98fb541101e17dd2320703473760dcd to your computer and use it in GitHub Desktop.
Convert a folder of text files into a single CSV file with one column for the file names and one column of the text of the file. A function in R.
#' Compiling several text files into a single CSV file
#'
#' Convert a folder of text files into a single CSV file
#' with one column for the file names and one column of the
#' text of the file. A function in R.
#'
#' To use this function for the first time run this next line:
#' install.packages("devtools")
#' then thereafter you just need to load the function
#' fom github like so, with these two lines:
#' library(devtools) # windows users need Rtools installed, mac users need XCode installed
#' source_url("https://gist.github.com/antriv/d98fb541101e17dd2320703473760dcd/raw/text2csv.R")
#'
#' Here's how to set the arguments to the function:
#'
#' mydir is the full path of the folder that contains your txt files
#' for example "C:/Downloads/mytextfiles" Note that it must have
#' quote marks around it and forward slashes, which are not default
#' in windows.
#'
#' mycsvfilename is the name that you want your CSV file to
#' have, it must have quote marks around it, but not
#' the .csv bit at the end
#'
#' A full example, assuming you've sourced the
#' function from github already:
#'
#' txt2csv("/data/MRC_Google_Compete/Gutenberg/gutenberg_cleaned_data", "mybigcsvfile")
#'
#' and after a moment you'll get a message in the R console
#' saying 'Your CSV file is called mybigcsvfile.csv and
#' can be found in C:/Downloads/mytextfiles'
txt2csv <- function(mydir, mycsvfilename){
# Get the names of all the txt files (and only txt files)
myfiles <- list.files(mydir, full.names = TRUE, pattern = "*.txt")
# Read the actual contexts of the text files into R and rearrange a little.
# create a list of dataframes containing the text
mytxts <- lapply(myfiles, read.delim)
# combine the rows of each dataframe to make one
# long character vector where each item in the vector
# is a single text file
mytxts1lines <- unlist(lapply(mytxts, function(i) unname(apply(i, 2, paste, collapse=" "))))
# make a dataframe with the file names and texts
mytxtsdf <- data.frame(filename = basename(myfiles), # just use filename as text identifier
fulltext = mytxts1lines) # full text character vectors in col 2
# Now write them all into a single CSV file, one txt file per row
setwd(mydir) # make sure the CSV goes into the dir where the txt files are
# write the CSV file...
write.table(mytxtsdf, file = paste0(mycsvfilename, ".csv"), sep = ",", quote=NULL, comment='', row.names = FALSE, col.names = FALSE)
# now check your folder to see the csv file
message(paste0("Your CSV file is called ", paste0(mycsvfilename, ".csv"), ' and can be found in ', getwd()))
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment