Last active
September 12, 2017 05:43
Read 20 Newsgroups data in R as a data.table (data.frame)
# Read 20newsgroups data as a data.table (data.frame)
# Author: Srikanth KS
# license: GPL-3
#
# download data from here:
# https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/20_newsgroups.tar.gz
# extract it and provide its location to `baseDir` below
baseDir = "Downloads/20_newsgroups"

# full paths of the newsgroup directories (one directory per newsgroup)
newsGroupNames = list.files(baseDir, full.names = TRUE)

# read every file in one newsgroup directory into a data.table,
# collapsing the lines of each file into a single string
readText = function(directory) {
  textFileNames = list.files(directory, full.names = TRUE)
  text = vapply(textFileNames
                , function(x) paste(readLines(x), collapse = " ")
                , character(1)
                )
  data.table::data.table(newsgroup = basename(directory)
                         , fileName = basename(textFileNames)
                         , text = text
                         )
}

# stack the per-newsgroup tables into a single table
news20 = data.table::rbindlist(lapply(newsGroupNames, readText))

# to convert news20 into a dataframe, run: `data.table::setDF(news20)`
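
To see the directory layout the script expects and the shape of the result without downloading the archive, here is a minimal sketch on a toy dataset. It is a base-R variant rather than the author's `data.table` code: `readTextBase`, `root`, `toyNews`, the newsgroup names and the file contents are all hypothetical names and data made up for illustration. It also passes `warn = FALSE` to `readLines` and `USE.NAMES = FALSE` to `vapply`, neither of which the original does.

```r
# Hypothetical toy layout mirroring the archive: <root>/<newsgroup>/<message file>
root = file.path(tempdir(), "toy_newsgroups")
dir.create(file.path(root, "alt.atheism"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "sci.space"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("From: a@example.com", "Hello world"), file.path(root, "alt.atheism", "1001"))
writeLines(c("Subject: orbit", "Second line"), file.path(root, "sci.space", "2001"))
writeLines("Single line", file.path(root, "sci.space", "2002"))

# Base-R counterpart of readText: one row per file, lines joined by a space
readTextBase = function(directory) {
  textFileNames = list.files(directory, full.names = TRUE)
  text = vapply(textFileNames
                , function(x) paste(readLines(x, warn = FALSE), collapse = " ")
                , character(1)
                , USE.NAMES = FALSE
                )
  data.frame(newsgroup = basename(directory)
             , fileName = basename(textFileNames)
             , text = text
             , stringsAsFactors = FALSE
             )
}

# Stack the per-newsgroup data frames, as rbindlist does for data.tables
toyNews = do.call(rbind, lapply(list.files(root, full.names = TRUE), readTextBase))

# Number of messages per newsgroup
print(table(toyNews$newsgroup))
```

The same `table(news20$newsgroup)` call on the full result is a quick way to check that every newsgroup directory was picked up. Note that this sketch, like the original, assumes every newsgroup directory contains at least one file.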