Created April 25, 2014 02:17
Get top N words from STDIN using Bash, Python, and R. All three scripts produce the same output, but R scales very badly with respect to input size. What am I doing wrong?
#!/usr/bin/env python
import re
import sys
from collections import Counter

# Number of top words to print, given as the first command-line argument.
num_words = int(sys.argv[1])
text = sys.stdin.read().lower()
# Match runs of word characters, so that no empty strings are counted.
words = re.findall(r'\w+', text)
cnt = Counter(words)
for word, count in cnt.most_common(num_words):
    print("%8d %s" % (count, word))
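Splitting on `\W+` and matching `\w+` are not quite the same tokenization, which matters when comparing the Python output against the Bash and R versions. The sketch below is only an illustration; the helper names `tokens_by_split` and `tokens_by_match` and the sample sentence are made up for this example and are not part of the gist.

```python
import re
from collections import Counter


def tokens_by_split(text):
    # Split on runs of non-word characters. A separator at the very start
    # or end of the text leaves an empty string in the result.
    return re.split(r'\W+', text.lower())


def tokens_by_match(text):
    # Collect runs of word characters, similar to `grep -oE '\w+'`.
    return re.findall(r'\w+', text.lower())


sample = "The cat and the hat.\n"

split_counts = Counter(tokens_by_split(sample))
match_counts = Counter(tokens_by_match(sample))

# The split-based count contains an extra entry for the empty string.
print("split:", sorted(split_counts.items()))
print("match:", sorted(match_counts.items()))
```

For small N the empty-string entry rarely makes it into the top N, so the difference tends to show up only when N approaches the vocabulary size.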
#!/usr/bin/env Rscript
num.words <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
input.lines <- readLines(f)
close(f)
# Join all lines into one lowercase string, then extract runs of word characters.
full.text <- tolower(paste(input.lines, collapse = " "))
splits <- gregexpr("\\w+", full.text)
words.all <- regmatches(full.text, splits)[[1]]
# Count each distinct word and sort by descending frequency.
words.unique <- as.data.frame(table(words.all))
words.sorted <- words.unique[order(-words.unique$Freq), ]
# The word column is named words.all; spell it out rather than rely on partial matching.
dummy <- mapply(function(w, c) {
  cat(sprintf("%8d %s\n", c, w))
}, head(words.sorted$words.all, num.words), head(words.sorted$Freq, num.words))
#!/usr/bin/env bash
NUM_WORDS="$1"
tr '[:upper:]' '[:lower:]' |
grep -oE '\w+' |
sort |
uniq -c |
sort -nr |
head -n "$NUM_WORDS"
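To put numbers on the scaling claim, one option is to time each script on synthetic inputs of increasing size. The following is a minimal sketch, not a rigorous benchmark: `make_text`, `time_command`, and `main` are hypothetical helpers written for this example, the input sizes are arbitrary, and the script paths are whatever you pass on the command line.

```python
import random
import subprocess
import sys
import time


def make_text(num_words, vocab_size=1000, seed=0):
    """Return synthetic input: num_words tokens drawn from a small vocabulary."""
    rng = random.Random(seed)
    vocab = ["w%d" % i for i in range(vocab_size)]
    return " ".join(rng.choice(vocab) for _ in range(num_words)) + "\n"


def time_command(cmd, text):
    """Run cmd with text on STDIN and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, input=text.encode("utf-8"),
                   stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def main(scripts):
    if not scripts:
        print("usage: pass the paths of the scripts to time as arguments")
        return
    # Arbitrary sizes chosen for illustration: 10^4, 10^5, 10^6 words.
    for size in (10 ** 4, 10 ** 5, 10 ** 6):
        text = make_text(size)
        for script in scripts:
            # Each script takes the number of top words as its only argument.
            elapsed = time_command([script, "10"], text)
            print("%-20s %9d words %8.2f s" % (script, size, elapsed))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a name of your choosing, it would be invoked with the three script paths as arguments. Wall-clock time includes interpreter startup, so the smallest input mostly measures that overhead; the growth between sizes is the more informative figure.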
Stumbled upon this gist via a tweet. Interesting... and the following snippet might be something worth considering for R. Although, if you are intent on using only the standard library then this is probably not suitable, since it relies on dplyr. No idea how it scales in comparison to your example, but worth a shot :)
Usage:
cat something | ./file_name <num_words>