Created
November 12, 2013 08:03
-
-
Save avanishgiri/7427244 to your computer and use it in GitHub Desktop.
Returning the k most frequent words in a string
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def word_frequencies(string, k) | |
words = string.split(/\s/) # O(n) ## delimit by whitespace | |
max = 0 | |
min = Float::INFINITY | |
# create hash table for word --> frequency # | |
frequencies = words.inject(Hash.new(0)) do |hash,word| # O(n) | |
occurrences = hash[word] += 1 | |
max = occurrences if occurrences > max | |
min = occurrences if occurrences < min | |
hash; | |
end | |
# perform counting sort # | |
sorted = Array.new(max + words.length - min) | |
frequencies.each do |word, frequency| # O(n) | |
index = frequency - min #offset index by minimum index | |
if sorted[index] | |
sorted[index] = sorted[index].push(word) | |
else | |
sorted[index] = [word] | |
end | |
end | |
return sorted.reverse if k > sorted.length | |
return sorted.compact.flatten[-k..-1].reverse #O(n) | |
end | |
text = "hi hello hi my name is what what hi hello hi this is a test test test test hi hi hi what hello these are some words these these" | |
p word_frequencies(text, 4) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
In the worst case, the input to the function could be the total number of words in the document, reducing the problem to sorting the words by their frequencies. This made me think that the lower bound for time complexity would be O(n log n) if I used a comparison sorting method. My thought was that the best approach, if wanting to preserve worst case linear time complexity, was to use a counting sort.
** This solution presumes that hash table lookup and inserts are O(1), that dynamic array inserts are O(1), and that compact, reverse, and flatten are O(n) **
Another option I considered was using was the partition algorithm from quicksort to select the k largest elements from the set, but because of the requirement that the output be returned in descending order (by frequency), this algorithm is O(w + klogk) in the average case and O(k^2) in the worst case. (where w is the number of words in the file)