Skip to content

Instantly share code, notes, and snippets.

@avanishgiri
Created November 12, 2013 08:03
Show Gist options
  • Save avanishgiri/7427244 to your computer and use it in GitHub Desktop.
Save avanishgiri/7427244 to your computer and use it in GitHub Desktop.
Returning the k most frequent words in a string
def word_frequencies(string, k)
words = string.split(/\s/) # O(n) ## delimit by whitespace
max = 0
min = Float::INFINITY
# create hash table for word --> frequency #
frequencies = words.inject(Hash.new(0)) do |hash,word| # O(n)
occurrences = hash[word] += 1
max = occurrences if occurrences > max
min = occurrences if occurrences < min
hash;
end
# perform counting sort #
sorted = Array.new(max + words.length - min)
frequencies.each do |word, frequency| # O(n)
index = frequency - min #offset index by minimum index
if sorted[index]
sorted[index] = sorted[index].push(word)
else
sorted[index] = [word]
end
end
return sorted.reverse if k > sorted.length
return sorted.compact.flatten[-k..-1].reverse #O(n)
end
text = "hi hello hi my name is what what hi hello hi this is a test test test test hi hi hi what hello these are some words these these"
p word_frequencies(text, 4)
@avanishgiri
Copy link
Author

In the worst case, the input to the function could be the total number of words in the document, reducing the problem to sorting the words by their frequencies. This made me think that the lower bound for time complexity would be O(n log n) if I used a comparison sorting method. My thought was that the best approach, if wanting to preserve worst case linear time complexity, was to use a counting sort.

** This solution presumes that hash table lookup and inserts are O(1), that dynamic array inserts are O(1), and that compact, reverse, and flatten are O(n) **

Another option I considered was using was the partition algorithm from quicksort to select the k largest elements from the set, but because of the requirement that the output be returned in descending order (by frequency), this algorithm is O(w + klogk) in the average case and O(k^2) in the worst case. (where w is the number of words in the file)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment