@hirocarma
Created January 5, 2021 12:12
Frequency analysis of text files
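import collections
import sys

import MeCab

# Usage: python <script> <input file>
cmd, infile = sys.argv

# Read the whole input file; the with block closes it automatically.
with open(infile) as f:
    text = f.read()

m = MeCab.Tagger('-Ochasen')
node = m.parseToNode(text)

# Collect the base form of every noun in the text.
words = []
while node:
    features = node.feature.split(",")
    hinshi = features[0]  # part of speech
    if hinshi == "名詞":  # noun
        # Field 6 is the base form in IPAdic-style dictionaries.
        # Unknown words report "*" there, so fall back to the surface form.
        if len(features) > 6 and features[6] != "*":
            origin = features[6]
        else:
            origin = node.surface
        words.append(origin)
    node = node.next

# Print the 40 most frequent nouns with their counts.
c = collections.Counter(words)
print(c.most_common(40))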
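The counting step can be tried on its own, without MeCab installed. The following is a minimal sketch, not part of the original script: `top_words` and the `sample` list are hypothetical names introduced here, and the list stands in for the nouns that the MeCab loop would collect.

```python
import collections


def top_words(words, n=40):
    """Return the n most frequent items in words as (word, count) pairs."""
    return collections.Counter(words).most_common(n)


# Hypothetical pre-tokenized input standing in for the MeCab output.
sample = ["猫", "犬", "猫", "鳥", "猫", "犬"]
print(top_words(sample, 2))  # [('猫', 3), ('犬', 2)]
```

`Counter.most_common(n)` sorts by descending count, so the result is already in the order the script prints.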