Last active
December 11, 2015 09:08
Gist: chris-taylor/4577441
Quick script for counting the most common words in a text, ignoring very common words.
| the | |
| be | |
| am | |
| are | |
| is | |
| was | |
| were | |
| been | |
| to | |
| of | |
| and | |
| a | |
| in | |
| that | |
| have | |
| has | |
| had | |
| I | |
| it | |
| for | |
| not | |
| on | |
| with | |
| he | |
| as | |
| you | |
| do | |
| does | |
| did | |
| at | |
| this | |
| but | |
| his | |
| by | |
| from | |
| they | |
| we | |
| say | |
| says | |
| said | |
| her | |
| she | |
| or | |
| an | |
| will | |
| my | |
| one | |
| all | |
| would | |
| there | |
| their | |
| what | |
| so | |
| up | |
| out | |
| if | |
| about | |
| who | |
| get | |
| gets | |
| got | |
| which | |
| where | |
| how | |
| go | |
| me | |
| when | |
| make | |
| makes | |
| made | |
| can | |
| like | |
| likes | |
| liked | |
| time | |
| no | |
| just | |
| him | |
| know | |
| knew | |
| knows | |
| take | |
| takes | |
| took | |
| people | |
| into | |
| year | |
| your | |
| good | |
| some | |
| could | |
| them | |
| see | |
| sees | |
| saw | |
| other | |
| than | |
| then | |
| now | |
| look | |
| looks | |
| looked | |
| only | |
| come | |
| comes | |
| came | |
| its | |
| over | |
| think | |
| thinks | |
| thought | |
| also | |
| back | |
| after | |
| use | |
| uses | |
| used | |
| two | |
| our | |
| work | |
| works | |
| worked | |
| first | |
| well | |
| way | |
| even | |
| new | |
| want | |
| wants | |
| wanted | |
| because | |
| any | |
| these | |
| give | |
| gives | |
| gave | |
| day | |
| most | |
| us | |
| may | |
| more | |
| such | |
| should | |
| very | |
| those | |
| same | |
| must |
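The list above has one word per line, and at least one entry ("I") is capitalized, so it needs lowercasing before it can be compared against lowercased text. A minimal sketch of loading such a list into a set follows. The helper name `load_common_words` and the in-memory sample are assumptions for illustration and are not part of the original gist.

```python
import io

def load_common_words(lines):
    """Return a set of lowercased, non-empty words from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}

# Stand-in for open('common.txt'); the real file has one word per line.
sample = io.StringIO("the\nbe\nI\n\nmust\n")
common_words = load_common_words(sample)
print(sorted(common_words))  # ['be', 'i', 'must', 'the']
```

A set makes each membership check constant time on average, which matters when every word of a large text is tested against the list.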
| from collections import Counter | |
| import string | |
| # Translation table that deletes every ASCII punctuation character. | |
| PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation) | |
| def remove_punctuation(word): | |
|     return word.translate(PUNCTUATION_TABLE) | |
| # Read in the list of common words, lowercased, as a set for fast lookup. | |
| with open('common.txt') as f: | |
|     common_words = {word.strip().lower() for word in f.read().splitlines()} | |
| # Read in the file contents and add the words to a list. Strip punctuation and | |
| # change to lowercase first. | |
| with open('big.txt') as f: | |
|     lines = f.read().splitlines() | |
| words = [] | |
| for line in lines: | |
|     for word in line.split(): | |
|         words.append(remove_punctuation(word).lower()) | |
| # Use a Counter to find the most common words that aren't in common_words. | |
| cnt = Counter(words) | |
| for word in common_words: | |
|     del cnt[word]  # Counter ignores deletion of missing keys | |
| cnt.pop('', None)  # drop the empty string left by punctuation-only tokens | |
| print(cnt.most_common(10)) | |
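The same idea can also be written so that common words are skipped while counting, rather than removed from the `Counter` afterwards. The sketch below is one possible variant, not the gist author's method. The function name `top_words`, its parameters, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def top_words(text, common_words, n=10):
    """Count words in text, skipping common words, and return the n most frequent."""
    counts = Counter()
    for raw in text.split():
        word = raw.translate(PUNCTUATION_TABLE).lower()
        # Skip tokens that were only punctuation, and skip common words.
        if word and word not in common_words:
            counts[word] += 1
    return counts.most_common(n)

sample_text = "The whale, the whale! A ship saw the whale and the ship sank."
print(top_words(sample_text, {"the", "a", "and", "saw"}, n=2))
# [('whale', 3), ('ship', 2)]
```

Filtering during the loop keeps the common words out of the counter entirely, so the result never contains zero-count entries even when the text has fewer than `n` distinct uncommon words.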