@sloria
Created September 1, 2013 20:57
import math
from text.blob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python
II (2002) and Boa vs. Python (2004), both also made-for-TV films.""")
document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")
document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

@prabhatntpc

Dear Sir,

If I have to compute TF-IDF for multiple files stored in a folder, how would this program change?

@jpmallette

Thanks for the code. One small error:

from text.blob import TextBlob as tb
should be
from textblob import TextBlob as tb

@younes0

younes0 commented Oct 6, 2015

I'm an NLP noob. How could I use this with TextBlob classifiers (Bayes/MaxEnt)?

@eggie5

eggie5 commented Dec 3, 2015

I don't know if this is a Python 2 thing, but the division in your tf routine is operating on integers...

@kevincong95

@eggie5, you can add this line to the top to force float division:
from __future__ import division

@RangerWolf

Running this script in Python 2.7 gave a math domain error.
The root cause is that len(bloblist) / (1 + n_containing(word, bloblist)) is integer division, so it will likely evaluate to 0, and the log function then raises an exception.
The same integer division affects this function:

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

As a fix, convert the operands to float before dividing, for example:

def tf(word, blob):
    return (float)(blob.words.count(word)) / (float)(len(blob.words))

I'm not sure what the result is on Python 3...
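
The failure described above can be reproduced without TextBlob. What follows is only a minimal sketch: it assumes a corpus of three documents that all contain the word (illustrative numbers, not taken from the gist's output), and it uses `//` to stand in for what `/` does with two integers on Python 2.

```python
import math

n_docs = 3        # illustrative: number of documents in the corpus
n_with_word = 3   # illustrative: documents that contain the word

# Floor division, which is what "/" does for two ints on Python 2.
floor_ratio = n_docs // (1 + n_with_word)      # 3 // 4 == 0

# True division, as on Python 3 or after converting an operand to float.
true_ratio = n_docs / float(1 + n_with_word)   # 3 / 4.0 == 0.75

try:
    math.log(floor_ratio)   # log(0) is undefined
    domain_error = False
except ValueError:
    domain_error = True

# Defined, and negative because the ratio is below 1.
idf_value = math.log(true_ratio)

print(domain_error, idf_value)
```

With true division the logarithm is defined, though it can still be negative for words that appear in every document, because of the `1 +` in the denominator.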

@ranjanmanish

def idf(word, bloblist):
    return math.log(len(bloblist) / (float)(1 + n_containing(word, bloblist)))

The idf function was throwing a math domain error as well, so I modified it as shown. It worked. Of course, I also incorporated the suggestion by RangerWolf.

@jotixh

jotixh commented Apr 25, 2016

Thanks, the comments are very useful!

@tpatil2

tpatil2 commented Jun 28, 2016

Hi, I am getting incorrect output after following the given steps.
My output:

Top words in document 1
Word: Van, TF-IDF: 0.0
Word: both, TF-IDF: 0.0
Word: including, TF-IDF: 0.0
Top words in document 2
Word: and, TF-IDF: -0.0
Word: among, TF-IDF: 0.0
Word: snakes, TF-IDF: 0.0
Top words in document 3
Word: premium, TF-IDF: 0.0
Word: and, TF-IDF: -0.0
Word: Ian, TF-IDF: 0.0

Please help.

Thank you

@nalindabandara

@tpatil2

Convert everything to float, as below:

def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))

def n_containing(word, bloblist):
    return float(sum(1 for blob in bloblist if word in blob))

def idf(word, bloblist):
    # Take the log of the whole ratio, as in the gist's idf.
    return math.log(float(len(bloblist)) / float(1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return float(tf(word, blob)) * float(idf(word, bloblist))

@aus10powell

Really helpful stepping into the NLP world. Thanks!

@gerraay

gerraay commented Apr 18, 2017

Hello. Is there any way to sum the occurrences of the same word across multiple documents?

I use this function to count a word in a single document:

def tf(word, blob):
    return blob.words.count(word)

Thank you
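
One possible approach to the question above is to apply the same per-document count to every document and add the results. This is a minimal sketch, not the gist author's method: `total_count` and the sample `docs` are hypothetical, and plain lists of lowercase tokens stand in for `blob.words`. TextBlob's own word lists may treat letter case differently from a plain list.

```python
def total_count(word, token_lists):
    """Sum the occurrences of `word` over every document's token list."""
    return sum(tokens.count(word) for tokens in token_lists)

# Hypothetical stand-ins for blob.words from three documents.
docs = [
    "python is a snake".split(),
    "the colt python is a revolver".split(),
    "python python python".split(),
]

print(total_count("python", docs))
```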

@nikhilcheke

I have a few documents stored in a folder. Instead of writing the document text into the .py file, I want to access the documents through code. Please help!
Thanks in advance.

@annabonazzi

annabonazzi commented Aug 11, 2017

Hi @nikhilcheke, I have a similar situation to yours. I am using this solution:

import glob, io, os
folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt") # Makes a list of all .txt files in the folder
bloblist = []
for file1 in files:
    # io.open decodes UTF-8 on both Python 2 and Python 3
    with io.open(file1, 'r', encoding="utf-8") as f:
        data = f.read() # Reads document content into a string
    document = tb(data) # Makes TextBlob object
    bloblist.append(document)

It's working for me.

@vishnuragas

After applying the corrections suggested above, I get no error, but nothing is printed in Jupyter Notebook.

@victoriastuart

victoriastuart commented Dec 6, 2017

@sashavor

sashavor commented Jan 8, 2018

Is it possible to incorporate lemmatization into this process, using TextBlob for instance?

@arullroja

I am using Python 2.7 and I got this error:
scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
AttributeError: 'unicode' object has no attribute 'words'

@adeyemosot

Hi prabhatntpc,

I think this is too late but others can benefit from it.

import glob
import os

files = glob.glob(os.path.join(os.getcwd(), ':/folder', '*.txt' ))

# iterate over the list getting each file
for file in files:
    # open the file and then call .read() to get the text
    with open(file) as f:
        text = f.read()
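
The loop above stops once the text has been read. A slightly fuller sketch, assuming Python 3 and UTF-8 files, could collect every document into a dictionary keyed by file name. The helper name `read_texts` is hypothetical, and the temporary folder exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path

def read_texts(folder):
    """Return {file name: file contents} for every .txt file in `folder`."""
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts

# Demonstration on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second document", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    texts = read_texts(tmp)

print(texts)
```

Each value could then be wrapped with `tb(...)` and appended to `bloblist`, as in the gist.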

@meghanagabhushan

I have this error -
File "C:\Users\megha\Local\Programs\Python\Python37-32\lib\site-packages\textblob\decorators.py", line 38, in decorated
raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.

@1haa

1haa commented Jul 18, 2018

How would I read data from a .txt file?

@sid9394

sid9394 commented Sep 26, 2018

How would I read data from a .txt file?

with open ("abc.txt", "r") as myfile:
     data=myfile.read().replace('\n', '')

Here data will store the contents of your text file.

You can then use the variable "data" as required.

@RathoreRijhu

return log(len(bloblist) / (1 + n_containing(word, bloblist)))
ValueError: math domain error

@PiyushKyushu

Along with the TF-IDF score, I also want the TF score. How can I get it?
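
One way to get both numbers is to keep the term frequency next to the product instead of discarding it. This is a dependency-free sketch under stated assumptions: plain token lists stand in for TextBlob objects, `tf_and_tfidf` and the sample `docs` are hypothetical, and `word in tokens` tests for a whole token, which may not match how the gist's `word in blob` behaves.

```python
import math

def tf(word, tokens):
    # Share of the document's tokens that equal `word`.
    return tokens.count(word) / float(len(tokens))

def idf(word, token_lists):
    # Same formula as the gist: log(N / (1 + documents containing the word)).
    n_containing = sum(1 for tokens in token_lists if word in tokens)
    return math.log(len(token_lists) / float(1 + n_containing))

def tf_and_tfidf(tokens, token_lists):
    """Map each word in `tokens` to a (tf, tf-idf) pair."""
    return {
        word: (tf(word, tokens), tf(word, tokens) * idf(word, token_lists))
        for word in set(tokens)
    }

# Hypothetical three-document corpus.
docs = [
    "python film python".split(),
    "python snake".split(),
    "colt revolver".split(),
]

scores = tf_and_tfidf(docs[0], docs)
for word, (tf_score, tfidf_score) in sorted(scores.items()):
    print("Word: {}, TF: {}, TF-IDF: {}".format(
        word, round(tf_score, 5), round(tfidf_score, 5)))
```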

@noorkosim

def score_tf(query, tokenized_document):
    print('query:', query)
    result = 0.0
    for q in query:
        count = term_frequency(q, tokenized_document)
        tf = 1 + math.log(count)
        print ("count:",count, "\tterm:",q,"\ttf:",tf)
        result = result + tf
    return result

def inverse_document_frequencies(term, documents):
    df = 0
    for d in documents:
        tokenized_d = text2list(d)
        if term in tokenized_d:
            df = df + 1
    return math.log(len(documents)/df)

How do I calculate tf.idf?
def score_tfidf(query)
...........
Please help me!

@iranvir

iranvir commented Dec 13, 2020

Can someone help me understand line number 8?
return sum(1 for blob in bloblist if word in blob)
I don't understand how the sum(1 for ...) statement works. What is the purpose of 1 in there?
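
A short illustration may help with the question above. The expression inside `sum(...)` is a generator expression: it produces the value 1 once for every item that passes the `if` test, so adding those ones up counts the matching items. The sketch below uses plain strings as hypothetical stand-ins for the blobs and compares the one-liner with an explicit loop.

```python
documents = ["python is a snake", "colt python revolver", "a small town"]
word = "python"

# Generator-expression form: yields a 1 for each document that contains the word.
count_with_sum = sum(1 for doc in documents if word in doc)

# Equivalent explicit loop.
count_with_loop = 0
for doc in documents:
    if word in doc:
        count_with_loop += 1

print(count_with_sum, count_with_loop)
```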
