# Check tweets by user, some useful pandas filtering techniques
print(len(df[df['user'] == 'nntaleb']['text'].unique()))  # confirm all are unique
df.groupby('user').count()
200

                 created_at  retweets  text
user
Google                  200       200   200
nntaleb                 200       200   200
realDonaldTrump         200       200   200
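A couple of equivalent sanity checks, in case they're handy (a sketch against the same df):

print(df['user'].value_counts())                      # tweets per user, sorted descending
print(df.duplicated(subset=['user', 'text']).sum())   # 0 would mean no user tweeted the same text twice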
Which tweets are most similar?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
from collections import defaultdict

documents = df['text'].tolist()  # get all tweets into a list

# Since tweets are so short, not too concerned about inverse doc freq normalization schemes
# Just want to find tweets that share similar words
# Since tweets are short, a bigram range helps expand the vocabulary, which ends up having approx 6400 terms
# Aside: why would a bigram vectorizer make a difference if the unigrams are counted once anyway?
# .....well, you get 3 votes, one for each unigram and one for the bigram....so, yeah
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X = vectorizer.fit_transform(documents)
X.shape
(600, 6404)
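To see what the bigram range actually adds, here's a tiny sketch on a made-up two-sentence corpus (not from the dataset):

toy = ["minority rule wins", "majority rule wins"]
uni = CountVectorizer(ngram_range=(1, 1)).fit(toy)
bi = CountVectorizer(ngram_range=(1, 2)).fit(toy)
print(uni.get_feature_names())   # unigrams only
print(bi.get_feature_names())    # unigrams plus bigrams like 'minority rule' - the extra "votes"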
# How to access the vocabulary (handy to know)
vocab = dict(zip(vectorizer.get_feature_names(), X))
# Check some of the longer bigrams....
sorted(vocab.items(), key=lambda x: len(x[0]), reverse=True)[:5]
[('assalehamer ctheofilopoulos',
<1x6404 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>),
('a_epiphanes4 ryansroberts',
<1x6404 sparse matrix of type '<class 'numpy.int64'>'
with 23 stored elements in Compressed Sparse Row format>),
('8z4rob3sf5 dominikleusder',
<1x6404 sparse matrix of type '<class 'numpy.int64'>'
with 23 stored elements in Compressed Sparse Row format>),
('americans overwhelmingly',
<1x6404 sparse matrix of type '<class 'numpy.int64'>'
with 17 stored elements in Compressed Sparse Row format>),
('artandapostasy franklin',
<1x6404 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>)]
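Worth noting: iterating a sparse matrix yields its rows, so the zip above actually pairs feature names with document rows of X (hence the 1x6404 matrices in the output), not with per-term columns. If what you want is the term-to-column mapping, the vectorizer already keeps one; a small sketch:

# vocabulary_ maps each term to its column index in X
col = vectorizer.vocabulary_['americans overwhelmingly']   # bigram seen in the output above
print(col, X[:, col].sum())                                # column index and total count across all tweets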
# better illustrate the overall concept - take 5 tweets and limit features to 20 terms
sample = df['text'].tolist()[:5]
vect = CountVectorizer(min_df=0., max_df=1.0, max_features=20)
Z = vect.fit_transform(sample)
# Original concept https://gist.github.com/larsmans/3745866
print(pd.DataFrame(Z.A, columns=vect.get_feature_names()).to_string())
print("--------------------------------------------------")
print(sample[0])
# sample 0 does not have "11th" or "and", but contains "be", has "for" 2x, etc...
# note the relationship to the "matrix" above, which has a row for each doc and a column for each term
# the i,jth value is the frequency of that term in the document
11th and be billion co first for great has https in national not obama of optimism or party perez the
0 0 0 1 0 0 0 2 0 1 0 0 0 1 0 1 0 1 1 1 2
1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1
2 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1
4 0 0 0 2 0 2 0 0 1 0 2 1 1 1 0 0 0 0 0 2
--------------------------------------------------
Congratulations to Thomas Perez, who has just been named Chairman of the DNC. I could not be happier for him, or for the Republican Party!
# For each of the 600 tweets, we want to find the few most relevant tweets
# Default dict which keys off the index of the dataframe (effectively)
# X[x] is the xth row in the count vectorized matrix X
store = defaultdict(list)
for x in range(600):
    store[x].extend(linear_kernel(X[x], X).flatten().argsort()[:-5:-1])
# Note that the first element of each list is the tweet's own index (a tweet is most similar to itself)
print(store[0])
print (store[1])
print (store[2])
Here's the gist: each document is a row with a count (0 or more) in each column, depending on whether the tweet (document) contains that term (the column index). Take that row as a vector (1 x number of terms) and multiply it by the large overall matrix.
To make the matrix multiplication work you need to transpose the large matrix, so the result has one entry per tweet.
Each entry in the resulting product is a similarity measure between the document in question and one of the other documents.
Below we review the idea with linear_kernel from sklearn, and again using plain numpy linear algebra operations.
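To make the arithmetic concrete with a throwaway example (made-up counts over a 3-term vocabulary, nothing to do with the real X):

import numpy as np   # already imported earlier in the notebook

A = np.array([[1, 0, 2],    # doc 0: term counts
              [1, 1, 0]])   # doc 1: term counts
print(np.dot(A[0], A.T))    # [5 1]: doc 0 vs itself = 1*1 + 2*2 = 5, doc 0 vs doc 1 = 1*1 = 1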
# Let's scope out the 33rd tweet and find some similar tweets - here use the linear_kernel function from sklearn
print(linear_kernel(X[33], X).flatten().argsort()[:-5:-1])
# Here's the same thing, but we use numpy to dot the 33rd row of X against the transpose of X
# Keep in mind X is just the count-vectorized matrix
np.dot(X[33], X.T).toarray().flatten().argsort()[:-5:-1]
# Let's look at the 33rd, 51st, 423rd and 88th tweets
# They all seem to have "Trump" in common
# They all have a twitter short link; this could evolve into a stopword in a "serious analysis"
for idx in [33, 51, 423, 88]:
    print(idx, df.iloc[idx]['text'])
33 'Trump signs bill undoing Obama coal mining rule' https://t.co/yMfT5r5RGh
51 'Remarks by President Trump at Signing of H.J. Resolution 41'
https://t.co/Q3MoCGAc54 https://t.co/yGDDTKm9Br
423 RT @normonics: The Minority Rule. cc @nntaleb https://t.co/RMCNdH8LMG
88 'Majority in Leading EU Nations Support Trump-Style Travel Ban'
Poll of more than 10,000 people in 10 countries...https://t.co/KWsIWhtC9o
Argsort, negative list slicing??
OK, let's step through it from scratch using the 33rd tweet again
# Keep going with the 33rd tweet
np.dot(X[33], X.T).toarray().flatten()[:50]
# Notice that there's a score of 17 in the middle'ish...could that be the 33rd position?
np.dot(X[33], X.T).toarray().flatten()[33]
# Aha! Yes. When you dot a row with itself you get
# a score proportional to the length of the tweet (the sum of its squared term counts)
17
inverse_transform, a method on the vectorizer, returns the non-zero entries (terms) the tweet had.....their count also lines up with the score above, since most terms appear just once.
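A quick sketch of that, using the fitted vectorizer and the 33rd row from above:

# Recover the terms present in the 33rd tweet from its count vector
terms = vectorizer.inverse_transform(X[33])[0]
print(len(terms))   # roughly matches the self-similarity score of 17 when each term appears once
print(terms)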
# Back to argsort - note the very last elements
# argsort gives the indices of the values ordered by rank (lowest score first)
# for instance, the highest scores are the very last ones (note how 33 is last, and its value is 17)
np.dot(X[33], X.T).toarray().flatten().argsort()
# Let's do some list slicing fun
# ::-1 reverses the array and :4 provides the first 4 elements of the reversed sort!
# Hopefully that made some sense!
np.dot(X[33], X.T).toarray().flatten().argsort()[::-1][:4]
array([ 33, 51, 423, 88], dtype=int64)
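If the argsort-plus-slicing idiom still feels opaque, here's a tiny standalone example with made-up scores (nothing to do with the real data):

scores = np.array([3, 17, 0, 2, 5])   # hypothetical similarity scores
print(scores.argsort())               # [2 3 0 4 1] - indices ordered from lowest to highest score
print(scores.argsort()[::-1][:3])     # [1 4 0] - indices of the 3 highest scores
print(scores.argsort()[:-4:-1])       # [1 4 0] - same thing via a negative-step slice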
A semi-pythonic way to get the similar tweets ready to be incorporated back into the dataframe
rel_tweets = defaultdict(list)
for k, v in store.items():
    for tweet in v:
        rel_tweets[k].append(df.iloc[tweet]['text'])
# Check out some related tweets
print("\n".join(rel_tweets[0]))  # "named" is in common, "like" is in common......
print("--------------------------------------------------")
# could use a set to find the intersection of terms
print("\n".join(rel_tweets[9]))  # "given" appears twice, "classified information" in 2 tweets, etc...
Congratulations to Thomas Perez, who has just been named Chairman of the DNC. I could not be happier for him, or for the Republican Party!
Just named General H.R. McMaster National Security Advisor.
Nancy Pelosi and Fake Tears Chuck Schumer held a rally at the steps of The Supreme Court and mic did not work (a mess)-just like Dem party!
@goochthegreat Hi there. We'd like to help. Just to confirm, are you still able to sign into your account? Let us know.
--------------------------------------------------
find the leakers within the FBI itself. Classified information is being given to media that could have a devastating effect on U.S. FIND NOW
The real scandal here is that classified information is illegally given out by "intelligence" like candy. Very un-American!
Information is being illegally given to the failing @nytimes & @washingtonpost by the intelligence community (NSA and FBI?).Just like Russia
The FBI is totally unable to stop the national security "leakers" that have permeated our government for a long time. They can't even......
# I do not feel like joining it back in via a Series, however - let's just use the index of the dataframe
# since we know the X matrix is ordered row-wise identically to the dataframe
# Maybe a map function will help us here (use a lambda z)
# You can call .map on the index of any dataframe
df['rel_tweets'] = df.index.map(lambda z: np.dot(X[z], X.T).toarray().flatten().argsort()[::-1][:4])
df.head()
The nice thing about dataframes is that each rel_tweets entry is an actual list (array) of indices: you can perform the same operations on it as you would on a list, without intermediate conversions (no string splitting or other gymnastics).
Now we can look up the tweet text based on the tweet indices!
We'll use a function called "get_rel_tweets" that takes a value from the rel_tweets column, looks up the tweets by their index with a list comprehension, and returns them for handy use down the line.
def get_rel_tweets(row):
    vals = row[1:]  # ignore the first entry, which is the tweet itself
    return [df.iloc[x]['text'] for x in vals]
df['rel_tweet_text'] = df['rel_tweets'].map(get_rel_tweets)
df.head()
print(df.iloc[595]['text'])
print(df.iloc[595]['rel_tweet_text'])
# Here, cuttheknotmath is driving the text similarity
@CutTheKnotMath Voila. Could not find a clean inequality. https://t.co/F0UwcEIEVj
['@CutTheKnotMath Voila. https://t.co/xUNcf6RnaI', 'The smell of mathematical inequality on Sunday evening. @CutTheKnotMath https://t.co/pKbrczGMII', "Interesting discussion around the speculation of link between lifespan hearbeats/breathing and Jensen's inequality.… https://t.co/qqRJpEmbfA"]
Wait, what about the scores? How related are each of the tweets?
Let's go back to our dot product with the similarity scores, say for the 33rd tweet.
# Now we want the values of np.dot(X[33], X.T).toarray().flatten() as a data structure
# but at the above indices....it's pretty easy
# the below kinda looks crazy, but that's because I didn't use any variable names
np.dot(X[33], X.T).toarray().flatten()[np.dot(X[33], X.T).toarray().flatten().argsort()[::-1][:4]]
array([17, 3, 2, 2], dtype=int64)
# So above we have the best score, then the other 3 scores. How can we get this into the dataframe now?
df['scores'] = df.index.map(
    lambda j: np.dot(X[j], X.T).toarray().flatten()[np.dot(X[j], X.T).toarray().flatten().argsort()[::-1][:4]])
df.head(10)
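As a readability note, the one-liner above can be unpacked into a small helper with intermediate variables; this is just a sketch that should produce the same values:

def top_scores(j, k=4):
    # similarity of tweet j to every tweet (shared-term counts, tweet j itself included)
    sims = np.dot(X[j], X.T).toarray().flatten()
    top_idx = sims.argsort()[::-1][:k]   # indices of the k highest scores, best first
    return sims[top_idx]

# df['scores'] = df.index.map(top_scores)   # equivalent to the lambda version above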
# Get the sentiment for each tweet
df['sentiment'] = df['text'].map(lambda x: TextBlob(x).sentiment)
df['polarity'] = df['sentiment'].map(lambda x: x[0])
df['subjectivity'] = df['sentiment'].map(lambda x: x[1])
df.head()
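For reference, TextBlob's sentiment is a namedtuple of (polarity, subjectivity), which is why indexing with x[0] and x[1] above works; a quick sketch (assuming TextBlob was imported earlier in the notebook, otherwise add the import here):

from textblob import TextBlob

s = TextBlob("I could not be happier for him").sentiment
print(s)                   # Sentiment(polarity=..., subjectivity=...); polarity in [-1, 1], subjectivity in [0, 1]
print(s.polarity == s[0])  # attribute access and index access return the same value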