-
-
Save rodricios/fee45381356c8fb36004 to your computer and use it in GitHub Desktop.
#!/usr/bin/env python | |
# -*- coding: utf-8 -*- | |
""" | |
pip install networkx distance pattern | |
In Flipboard's article[1], they kindly divulge their interpretation | |
of the summarization technique called LexRank[2]. | |
While reading Flipboard's article, you can, if followed point by point, | |
reimplement their summarization algorithm. | |
Here are the steps/excerpts that stood out to me: | |
1. We model sentences as bags of words | |
2. The strength of interaction... [can be measured by] standard | |
metrics for this, such as Jaccard similarity... | |
Note: We skip the normalization step | |
3. The normalized adjacency matrix[3] of the graph is... | |
4. We can compute the PageRank centrality measure for each sentence | |
in the document. | |
[1] http://engineering.flipboard.com/2014/10/summarization/ | |
[2] http://dl.acm.org/citation.cfm?id=1622501 | |
[3] http://en.wikipedia.org/wiki/Adjacency_matrix | |
Note: The following pictures help visualize the mirrored for-loop(?): | |
http://en.wikipedia.org/wiki/Adjacency_matrix#Examples | |
I dont know what the technical name is for that double for-loop. | |
If anyone knows, please send your answers here: | |
https://twitter.com/rodricios | |
""" | |
import distance, operator | |
import networkx as nx | |
from pattern.en import tokenize | |
from pattern.vector import Document,LEMMA | |
def summarize(text_to_summarize): | |
stokens = tokenize(text_to_summarize) | |
# STEP 1 | |
# pattern.vector's Document is a nifty bag-o-words structure, | |
# with a TF weighting scheme | |
docs = [Document(string= s, name=e,stemmer=LEMMA) | |
for e,s in enumerate(stokens) if len(s.split(" ")) > 7] | |
linkgraph = [] | |
# STEP 2 and 3 happen interwovenly | |
for doc in docs: | |
for doc_copy in docs: | |
if doc.name != doc_copy.name: | |
# STEP 2 happens here | |
wordset_a = [x[1] for x in doc.keywords()] | |
wordset_b = [y[1] for y in doc_copy.keywords()] | |
jacc_dist = distance.jaccard(wordset_a, wordset_b) | |
if jacc_dist < 1: | |
linkgraph.append((str(doc.name), #index to sentence | |
str(doc_copy.name),1-jacc_dist)) #dist. score | |
# By the time we reach here, we'd have completed STEP 3 | |
# STEP 4 | |
#I referenced this SO post for help with pagerank'ing | |
#http://stackoverflow.com/questions/9136539/how-to-weighted-edges-affect-pagerank-in-networkx | |
D=nx.DiGraph() | |
D.add_weighted_edges_from(linkgraph) | |
pagerank = nx.pagerank(D) | |
sort_pagerank = sorted(pagerank.items(),key=operator.itemgetter(1)) | |
sort_pagerank.reverse() | |
top2 = sort_pagerank[:2] | |
orderedtop2 = [int(x[0]) for x in top2] | |
orderedtop2 = sorted(orderedtop2) | |
return " ".join([ stokens[i] for i in orderedtop2 ]) | |
if __name__ == "__main__": | |
text = 'Someday I will have a place to put all my collections.\ | |
It will most likely be my basement, or a little corner of my \ | |
basement. But I didn\'t write Star Wars. If I had, I might be \ | |
able to build a museum on the sparkling lakefront of Chicago, \ | |
right next to Soldier Field. George Lucas did write Star Wars, \ | |
and his art and memorabilia collections will be housed in his \ | |
Museum of Narrative Art in the Windy City. Lucas just \ | |
announced that Beijing-based MAD Architects will design the \ | |
museum, while Chicago firm Studio Gang Architects will be \ | |
responsible for the surrounding landscape and a pedestrian \ | |
bridge that links nearby peninsula Northerly Island with the \ | |
city. It should be a stunning addition to the collection of \ | |
shoreline museums, but it has encountered opposition from \ | |
open-space advocates and Bears fans, as the museum will \ | |
occupy part of their tailgating field. In honor of the \ | |
Museum of Narrative Art and its star-studded cast of \ | |
architects, here\'s a roundup of articles from Architizer \ | |
that feature Star Wars-related architecture: Jeff Bennett\'s \ | |
Wars on Kinkade are hilarious paintings that ravage the \ | |
peaceful landscapes of Thomas Kinkade with the brutal \ | |
destruction of Star Wars. It is not unlike a contemporary \ | |
rendering, which combines Sci-fi and Romantic notions, and \ | |
we have examples with ratings. Ra di Martino, a visual artist \ | |
and filmmaker, found the ruins of Star Wars sets, and \ | |
photographed them in her two series, No More Stars (Star Wars) \ | |
and EVERY WORLD\'S A STAGE. These haunting images show a world \ | |
far, far away, now left as ghost towns. These haunting images \ | |
show a world far, far away, now left as ghost towns. We \ | |
explore the designs and the blueprints behind the architecture \ | |
of the Rebel Alliance and the Empire. Artist \u00E9 Delsaux \ | |
photoshops Star Wars characters and ships into everyday \ | |
environments. Stormtroopers roam parking lots, the Millennium \ | |
Falcon visits a Dubai construction site, and the Emperor lurks \ | |
in the suburbs. Aedas appropriates the Sandcrawler for an office \ | |
building, but replaces the weathered, rough brown material \ | |
(COR-TEN?) with shiny glass and the treads with landscaping. \ | |
The story of artist Ralph McQuarrie, the man who helped \ | |
George Lucas realize his visions.' | |
print summarize(text) |
@rodricios I doubt there is such a name for such an unoriginal concept, we just call those nested foreaches. :P
@darkf Yeah, totally I agree. I guess for some reason I just thought there was maybe a distinction between a nested foreach where the outer and inner arrays are the same vs. different.
Might be a little clearer as:
for doc, doc_copy in itertools.combinations(docs, 2):
which also removes the need for the test on .name.
Great suggestion! Someone had mentioned this here on reddit, but you did one further and showed me the code! I unfortunately have only time to respond to comments, but if you'd like to make the necessary changes, I'll merge your fork later 😄
Here is my fork with the itertools.permutations change. combinations won't yield the right result.
Hey thanks! I'm currently updating another project/repository, but once I finish, I will merge it.
Any ideas why the combinations approach wont work?
If you read @thouis comment, there doesnt need to be a:
if doc.name != doc_copy.name
lol, I should have been a little clearer. What I'm wondering is if there's a technical name for the self-nesting bit? Probably not, I'm not guessing...