@cuuupid
Created March 15, 2018 01:55
Identify keywords in text
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires NLTK data, e.g. nltk.download('punkt') and nltk.download('stopwords').
LANG = "english"
tokenizer = Tokenizer(LANG)
stemmer = Stemmer(LANG)
summarizer = Summarizer(stemmer)
summarizer.stop_words = get_stop_words(LANG)
sw = set(stopwords.words(LANG))


def LSA(text):
    # Reduce the text to at most 10 sentences with sumy's LSA summarizer.
    parser = PlaintextParser.from_string(text, tokenizer)
    sentences = [str(s) for s in summarizer(parser.document, 10)]
    return ' '.join(sentences)


def process(text):
    # Summarize, tokenize, then drop stop words; the remaining tokens are the keywords.
    summarized = LSA(text)
    tokenized = word_tokenize(summarized)
    # NLTK's stop-word list is lowercase, so compare case-insensitively.
    filtered = [word for word in tokenized if word.lower() not in sw]
    return filtered

cuuupid commented Mar 15, 2018

Usage in plagiarism checker:

text = ...  # load the text as a string from somewhere and strip special characters with a regex
keywords = process(text)
# Search Google for the keywords via the Google Search API, then rate each match using SSIM
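
The two steps the usage comment leaves open (stripping special characters and rating a match) could be sketched roughly as below. This is only an illustration under stated assumptions: `clean_text` and `rate_match` are hypothetical names that do not appear in the gist, the character class kept by the regex is a guess at what "special chars" means, and `difflib.SequenceMatcher` from the standard library stands in for the SSIM rating the comment mentions, so it is not the author's scoring method. The Google search call itself is left out.

```python
import re
from difflib import SequenceMatcher


def clean_text(raw):
    """Hypothetical helper: keep only ASCII letters, digits, and whitespace."""
    # Assumption: "special chars" means anything outside letters and digits.
    no_specials = re.sub(r"[^A-Za-z0-9\s]", " ", raw)
    # Collapse the whitespace runs left behind by the substitution.
    return re.sub(r"\s+", " ", no_specials).strip()


def rate_match(source, candidate):
    """Hypothetical helper: similarity ratio in [0, 1] between two strings.

    Uses difflib's SequenceMatcher as a stand-in; this is NOT SSIM.
    """
    return SequenceMatcher(None, source.lower(), candidate.lower()).ratio()
```

With these in place, `text = clean_text(raw)` would feed `process(text)`, and each search result could be scored with `rate_match(text, result)`; a higher ratio suggests a closer match.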
