Sooraj smsubrahmannian

  • San Francisco
@smsubrahmannian
smsubrahmannian / LDA in 3 Simple Steps
Last active April 17, 2018 12:02
LDA in 3 Simple Steps
from gensim import models,corpora
""" Step 1: clean up your text and generate a list of words for each document.
I recommend going through an introductory spaCy tutorial first.
The body of the clean_up function is tailored to one specific cleaning task,
so adapt it to your own data. I have provided two examples in the GitHub repo. """
import pandas as pd
import spacy
nlp = spacy.load('en')  # load the language model (spaCy 3+ needs a full package name such as 'en_core_web_sm')
data = pd.read_feather('data/preprocessed_data')  # read a pandas DataFrame stored as a feather file
def clean_up(text):  # clean up the text and return a list of words for the document
    removal = ['ADV', 'PRON', 'CCONJ', 'PUNCT', 'PART', 'DET', 'ADP', 'SPACE']  # part-of-speech tags to drop
    text_out = []
    doc = nlp(text)
    for token in doc:
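The preview stops at the loop header, so the author's actual filtering rules are not visible here. As a minimal sketch of how such a loop might continue, assuming the intent is to drop stop words and any token whose part-of-speech tag is in `removal`, and to keep lowercase lemmas: the helper name `filter_tokens` and the `FakeToken` stand-in are hypothetical and exist only so the example runs without a spaCy model. The attributes `lemma_`, `pos_`, and `is_stop` are real spaCy token attributes, so a real `Doc` could be passed in the same way.

```python
from collections import namedtuple

# POS tags to drop, copied from the `removal` list in the preview.
REMOVAL = ['ADV', 'PRON', 'CCONJ', 'PUNCT', 'PART', 'DET', 'ADP', 'SPACE']


def filter_tokens(doc, removal=REMOVAL):
    """Return lowercase lemmas of the tokens that pass the filters (hypothetical helper)."""
    text_out = []
    for token in doc:
        # Assumed rule: skip stop words and tokens whose POS tag is in `removal`.
        if token.is_stop or token.pos_ in removal:
            continue
        text_out.append(token.lemma_.lower())
    return text_out


# Stand-in for spaCy tokens so the sketch runs without a language model.
FakeToken = namedtuple('FakeToken', ['lemma_', 'pos_', 'is_stop'])
sample_doc = [
    FakeToken('we', 'PRON', True),
    FakeToken('use', 'VERB', False),
    FakeToken('Python', 'PROPN', False),
    FakeToken('and', 'CCONJ', True),
    FakeToken('SQL', 'PROPN', False),
    FakeToken('.', 'PUNCT', False),
]
cleaned = filter_tokens(sample_doc)
print(cleaned)  # ['use', 'python', 'sql']
```

With spaCy installed, `filter_tokens(nlp(text))` would apply the same rules to a real document.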
from gensim import models, corpora
import pyLDAvis.gensim  # this module is named pyLDAvis.gensim_models in pyLDAvis 3.x
# lda_final is the LDA model built with 12 topics
# vis is the pyLDAvis prepared-data object
vis = pyLDAvis.gensim.prepare(lda_final, doc_term_matrix, dictionary, sort_topics=False)
def get_relevant_words(vis, lam=0.3, topn=10):
    a = vis.topic_info
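The body of `get_relevant_words` is cut off after it reads `vis.topic_info`. One plausible continuation, offered as an assumption rather than the author's code, ranks each topic's terms by the relevance score that pyLDAvis uses: `lam * logprob + (1 - lam) * loglift`. The sketch below works on plain dictionaries so it runs without pyLDAvis; the function name `relevant_words` and the toy numbers are hypothetical, while `Term`, `Category`, `logprob`, and `loglift` are column names of the `topic_info` table.

```python
def relevant_words(rows, topic, lam=0.3, topn=10):
    """Rank the terms of one topic by relevance = lam*logprob + (1-lam)*loglift."""
    scored = [
        (lam * r['logprob'] + (1 - lam) * r['loglift'], r['Term'])
        for r in rows
        if r['Category'] == topic
    ]
    scored.sort(reverse=True)  # highest relevance first
    return [term for _, term in scored[:topn]]


# Toy rows with made-up numbers, shaped like records from vis.topic_info.
rows = [
    {'Category': 'Topic1', 'Term': 'python', 'logprob': -1.0, 'loglift': 0.2},
    {'Category': 'Topic1', 'Term': 'hive', 'logprob': -3.0, 'loglift': 2.0},
    {'Category': 'Topic1', 'Term': 'the', 'logprob': -0.5, 'loglift': -1.0},
    {'Category': 'Topic2', 'Term': 'css', 'logprob': -1.0, 'loglift': 1.0},
]
top_terms = relevant_words(rows, 'Topic1', lam=0.3, topn=2)
print(top_terms)  # ['hive', 'python']
```

With a real pyLDAvis object, the rows could be obtained with `vis.topic_info.to_dict('records')`. A low `lam` favours terms that are distinctive for a topic, while `lam=1.0` ranks purely by within-topic probability.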
smsubrahmannian / Relevant words
Created April 7, 2018 07:46
Relevant words for top topics
Topic,words with Relevance
Topic9,"{'sqoop', 'kafka', 'cassandra', 'hdfs', 'hbase', 'hive', 'pig', 'impala', 'flume', 'oozie'}"
Topic10,"{'jquery', 'xml', 'css', 'eclipse', 'html', 'c', 'ajax', 'django', 'javascript', 'php'}"
Topic12,"{'sas', 'powerpoint', 'python', 'r', 'excel', 'matlab', 'spss', 'sql', 'word', 'stata'}"
Topic7,"{'classification', 'svm', 'learn', 'k', 'scikit', 'pandas', 'regression', 'matplotlib', 'scipy', 'numpy'}"
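The word lists in this file are stored as Python set literals inside quoted CSV fields, so they come back as strings when the file is read. A minimal sketch of turning them back into sets, using only the standard library: the variable names are illustrative, and the inline `csv_text` reproduces two rows from the file above in place of reading it from disk.

```python
import ast
import csv
import io

# Two rows copied from the CSV above; a real run would open the file instead.
csv_text = '''Topic,words with Relevance
Topic9,"{'sqoop', 'kafka', 'cassandra', 'hdfs', 'hbase', 'hive', 'pig', 'impala', 'flume', 'oozie'}"
Topic10,"{'jquery', 'xml', 'css', 'eclipse', 'html', 'c', 'ajax', 'django', 'javascript', 'php'}"
'''

reader = csv.DictReader(io.StringIO(csv_text))
# ast.literal_eval safely parses the set literal without executing arbitrary code.
topic_words = {
    row['Topic']: ast.literal_eval(row['words with Relevance'])
    for row in reader
}
print(sorted(topic_words['Topic9'])[:3])  # ['cassandra', 'flume', 'hbase']
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, which is the safer choice for data files.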
smsubrahmannian / Relevant_words.csv
Last active April 16, 2018 08:54
Tables in Medium
Topic,Relevant words,Token percentage
Topic12,"{'sas', 'powerpoint', 'python', 'r', 'excel', 'matlab', 'spss', 'sql', 'word', 'stata'}",21%
Topic7,"{'classification', 'svm', 'learn', 'k', 'scikit', 'pandas', 'regression', 'matplotlib', 'scipy', 'numpy'}",20.3%
Topic9,"{'sqoop', 'kafka', 'cassandra', 'hdfs', 'hbase', 'hive', 'pig', 'impala', 'flume', 'oozie'}",17.5%
Topic10,"{'jquery', 'xml', 'css', 'eclipse', 'html', 'c', 'ajax', 'django', 'javascript', 'php'}",14.4%
Topic,Proposed Topic Name
Topic12,Business Analyst skills
Topic7,Data Science Skills
Topic9,Database skills
Topic10,Web Developer skills