Stefano Bosisio Steboss89

@Steboss89
Steboss89 / countvectorizer.py
Created June 2, 2022 21:28
CountVectorizer and cosine similarity to measure the level of similarity across texts
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def create_heatmap(similarity, cmap="YlGnBu"):
    df = pd.DataFrame(similarity)
    # row/column order of the similarity matrix: john 0, luke 1, mark 2, matt 3
    df.columns = ['john', 'luke', 'mark', 'matt']
    df.index = ['john', 'luke', 'mark', 'matt']
    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(df, cmap=cmap, ax=ax)
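
The idea behind this gist (turn each text into a vector of word counts, then compare the vectors with cosine similarity) can be illustrated without any third-party library. What follows is only a minimal sketch in plain Python, not the gist's code: the helper names `count_vector`, `cosine` and `similarity_matrix` are hypothetical, the tokenizer is a simple regular expression rather than CountVectorizer's, and the four short strings are toy inputs standing in for the gospel texts.

```python
import math
import re
from collections import Counter


def count_vector(text):
    """Lowercase the text and count each word (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    # only words present in both texts contribute to the dot product
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(texts):
    """Pairwise cosine similarity, one row and one column per text."""
    vectors = [count_vector(t) for t in texts]
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical toy inputs, one string per text to compare.
texts = [
    "in the beginning was the word",
    "the word was with god",
    "a voice crying in the wilderness",
    "the book of the generation",
]
similarity = similarity_matrix(texts)
```

The resulting list of lists is square and symmetric with ones on the diagonal, so it has the shape that `create_heatmap` expects for four texts.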
Steboss89 / tsne_word_embd.py
Last active June 2, 2022 18:09
T-SNE function for word embeddings
def get_word_frequencies(text):
    r"""This function returns a Counter with the most common words
    in a given text

    Parameters
    ----------
    text: df['text'].tolist()

    Returns
    -------
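
The preview above stops before the function body. As an illustration only, counting words across a list of documents with `collections.Counter` could look like the sketch below. The name `word_frequencies_sketch` and the regular-expression tokenizer are assumptions, not the gist's implementation.

```python
import re
from collections import Counter


def word_frequencies_sketch(texts):
    """Count word occurrences across a list of document strings."""
    counts = Counter()
    for doc in texts:
        # lowercase, then keep runs of letters as words
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


# Hypothetical input in the same shape as df['text'].tolist()
freqs = word_frequencies_sketch(["The lord is my shepherd", "the LORD reigns"])
top_words = freqs.most_common(2)
```

`Counter.most_common(n)` returns the `n` most frequent words with their counts, which is the usual starting point for a frequency table or for picking the words to plot.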
Steboss89 / word_embd.py
Created June 1, 2022 21:53
Create word embeddings
import multiprocessing

import pandas as pd

# use the list of documents as a column in a dataframe
df = pd.DataFrame(data, columns=["text"])


def get_word2vec(text):
    r"""
    Parameters
    ----------
    text: list of str, the texts from the dataframe, df['text'].tolist()"""
    num_workers = multiprocessing.cpu_count()
Steboss89 / LDA_oldtest.py
Created June 1, 2022 21:01
Run LDA on Old Testament
def format_topics_sentences(ldamodel, corpus):
    r"""This function associates each review with its dominant topic

    Parameters
    ----------
    ldamodel: gensim LDA model
        the fitted LDA model
    corpus: gensim corpus
        this is the corpus from the reviews
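
The preview stops before the function body, so the following is only a sketch of the core step, picking the dominant topic per document. It assumes each document's topic distribution is already available as a list of `(topic_id, probability)` pairs, which is the shape a gensim LDA model is expected to return for a bag-of-words document with default settings. The function name `dominant_topics` and the sample distributions are hypothetical.

```python
def dominant_topics(doc_topics):
    """Return (topic_id, probability) of the strongest topic per document."""
    result = []
    for topics in doc_topics:
        if not topics:
            # no topic assigned to this document
            result.append((None, 0.0))
            continue
        # the dominant topic is the pair with the highest probability
        result.append(max(topics, key=lambda pair: pair[1]))
    return result


# Hypothetical topic distributions for three documents.
doc_topics = [
    [(0, 0.10), (1, 0.85), (2, 0.05)],
    [(0, 0.60), (2, 0.40)],
    [(2, 0.99)],
]
dominant = dominant_topics(doc_topics)
```

Each entry of `dominant` lines up with one document, so the result can be stored as two extra dataframe columns (topic id and its probability).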
Steboss89 / LDA_oldtest.py
Created May 31, 2022 21:01
LDA on Old Testament books
def format_topics_sentences(ldamodel, corpus):
    r"""This function associates each review with its dominant topic

    Parameters
    ----------
    ldamodel: gensim LDA model
        the fitted LDA model
    corpus: gensim corpus
        this is the corpus from the reviews
Steboss89 / results.csv
Created May 17, 2022 09:14
Results from BrainGB for the HIV dataset
Design           Method              HIV
                                     Accuracy         F1               AUC
GNN              Connection profile  65.71 +/- 13.85  64.11 +/- 13.99  75.10 +/- 16.95
Message Passing  Node concat         70.00 +/- 15.91  68.83 +/- 17.57  77.96 +/- 8.20
Attention        Node concat         71.43 +/- 9.04   70.47 +/- 9.26   82.04 +/- 11.21
Pooling          Concat pooling      65.71 +/- 13.85  64.11 +/- 13.99  75.10 +/- 16.95
Old Testament    New Testament
beauty           angels
families         apostles/disciples
cattle           baptized
cubits           charity
covenant         cross/crucified
valiant          faith
Steboss89 / normcdf_hmean.py
Created April 13, 2022 10:40
Relate a word's occurrence to the harmonic mean (hmean) of the normal CDF (normcdf) of its total rate and its class rate
from scipy.stats import hmean, norm


def normcdf(x):
    # CDF of each value under a normal distribution fitted to x itself
    return norm.cdf(x, x.mean(), x.std())


# rate of a word: word_occurrence_old_test / total occurrences of the word
term_freq_df.loc[:, 'old_rate'] = term_freq_df[0] * 1. / term_freq_df['total']
# rate at which the word appears within a class, here the Old Testament: word_occurrence_old_test / total_old_test
term_freq_df.loc[:, 'old_freq_pct'] = term_freq_df[0] * 1. / term_freq_df[0].sum()
# combine the total rate and the class rate with the harmonic mean, to favor the most unique and specific words
term_freq_df.loc[:, 'old_hmean'] = term_freq_df.apply(lambda x: (hmean([x['old_rate'], x['old_freq_pct']]) if x['old_rate'] > 0 and x['old_freq_pct'] > 0 else 0), axis=1)
# normcdf: where old_rate or old_freq_pct lies in its distribution, expressed as a cumulative probability
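
The gist description points at a final step that the preview does not show: combining the normal CDFs of the two rates with a harmonic mean. A standard-library sketch of that combination, under stated assumptions, is given below. The word counts are made up for illustration, the dictionary names are hypothetical, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
from statistics import NormalDist, harmonic_mean, mean, stdev


def normcdf(values):
    """CDF of each value under a normal fitted to the values themselves."""
    # stdev is the sample standard deviation, like pandas' Series.std()
    dist = NormalDist(mean(values), stdev(values))
    return [dist.cdf(v) for v in values]


# Hypothetical counts: word -> (occurrences in the Old Testament, occurrences overall)
counts = {"lord": (90, 100), "god": (40, 100), "israel": (45, 50), "jesus": (0, 60)}

old_total = sum(old for old, _ in counts.values())
# share of each word's occurrences that fall in the Old Testament
old_rate = [old / total for old, total in counts.values()]
# share of all Old Testament occurrences taken by each word
old_freq_pct = [old / old_total for old, _ in counts.values()]

scores = {}
for word, rate_cdf, freq_cdf in zip(counts, normcdf(old_rate), normcdf(old_freq_pct)):
    # harmonic mean of the two cumulative probabilities
    scores[word] = harmonic_mean([rate_cdf, freq_cdf])

most_characteristic = max(scores, key=scores.get)
```

The harmonic mean stays low unless both inputs are high, so a word scores well only when it is both concentrated in the class and frequent within it. Mapping the rates through the CDF first puts the two measures on the same 0 to 1 scale before they are combined.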
Steboss89 / word_appearance.csv
Created April 13, 2022 09:42
Old vs New Testament, word occurrences
Word      Old T %   New T %
lord      2.34      0.92
god       1.28      1.84
israel    0.82      0.09
king      0.78      0.10
people    0.70      0.23
jesus     0.00      1.18
man       0.65      1.05
christ    0.07      0.90
father    0.21      0.46