Steboss89’s gists

Steboss89 / countvectorizer.py

Created June 2, 2022 21:28

CountVectorizer and cosine similarity to get the level of similarity across texts

	import matplotlib.pyplot as plt
	import pandas as pd
	import seaborn as sns
	from sklearn.feature_extraction.text import CountVectorizer
	from sklearn.metrics.pairwise import cosine_similarity


	def create_heatmap(similarity, cmap="YlGnBu"):
	    # label rows and columns by gospel: john 0, luke 1, mark 2, matt 3
	    labels = ['john', 'luke', 'mark', 'matt']
	    df = pd.DataFrame(similarity, index=labels, columns=labels)
	    fig, ax = plt.subplots(figsize=(5, 5))
	    sns.heatmap(df, cmap=cmap, ax=ax)
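
The preview is cut off before the similarity matrix is built. As a rough illustration of what counting words and then taking cosine similarity produces, here is a standard-library-only sketch. It is a stand-in for scikit-learn's `CountVectorizer` and `cosine_similarity`, not their implementation; the helper names `count_vectors` and `cosine` and the four toy texts are hypothetical.

```python
import math
import re
from collections import Counter

# Mirrors CountVectorizer's default token rule: lowercase the text,
# then keep runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def count_vectors(texts):
    """Return (vocabulary, rows), where each row holds raw term counts."""
    counts = [Counter(TOKEN.findall(t.lower())) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy texts standing in for the four gospels.
texts = {
    "john": "in the beginning was the word",
    "luke": "the word was in the beginning",
    "mark": "a voice crying in the wilderness",
    "matt": "the book of the generation",
}
vocab, rows = count_vectors(list(texts.values()))

# 4x4 nested list: similarity[i][j] compares text i with text j.
similarity = [[cosine(r1, r2) for r2 in rows] for r1 in rows]
```

Because only word counts are compared, "john" and "luke" in this toy data score as practically identical even though the word order differs. The nested list has the 4x4 shape that `create_heatmap` labels with the four names.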

Steboss89 / tsne_word_embd.py

Last active June 2, 2022 18:09

T-SNE function for word embeddings

	def get_word_frequencies(text):
	    r"""Return a Counter with the most common words
	    in a given text.

	    Parameters
	    ----------
	    text: list of str, e.g. df['text'].tolist()

	    Returns
	    -------
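
The preview ends inside the docstring, so the body of `get_word_frequencies` is not visible. A minimal sketch of one way such a function could work, assuming whitespace tokenization and lowercasing (both assumptions, as are the name `get_word_frequencies_sketch` and the sample documents):

```python
from collections import Counter


def get_word_frequencies_sketch(texts):
    """Count how often each lowercased, whitespace-separated word occurs."""
    counter = Counter()
    for doc in texts:
        counter.update(doc.lower().split())
    return counter


# Hypothetical documents, in the list-of-strings shape of df['text'].tolist().
docs = ["the lord is my shepherd", "the lord reigns"]
freq = get_word_frequencies_sketch(docs)

# The two most frequent words with their counts.
top_two = freq.most_common(2)
```

`Counter.most_common(n)` returns `(word, count)` pairs sorted by count, which is usually all that is needed to pick the words worth plotting.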

Steboss89 / word_embd.py

Created June 1, 2022 21:53

Create word embeddings

	import multiprocessing

	import pandas as pd

	# use the list of documents as a column in a dataframe
	df = pd.DataFrame(data, columns=["text"])


	def get_word2vec(text):
	    r"""
	    Parameters
	    ----------
	    text: list of str, the text from the dataframe, df['text'].tolist()
	    """
	    num_workers = multiprocessing.cpu_count()

Steboss89 / LDA_oldtest.py

Created June 1, 2022 21:01

Run LDA on Old Testament

	def format_topics_sentences(ldamodel, corpus):
	    r"""Associate each review with its dominant topic.

	    Parameters
	    ----------
	    ldamodel: gensim LDA model
	        The current LDA model calculated

	    corpus: gensim corpus
	        This is the corpus from the reviews
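
The function body is not shown in the preview. The core step, picking the dominant topic for each document, can be sketched without gensim. This assumes each document's topics arrive as a list of `(topic_id, probability)` pairs, which is the shape a gensim LDA model yields per document; the helper `dominant_topic` and the numbers below are hypothetical.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


# Hypothetical per-document topic distributions.
doc_topics = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.62), (2, 0.38)],
]

# One (topic_id, probability) pair per document.
dominant = [dominant_topic(tp) for tp in doc_topics]
```

With a real model, `doc_topics` would be replaced by iterating over `ldamodel[corpus]`.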

Steboss89 / LDA_oldtest.py

Created May 31, 2022 21:01

LDA on Old Testament books

	def format_topics_sentences(ldamodel, corpus):
	    r"""Associate each review with its dominant topic.

	    Parameters
	    ----------
	    ldamodel: gensim LDA model
	        The current LDA model calculated

	    corpus: gensim corpus
	        This is the corpus from the reviews

Steboss89 / results.csv

Created May 17, 2022 09:14

Results from BrainGB for the HIV dataset

Design	Method	HIV Accuracy	HIV F1	HIV AUC
GNN	Connection profile	65.71 +/- 13.85	64.11 +/- 13.99	75.10 +/- 16.95
Message Passing	Node concat	70.00 +/- 15.91	68.83 +/- 17.57	77.96 +/- 8.20
Attention	Node concat	71.43 +/- 9.04	70.47 +/- 9.26	82.04 +/- 11.21
Pooling	Concat pooling	65.71 +/- 13.85	64.11 +/- 13.99	75.10 +/- 16.95

Steboss89 / normcdf_words.csv

Created April 13, 2022 13:00

	Old Testament	New Testament
	beauty	angels
	families	apostles/disciples
	cattle	baptized
	cubits	charity
	covenant	cross/crucified
	valiant	faith

Steboss89 / normcdf_words.csv

Created April 13, 2022 13:00

	Old Testament, New Testament
	beauty, angels
	families, apostles/disciples
	cattle, baptized
	cubits, charity
	covenant, cross/crucified
	valiant, faith

Steboss89 / normcdf_hmean.py

Created April 13, 2022 10:40

Relate a word's occurrence to the harmonic mean (hmean) of the normal CDF (normcdf) of its total rate and class rate

	from scipy.stats import hmean, norm


	def normcdf(x):
	    # normal CDF of each value, using the mean and std of x itself
	    return norm.cdf(x, x.mean(), x.std())

	# overall rate of a word: word_occurrence_old_test / total
	term_freq_df.loc[:, 'old_rate'] = term_freq_df[0] * 1. / term_freq_df['total']
	# rate at which the word appears within a class, here the Old Testament: word_occurrence_old_test / total_old_test
	term_freq_df.loc[:, 'old_freq_pct'] = term_freq_df[0] * 1. / term_freq_df[0].sum()
	# combine the total rate and the class rate with the harmonic mean, to give weight to the most unique and specific words
	term_freq_df.loc[:, 'old_hmean'] = term_freq_df.apply(
	    lambda x: (hmean([x['old_rate'], x['old_freq_pct']])
	               if x['old_rate'] > 0 and x['old_freq_pct'] > 0 else 0),
	    axis=1)
	# normcdf: where old_rate or old_freq_pct lies in its distribution, cumulatively
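
The preview stops before the normal CDF is applied. Going by the gist's description, the remaining step scores each word by the harmonic mean of the normal CDF of its two rates. The sketch below shows that idea with the standard library only, as a stand-in for the pandas and scipy calls in the gist. The word counts are hypothetical, and the exact columns the author combines are an assumption.

```python
from statistics import NormalDist, harmonic_mean, mean, stdev

# Hypothetical counts: word -> (occurrences in Old Testament, in New Testament).
counts = {"lord": (60, 10), "israel": (30, 2), "jesus": (0, 40), "faith": (5, 25)}

old_total = sum(old for old, _ in counts.values())

# Overall rate: share of a word's occurrences that fall in the Old Testament.
old_rate = {w: old / (old + new) for w, (old, new) in counts.items()}
# Class rate: share of all Old Testament occurrences taken by this word.
old_freq_pct = {w: old / old_total for w, (old, _) in counts.items()}


def normcdf(values):
    """Map each value to its normal CDF, using the sample mean and std."""
    vals = list(values.values())
    dist = NormalDist(mean(vals), stdev(vals))
    return {w: dist.cdf(v) for w, v in values.items()}


rate_cdf = normcdf(old_rate)
freq_cdf = normcdf(old_freq_pct)

# Harmonic mean of the two CDF scores: high only when both are high.
old_score = {w: harmonic_mean([rate_cdf[w], freq_cdf[w]]) for w in counts}

# Words ordered from most to least characteristic of the Old Testament.
ranked = sorted(old_score, key=old_score.get, reverse=True)
```

The harmonic mean is dominated by the smaller of its two inputs, so a word ranks highly only if it is both concentrated in the Old Testament and frequent within it.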

Steboss89 / word_appearance.csv

Created April 13, 2022 09:42

Old vs New Testament, word occurrences

Word	Old Testament %	New Testament %
lord	2.34	0.92
god	1.28	1.84
israel	0.82	0.09
king	0.78	0.10
people	0.70	0.23
jesus	0.00	1.18
man	0.65	1.05
christ	0.07	0.90
father	0.21	0.46

Stefano Bosisio Steboss89