lisanka93’s gists

lisanka93 / mergecols.py

Last active July 13, 2020 15:26

Merging several columns in python pandas

	#join the non-null values of every column after the first into one space-separated string per row
	df['ColumnA'] = df[df.columns[1:]].apply(
	    lambda x: ' '.join(x.dropna().astype(str)),
	    axis=1
	)
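
A minimal sketch of how the expression above behaves on missing values. The DataFrame and its column names (`id`, `part1`, `part2`) are made up for illustration and are not from the original gist.

```python
import pandas as pd

# Made-up data: the first column is an id, the others hold text fragments.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "part1": ["red", None, None],
    "part2": ["apple", "pear", None],
})

# Same expression as in the gist: skip column 0, drop missing values,
# then join what is left with single spaces.
df["ColumnA"] = df[df.columns[1:]].apply(
    lambda x: " ".join(x.dropna().astype(str)),
    axis=1,
)

merged = df["ColumnA"].tolist()
print(merged)
```

A row whose source columns are all missing ends up as an empty string rather than NaN, which may need handling downstream.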

lisanka93 / stopword_removal.py

Created July 13, 2020 15:37

removing stopwords with NLTK

	from nltk.corpus import stopwords
	from nltk.tokenize import word_tokenize

	example_sent = "This is a sample sentence, showing off the stop words filtration."
	stop_words = set(stopwords.words('english'))
	word_tokens = word_tokenize(example_sent)

	filtered_sentence = [w for w in word_tokens if w not in stop_words]
	print(filtered_sentence)
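
NLTK's English stop-word list is lowercase, so a case-sensitive membership test lets capitalised tokens such as "This" through. The sketch below shows the difference in plain Python. The token list is written by hand and the five-word `stop_words` set is a hypothetical stand-in for `stopwords.words('english')`, so nothing needs to be downloaded to run it.

```python
# Hand-written tokens standing in for word_tokenize(example_sent).
word_tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
               "off", "the", "stop", "words", "filtration", "."]

# Hypothetical lowercase stop list, a small stand-in for the NLTK one.
stop_words = {"this", "is", "a", "off", "the"}

# Case-sensitive test, as in the gist: "This" is kept.
case_sensitive = [w for w in word_tokens if w not in stop_words]

# Lowercase each token before the test so "This" is removed as well.
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```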

lisanka93 / ngrams.py

Created July 13, 2020 15:53

NLTK ngrams, bigrams and trigrams

	from nltk.tokenize import word_tokenize
	from nltk.util import ngrams, bigrams, trigrams

	sen = "Dummy sentence to demonstrate bigrams"
	#using word_tokenize from NLTK rather than split() because split() does not separate punctuation from words
	nltk_tokens = word_tokenize(sen)

	#splitting sentence into bigrams and trigrams
	print(list(bigrams(nltk_tokens)))
	print(list(trigrams(nltk_tokens)))

	#creating a dictionary that shows occurrences of n-grams in text
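
The gist ends at the comment above, so the author's own counting code is not shown. One possible way to build such a dictionary is sketched here with `collections.Counter`. The helper `simple_ngrams` is hypothetical and stands in for `nltk.util.ngrams` so the example runs without NLTK, and the sample sentence is made up.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Return consecutive n-token tuples (hypothetical stand-in for nltk.util.ngrams)."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up text, tokenized with split() because it contains no punctuation.
tokens = "the cat sat on the mat and the cat slept".split()

# Counter maps each n-gram tuple to the number of times it occurs.
bigram_counts = Counter(simple_ngrams(tokens, 2))
trigram_counts = Counter(simple_ngrams(tokens, 3))

print(dict(bigram_counts))
print(bigram_counts.most_common(1))
```

`Counter` is a `dict` subclass, so the result can be indexed by n-gram tuple like any dictionary.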

lisanka93 / regex_punct.py

Created July 13, 2020 15:59

regular expressions to remove punctuation

	import re

	#letters only
	raw_text = "this is a test. To demonstrate 2 regex expressions!!"
	letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

	#keep numbers
	letnum_text = re.sub(r"[^a-zA-Z0-9\s]+", " ", raw_text)
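
Both substitutions replace characters with spaces, so the results contain runs of consecutive spaces and trailing whitespace. A small sketch of one way to tidy that up afterwards follows. The `*_clean` variable names are made up, and the patterns are written as raw strings so that `\s` is passed to the regex engine unchanged.

```python
import re

raw_text = "this is a test. To demonstrate 2 regex expressions!!"

# Same two patterns as in the gist.
letters_only_text = re.sub(r"[^a-zA-Z]", " ", raw_text)
letnum_text = re.sub(r"[^a-zA-Z0-9\s]+", " ", raw_text)

# split() with no arguments splits on any whitespace run and drops empty
# strings, so joining the pieces back collapses repeated spaces.
letters_only_clean = " ".join(letters_only_text.split())
letnum_clean = " ".join(letnum_text.split())

print(letters_only_clean)
print(letnum_clean)
```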

lisanka93 / lem_stem.py

Created July 13, 2020 16:09

lemmatisation and stemming with NLTK

	from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer

	stemmer = PorterStemmer()
	lemmatizer = WordNetLemmatizer()

	word = "considering"

	stemmed_word = stemmer.stem(word)
	lemmatised_word = lemmatizer.lemmatize(word)
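
The snippet imports `SnowballStemmer` without using it. As a rough sketch, the two stemmers can be compared side by side on a few inflected forms. The word list is made up, and the lemmatizer is left out here because it needs the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name

# Made-up inflected forms of the same verb.
words = ["considering", "considered", "considers"]

porter_stems = [porter.stem(w) for w in words]
snowball_stems = [snowball.stem(w) for w in words]

for word, p, s in zip(words, porter_stems, snowball_stems):
    print(word, p, s)
```

Stems are truncated forms and need not be dictionary words, which is the main practical difference from lemmatisation.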

lisanka93 / Dummy movie dataset.ipynb

Created July 13, 2020 16:12

notebook dummy movie dataset

lisanka93 / bow.csv

Created August 3, 2020 16:07

Bag of words model explained

lisanka93 / read_in_covid_data.py

Created August 3, 2020 16:24

reading and preprocessing anti covid-19 vaccine arguments

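	import pandas as pd
	#preprocess is a text-cleaning function that must be defined before this runs (it is not part of this gist)
	covid_data = pd.read_csv('covid_vacc_concerns.csv')
	covid_data['prep_arg'] = covid_data['arg'].apply(preprocess)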

lisanka93 / train_test.py

Created August 3, 2020 16:25

splitting data into train and test set

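	from sklearn.model_selection import train_test_split

	#hold out 20% of the rows for testing; random_state makes the split reproducible
	X_train, X_test, y_train, y_test = train_test_split(
	    covid_data['prep_arg'],
	    covid_data['concern'],
	    test_size=0.2,
	    random_state=50
	)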

lisanka93 / countvectoriser.py

Created August 3, 2020 16:26

instantiating CountVectorizer, learning the vocabulary from the training set and transforming the arguments into vectors

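	from sklearn.feature_extraction.text import CountVectorizer

	count_vectorizer = CountVectorizer(binary=True)
	#learn the vocabulary from the training data and transform it
	training_data = count_vectorizer.fit_transform(X_train)

	#transform test data using the vocabulary learned from the training data
	testing_data = count_vectorizer.transform(X_test)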
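
To make the resulting vectors concrete, here is a small sketch on toy input. The sentences are made up and stand in for `X_train` and `X_test`; they are not taken from the covid dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for X_train and X_test.
train_texts = ["vaccines were rushed rushed", "side effects are unknown"]
test_texts = ["unknown ingredients rushed"]

# binary=True records presence (1) or absence (0) instead of counts.
count_vectorizer = CountVectorizer(binary=True)
training_data = count_vectorizer.fit_transform(train_texts)
testing_data = count_vectorizer.transform(test_texts)

# vocabulary_ maps each term to its column; columns follow sorted term order.
vocab = sorted(count_vectorizer.vocabulary_)
train_rows = training_data.toarray().tolist()
test_rows = testing_data.toarray().tolist()

print(vocab)
print(train_rows)
print(test_rows)
```

The repeated word "rushed" is recorded as 1 because of `binary=True`, and "ingredients" is ignored in the test vector because it was not in the vocabulary learned from the training texts.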

Lisa Andreevna lisanka93