Vibhu Jawa (VibhuJawa)
🏠 Working from home
  • Nvidia
  • Santa Clara
@VibhuJawa
VibhuJawa / chunksize_experiments.ipynb
Last active August 8, 2019 04:11
This notebook showcases how chunk sizes impact performance
@VibhuJawa
VibhuJawa / collab_py_37.sh
Created July 28, 2019 19:41
Install RAPIDS with Python 3.7 on Google Colab
#!/bin/bash
set -eu
wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/env-check.py
echo "Checking for GPU type:"
python env-check.py
if [ ! -f Miniconda3-4.5.4-Linux-x86_64.sh ]; then
  echo "Removing conflicting packages, will replace with RAPIDS compatible versions"
@VibhuJawa
VibhuJawa / normalized_count_array.py
Created July 25, 2019 23:59
normalized_count_array
import numpy as np
normalized_count_array = count_dary / np.sum(count_dary, axis=1)[:, None]
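For context, a small self-contained sketch of the same row normalization; the toy matrix below is illustrative and not from the original gist:

import numpy as np

# Toy count matrix: 2 documents x 3 tokens (illustrative values).
count_dary = np.array([[2.0, 1.0, 1.0],
                       [0.0, 3.0, 1.0]])

# Divide each row by its row sum so every row sums to 1.
normalized_count_array = count_dary / np.sum(count_dary, axis=1)[:, None]
print(normalized_count_array)  # [[0.5 0.25 0.25] [0. 0.75 0.25]]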
@VibhuJawa
VibhuJawa / gutenburg_read_tokenize_gv100_run.ipynb
Last active August 26, 2019 18:27
Gutenberg read and tokenize, GV100 run
## chunksize = 1.9M, 10% of dataset
small_df = df.head(1_900_000).copy(deep=True)
%time output_df = preprocess_text_df(small_df, filter_regex=filters_regex)
## chunksize = 950K, 5% of dataset
small_df = df.head(950_000).copy(deep=True)
%time output_df = preprocess_text_df(small_df, filter_regex=filters_regex)
## chunksize = 190K, 1% of dataset
small_df = df.head(190_000).copy(deep=True)
%time output_df = preprocess_text_df(small_df, filter_regex=filters_regex)
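preprocess_text_df itself is not shown in this snippet. A minimal sketch, assuming it simply applies preprocess_text (defined further down) to a text column of the chunk; the column name "text" is an assumption:

# Hypothetical wrapper: run preprocess_text over one chunk of the dataframe.
def preprocess_text_df(df, filter_regex):
    # "text" is an assumed column name; the real gist may use a different one.
    df['text'] = preprocess_text(df['text'], filter_regex)
    return df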
author_id = author_name_ls.index('Charles Dickens')
for index in output_indices_umap[author_id]:
    print(author_name_ls[int(index)])
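output_indices_umap is not built in this snippet. One plausible way to get per-author neighbor indices from UMAP embeddings, using umap-learn and scikit-learn; the embedding input and the n_neighbors value are assumptions:

# Hypothetical sketch: nearest authors in UMAP space.
import umap
from sklearn.neighbors import NearestNeighbors

embedding = umap.UMAP(n_components=2).fit_transform(normalized_count_array)
nn = NearestNeighbors(n_neighbors=5).fit(embedding)
_, output_indices_umap = nn.kneighbors(embedding)  # row i -> indices of author i's nearest authors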
def preprocess_text(input_strs, filter_regex, stop_words=nltk.corpus.stopwords.words('english')):
    """
    * filter punctuation
    * to_lower
    * remove stop words (from nltk corpus) (taking the most time)
    * replace multiple spaces with one
    * remove leading and trailing spaces
    """