
import os
import gzip
import json
import re
import string
import pprint
import esmre
from collections import defaultdict, deque
from senti_classifier import senti_classifier
import requests
@narulkargunjan
narulkargunjan / package-list
Created July 7, 2014 09:21
A crude package listing for setting up a data guy's R environment
#Below is the list of packages that are typically required for any data guy!
#Beware 1 - THIS IS, BY NO MEANS, A "COMPLETE" LIST, JUST WHAT I FEEL IS APPROPRIATE.
#Beware 2 - MAKE SURE YOUR INTERNET CONNECTION IS FAST AND STAYS UP THE WHOLE TIME.
install.packages("vars")
install.packages("forecast")
install.packages("ggplot2")
install.packages("rattle")
install.packages("caret")
install.packages("e1071")
narulkargunjan / gridsearch_basic
Created July 8, 2014 12:26
Search the parameter space using parallel processing and plot the heatmap
# source: http://statcompute.wordpress.com/2013/06/01/grid-search-for-free-parameters-with-parallel-computing/
library(MASS)
data(Boston)
X <- I(as.matrix(Boston[-14]))
st.X <- scale(X)
Y <- I(as.matrix(Boston[14]))
boston <- data.frame(X = st.X, Y)
# DIVIDE THE WHOLE DATA INTO TWO SEPARATE SETS
set.seed(2013)
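The split that the comment announces (the preview cuts off after `set.seed`) is just a random partition of the rows into a training set and a test set. A minimal sketch in Python rather than the gist's R, with the 70/30 fraction as an assumption:

```python
import random

def train_test_split(rows, test_frac=0.3, seed=2013):
    """Randomly partition rows into (train, test) without replacement."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(rows) * test_frac)
    test_idx = set(idx[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(506))            # the Boston housing data has 506 rows
train, test = train_test_split(rows)
print(len(train), len(test))       # → 355 151
```

Fitting on `train` and measuring error on `test` for each grid point is then what the parallel loop in the linked post distributes across workers.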
narulkargunjan / gridsearch_caret
Created July 8, 2014 12:27
Search parameter grid using Caret
##sources: http://caret.r-forge.r-project.org/training.html, http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
set.seed(107) # set the seed to ensure reproducibility if required
## set up parallel processing as appropriate for your system (ensure "allowParallel"/seeds are configured accordingly)
## for unix/ubuntu etc.:
#library(doMC)
#registerDoMC(cores = 5)
## for windows:
#library(doParallel)
#cl <- makeCluster(5)
#registerDoParallel(cl)

Hive partitioning scheme for dealing with late-arriving data etc.

Over the last few years I've been quite involved with using Hive for big data analysis.

I've read many web tutorials and blog posts about using Hadoop/Hive/Pig for data analysis, but all of them seem oversimplified, targeted at a "my first Hive query" audience instead of showing how to structure Hive tables and queries for real-world use cases: years of data, recurring batch jobs that build aggregate/reporting tables, and dealing with late-arriving data.

Most of these tutorials look something like this:

Twitter data -> HDFS / external Hive table -> Hive query -> results.
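One common way to handle late arrivals (a sketch of the general pattern, not the author's actual schema; all table and column names here are hypothetical) is to partition by both the event date and the load date, so late records land in a fresh load partition and a recurring job only re-aggregates the event dates that the latest load touched:

```sql
-- hypothetical raw table, double-partitioned so late data gets its own load_date partition
CREATE EXTERNAL TABLE tweets_raw (
  tweet_id BIGINT,
  user_id  BIGINT,
  body     STRING
)
PARTITIONED BY (event_date STRING, load_date STRING)
LOCATION '/data/tweets_raw';

-- the recurring aggregate job rebuilds only the event_date partitions
-- touched by today's load, instead of scanning the whole history
INSERT OVERWRITE TABLE tweets_daily PARTITION (event_date)
SELECT user_id, COUNT(*) AS n_tweets, event_date
FROM tweets_raw
WHERE event_date IN (
  SELECT DISTINCT event_date FROM tweets_raw WHERE load_date = '2014-07-08'
)
GROUP BY user_id, event_date;
```

The point is that "years of data" stay immutable: only the small set of partitions affected by a given batch is ever rewritten.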

narulkargunjan / K_Means_Clustering.R
Created August 20, 2017 10:43
Provides simple code for K-means clustering, including choosing the right K and scoring a new dataset into the right clusters.
#read data in r
iris <- read.csv("C:/Users/Ashwin/Desktop/segmentation/CSV Fishers Iris Data.csv")
View(iris)
summary(iris)
head(iris)
# Randomise the row order to make the data a little more realistic
iris<-iris[sample(1:nrow(iris)),]
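The gist's overall approach (pick K by looking at the within-cluster sum of squares, then "score" new observations by assigning them to the nearest centroid) can be sketched language-agnostically. A minimal pure-Python version, not the gist's R code, on a toy two-blob dataset:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid for each point
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        # update step: mean of each cluster (keep old centroid if a cluster empties)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wss(points, centroids, labels):
    """Within-cluster sum of squares -- the quantity plotted in an elbow chart."""
    return sum(dist2(p, centroids[l]) for p, l in zip(points, labels))

# two obvious blobs; the elbow should appear at K = 2
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
for k in (1, 2, 3):
    c, lab = kmeans(pts, k)
    print(k, round(wss(pts, c, lab), 3))

# "scoring" a new observation = assigning it to the nearest learned centroid
c2, _ = kmeans(pts, 2)
new_point = (4.8, 5.1)
nearest = min(range(2), key=lambda j: dist2(new_point, c2[j]))
```

In the R gist the same three steps map to `kmeans()`, plotting `tot.withinss` over K, and nearest-centroid assignment for the new data.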
narulkargunjan / image2image_match.R
Created August 20, 2017 18:17
We can compare our own picture or an image that we found on the web to many types of image databases.
## This code is based on the code of Roald Bradley Severtson:
## https://github.com/Microsoft/microsoft-r/tree/master/microsoft-ml/Samples/PreTrainedModels/ImageAnalytics/ImageFeaturizer
library(MicrosoftML)
## Change NA to the actual location of the script. Use the absolute path.
workingDir <- "C:/Users/redelang/Documents/Code/projects/image_featurizer/image_featurizer"
if (is.na(workingDir)){
stop("The working directory needs to be set to the location of the script.")
narulkargunjan / NLP_Demo.py
Last active August 9, 2023 02:47
Topic Modeling (LDA/Word2Vec) with Spacy
import os
import codecs
data_directory = os.path.join('..', 'data',
                              'yelp_dataset_challenge_academic_dataset')
businesses_filepath = os.path.join(data_directory,
                                   'yelp_academic_dataset_business.json')
with codecs.open(businesses_filepath, encoding='utf_8') as f:
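The Yelp dump that this opens is one JSON record per line. A minimal, self-contained sketch of the parsing step the snippet leads into (the sample records below are made up, not Yelp data):

```python
import io
import json

def iter_records(fileobj):
    """Yield one dict per line from a JSON-lines file, skipping blank lines."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

# stand-in for codecs.open(businesses_filepath, encoding='utf_8')
fake_file = io.StringIO(
    '{"business_id": "b1", "categories": ["Restaurants"]}\n'
    '{"business_id": "b2", "categories": ["Bars", "Restaurants"]}\n'
)

restaurant_ids = [r['business_id'] for r in iter_records(fake_file)
                  if 'Restaurants' in r.get('categories', [])]
print(restaurant_ids)  # → ['b1', 'b2']
```

Filtering to one category this way is the usual first step before feeding review text into spaCy for the topic-modeling stage.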
narulkargunjan / HappyBase_Sample.py
Created September 1, 2017 13:59
Sample HappyBase script for accessing HBase using Python
import csv
import happybase
import time
batch_size = 1000
host = "0.0.0.0"
file_path = "Request_for_Information_Cases.csv"
namespace = "sample_data"
row_count = 0
start_time = time.time()
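The variables above set up a common pattern: buffer rows from the CSV and flush them to HBase every `batch_size` puts. A sketch of that batching logic with the HBase call stubbed out so it runs anywhere (in real happybase code, `table.batch(batch_size=...)` does this flushing for you):

```python
class Batcher:
    """Buffer puts and flush every batch_size rows; the send callable
    stands in for the actual HBase write."""

    def __init__(self, send, batch_size=1000):
        self.send = send              # receives a list of (row_key, data) pairs
        self.batch_size = batch_size
        self.buffer = []

    def put(self, row_key, data):
        self.buffer.append((row_key, data))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

sent = []
b = Batcher(sent.append, batch_size=2)
for i in range(5):
    b.put('row-%d' % i, {'cf:col': str(i)})
b.flush()  # don't forget the final partial batch
print([len(chunk) for chunk in sent])  # → [2, 2, 1]
```

Batching matters here because each HBase round trip has fixed overhead; sending 1000 rows per Thrift call rather than one is what makes the timing numbers (`start_time`) reasonable.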
narulkargunjan / python27_installation.sh
Created September 6, 2017 13:42
Installing Python 2.7 on CentOS, which already depends on Python 2.6
# Run as root
yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel
cd /opt
# --no-check-certificate is optional
wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
./configure --prefix=/usr/local
# "make altinstall" installs as python2.7 and leaves the system python2.6 untouched
make && make altinstall