Shashank Gupta (shashankg7)

@bwhite
bwhite / rank_metrics.py
Created September 15, 2012 03:23
Ranking Metrics
"""Information Retrieval metrics
Useful Resources:
http://www.cs.utexas.edu/~mooney/ir-course/slides/Evaluation.ppt
http://www.nii.ac.jp/TechReports/05-014E.pdf
http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
http://hal.archives-ouvertes.fr/docs/00/72/67/60/PDF/07-busa-fekete.pdf
Learning to Rank for Information Retrieval (Tie-Yan Liu)
"""
import numpy as np
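
The preview stops at the numpy import; the gist itself goes on to define the usual IR metrics (MRR, precision@k, MAP, DCG/NDCG). To give a flavor, here is a minimal editorial sketch of DCG@k and NDCG@k over a vector of graded relevance scores, reusing the numpy import above — not necessarily the gist's exact code:

def dcg_at_k(r, k):
    """Discounted cumulative gain at rank k for relevance scores r."""
    r = np.asarray(r, dtype=float)[:k]
    if r.size == 0:
        return 0.0
    # Gain at rank i is discounted by log2(i + 1), with ranks starting at 1.
    return np.sum(r / np.log2(np.arange(2, r.size + 2)))

def ndcg_at_k(r, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(r, reverse=True), k)
    return dcg_at_k(r, k) / ideal if ideal > 0 else 0.0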
@emaadmanzoor
emaadmanzoor / ExpandEdinburghFSDCorpus.md
Last active October 31, 2020 20:30
Expand the Edinburgh Twitter FSD corpus

Expand The Edinburgh Twitter FSD Corpus

The Python scripts attached here take care of the following tedious work, and should help one quickly get started with some real work on the corpus (a sketch of the throttle-and-resume loop follows the list):

  • Respect the Twitter API rate limits and throttle API hits.
  • Don't hit the API for already-expanded tweet IDs, so you can resume tweet expansion after stopping midway.
  • Parse the API response and dump it into the correct column in the sqlite3 database.
  • Gracefully handle exceptions while acquiring tweets from the API.
  • Wrap version 1.1 of the Twitter API.
  • Start from a specified tweet ID, assuming the input file is sorted in increasing order of tweet ID.
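
A minimal sketch of the throttle-and-resume loop described above (the endpoint is Twitter API v1.1's statuses/show; the database schema, request budget, and function name are hypothetical — the gist's attached scripts differ):

import sqlite3
import time
import requests

RATE_WINDOW = 15 * 60   # v1.1 rate-limit window in seconds
MAX_HITS = 180          # hypothetical per-window request budget

def expand(tweet_ids, bearer_token, db_path="tweets.db"):
    conn = sqlite3.connect(db_path)
    # Resume support: skip IDs whose JSON column is already filled in.
    done = {row[0] for row in conn.execute(
        "SELECT id FROM tweets WHERE json IS NOT NULL")}
    hits = 0
    for tid in tweet_ids:
        if tid in done:
            continue
        if hits >= MAX_HITS:          # throttle: sleep out the window
            time.sleep(RATE_WINDOW)
            hits = 0
        try:
            r = requests.get(
                "https://api.twitter.com/1.1/statuses/show.json",
                params={"id": tid},
                headers={"Authorization": "Bearer " + bearer_token})
            conn.execute("UPDATE tweets SET json = ? WHERE id = ?",
                         (r.text, tid))
            conn.commit()
        except requests.RequestException:
            continue                  # gracefully skip failed fetches
        hits += 1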
@yanofsky
yanofsky / LICENSE
Last active October 17, 2024 22:49
A script to download all of a user's tweets into a csv
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and successors.
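
The script itself (the text above is the Unlicense) pages backwards through a user's timeline with tweepy and writes the rows to a CSV. A condensed sketch of that approach — credentials and field choices are placeholders, not the gist's exact code:

import csv
import tweepy

def dump_tweets(screen_name, consumer_key, consumer_secret,
                access_key, access_secret):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # The API caps each page at 200 tweets; page backwards by asking
    # for tweets older than the oldest one fetched so far.
    tweets = api.user_timeline(screen_name=screen_name, count=200)
    page = tweets
    while page:
        oldest = tweets[-1].id - 1
        page = api.user_timeline(screen_name=screen_name, count=200,
                                 max_id=oldest)
        tweets.extend(page)

    with open("%s_tweets.csv" % screen_name, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows([t.id_str, t.created_at, t.text] for t in tweets)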
@waleking
waleking / SparkGibbsLDA.scala
Last active January 31, 2020 11:15
We implement Gibbs sampling for LDA on Spark. This version performs much better than the alpha version and can now handle 3,196,204 words, 100 topics, and 1,000 sampling iterations on a server in 161.7 minutes. To avoid the slow collect() step of the alpha version, we use cache() at lines 261 and 262. We also solve a pile o…
package topic
import spark.broadcast._
import spark.SparkContext
import spark.SparkContext._
import spark.RDD
import spark.storage.StorageLevel
import scala.util.Random
import scala.math.{ sqrt, log, pow, abs, exp, min, max }
import scala.collection.mutable.HashMap
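
Only the imports survive in this preview. As a concrete picture of the underlying technique (not the gist's distributed Spark code), here is a minimal single-machine collapsed Gibbs sampling sweep for LDA in plain Python/numpy; all names are illustrative:

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, V):
    """One full sweep of collapsed Gibbs sampling for LDA.

    docs: list of word-id lists; z: per-token topic assignments;
    n_dk, n_kw, n_k: count matrices kept in sync with z; V: vocab size.
    """
    K = n_k.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Full conditional p(z = k | everything else), up to a constant.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = np.random.choice(K, p=p / p.sum())
            # Record the freshly sampled assignment.
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1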
@clemsos
clemsos / gensim_workflow.py
Last active February 22, 2022 11:09
How to calculate TF-IDF similarity matrix of a complete corpus with Gensim
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
This script just shows the basic workflow for computing a TF-IDF similarity matrix with Gensim
OUTPUT :
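
The snippet is cut off at the OUTPUT note. The basic workflow it refers to looks roughly like this with Gensim's public API (raw_documents is a placeholder for whatever corpus you load):

from gensim import corpora, models, similarities

texts = [doc.lower().split() for doc in raw_documents]

dictionary = corpora.Dictionary(texts)                 # token -> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

tfidf = models.TfidfModel(corpus)                      # fit IDF weights
index = similarities.SparseMatrixSimilarity(
    tfidf[corpus], num_features=len(dictionary))

sims = index[tfidf[corpus]]   # pairwise cosine-similarity matrix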
@macks22
macks22 / pmf-and-modified-bpmf-pymc.py
Last active May 13, 2021 13:37
Probabilistic Matrix Factorization (PMF) + Modified Bayesian PMF (mBPMF)
"""
Implementations of:
Probabilistic Matrix Factorization (PMF) [1],
Bayesian PMF (BPMF) [2],
Modified BPMF (mBPMF)
using `pymc3`. mBPMF is, to my knowledge, my own creation. It is an attempt
to circumvent the limitations of `pymc3` with regard to the Wishart distribution:
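
The preview ends at that colon. For context, PMF reduces to Gaussian priors over the latent factor matrices and a Gaussian likelihood over the observed ratings; a minimal pymc3 sketch follows, with dummy data, illustrative dimensions, and precisions that need not match the gist's choices:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

rng = np.random.default_rng(0)
R = rng.random((50, 40))                 # dummy ratings matrix
R[rng.random(R.shape) < 0.3] = np.nan    # hide 30% as unobserved
n, m, dim = R.shape[0], R.shape[1], 10

with pm.Model() as pmf:
    # Zero-mean spherical Gaussian priors on user and item factors.
    U = pm.Normal('U', mu=0.0, sd=1.0, shape=(n, dim))
    V = pm.Normal('V', mu=0.0, sd=1.0, shape=(m, dim))
    # Gaussian likelihood; masked entries are treated as missing.
    R_obs = pm.Normal('R_obs', mu=tt.dot(U, V.T), sd=0.5,
                      observed=np.ma.masked_invalid(R))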
@julienr
julienr / sklearn_classif_report_to_latex.py
Created October 26, 2015 16:04
Parse and convert scikit-learn classification_report to latex
"""
Code to parse sklearn classification_report
"""
##
import sys
import collections
##
def parse_classification_report(clfreport):
    """
    Parse a sklearn classification report into a dict keyed by class name
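    and containing (precision, recall, fscore, support) tuples.

    The preview is truncated here; the body below is an editorial
    sketch of the parsing step, not necessarily the gist's exact code.
    """
    lines = [line for line in clfreport.split('\n') if line.strip()]
    data = collections.OrderedDict()
    for line in lines[1:]:            # skip the header row
        tokens = line.strip().split()
        name = ' '.join(tokens[:-4])  # class names may contain spaces
        data[name] = tuple(float(tok) for tok in tokens[-4:])
    return data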
@bishboria
bishboria / springer-free-maths-books.md
Last active October 3, 2024 09:17
Springer made a bunch of books available for free; these were the direct links
@joshloyal
joshloyal / ngram_cnn.py
Created March 11, 2016 15:29
Convolutional Network for Sentence Classification (Keras)
from keras.models import Graph
from keras.layers import containers
from keras.layers.core import Dense, Dropout, Activation, Reshape, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution2D, MaxPooling2D
def ngram_cnn(n_vocab, max_length, embedding_size, ngram_filters=[2, 3, 4, 5], n_feature_maps=100, dropout=0.5, n_hidden=15):
"""A single-layer convolutional network using different n-gram filters.
Parameters
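    ----------
    n_vocab : int
        vocabulary size
    max_length : int
        maximum input sequence length
    embedding_size : int
        dimensionality of the word embeddings
    ngram_filters : list of int
        n-gram widths of the convolution filters
    n_feature_maps : int
        number of feature maps per n-gram width
    dropout : float
        dropout rate
    n_hidden : int
        size of the final hidden layer

    (Parameter names come from the signature above; the descriptions are
    editorial, since the gist's own docstring is truncated in this preview.)
    """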
@wjohnson
wjohnson / recsys-pyspark.py
Last active June 17, 2020 10:01
Using Pyspark's ALS Matrix Factorization Model for RecSys
#Get the data here http://grouplens.org/datasets/movielens/
movielens = sc.textFile("../in/ml-100k/u.data")
movielens.first() #u'196\t242\t3\t881250949'
movielens.count() #100000
#Clean up the data by splitting it
#Movielens readme says the data is split by tabs and
#is user product rating timestamp
clean_data = movielens.map(lambda x:x.split('\t'))
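
The preview stops after the split; a hedged sketch of how training would typically continue with MLlib's ALS API (rank, iteration count, and the example user ID are illustrative):

from pyspark.mllib.recommendation import ALS, Rating

#Convert the split fields into (user, product, rating) tuples
ratings = clean_data.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
#Train the ALS matrix factorization model
model = ALS.train(ratings, rank=10, iterations=10)
#Recommend 5 products for user 196
print(model.recommendProducts(196, 5))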