Kory Becker primaryobjects

@primaryobjects
primaryobjects / mmds-q1.R
Last active March 16, 2020 22:11
Mining Massive Datasets Quiz 1
# Q1
#
# Suppose we compute PageRank with a β of 0.7, and we introduce the additional constraint that the sum of the PageRanks of the three pages must be 3, to handle the problem that otherwise any multiple of a solution will also be a solution. Compute the PageRanks a, b, and c of the three pages A, B, and C, respectively. Then, identify from the list below, the true statement.
#
# Matrix
#
#     A    B    C
# A   0    0    0
# B   0.5  0    0
# C   0.5  1    1
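A minimal sketch of one way to solve Q1 in base R. It assumes each column of the matrix lists where that page's rank flows (so every column sums to 1) and that the teleport share 1 - β is split evenly across the three pages. The names `M`, `beta`, `total`, and `r` are illustrative and not taken from the original gist.

```r
# Column-stochastic transition matrix from the question (rows and columns are A, B, C).
M <- matrix(c(0,   0, 0,
              0.5, 0, 0,
              0.5, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c('A', 'B', 'C'), c('A', 'B', 'C')))

beta  <- 0.7        # probability of following a link
n     <- nrow(M)
total <- 3          # constraint: the PageRanks must sum to 3

# r = beta * M r + (1 - beta) * total / n, rewritten as a linear system:
# (I - beta * M) r = (1 - beta) * total / n
teleport <- rep((1 - beta) * total / n, n)
r <- solve(diag(n) - beta * M, teleport)
names(r) <- c('a', 'b', 'c')

print(r)   # a = 0.3, b = 0.405, c = 2.295
```

Solving the linear system directly avoids iterating; with only three pages, power iteration would converge to the same values.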
primaryobjects / mmds-q2a.R
Last active November 7, 2016 10:14
Mining Massive Datasets Quiz 2a: LSH (Basic)
#
# Quiz 2a
#
#
# Q1
# The edit distance is the minimum number of character insertions and character deletions required to turn one string into another. Compute the edit distance between each pair of the strings he, she, his, and hers. Then, identify which of the following is a true statement about the number of pairs at a certain edit distance.
#
packages <- c('combinat', 'stringdist')
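A sketch of the insert/delete-only edit distance in base R, so it runs without the packages listed above. It relies on the identity that this distance equals the two string lengths minus twice the length of their longest common subsequence. The helper names `lcs_length` and `edit_distance` are hypothetical, not from the original gist.

```r
# Length of the longest common subsequence, by dynamic programming.
lcs_length <- function(a, b) {
  x <- strsplit(a, '')[[1]]
  y <- strsplit(b, '')[[1]]
  tab <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[length(x) + 1, length(y) + 1]
}

# Insertions and deletions only: every character outside the LCS costs one edit.
edit_distance <- function(a, b) nchar(a) + nchar(b) - 2 * lcs_length(a, b)

words <- c('he', 'she', 'his', 'hers')
pairs <- t(combn(words, 2))                      # all 6 unordered pairs
distances <- mapply(edit_distance, pairs[, 1], pairs[, 2])

print(data.frame(first = pairs[, 1], second = pairs[, 2], distance = unname(distances)))
print(table(distances))                          # number of pairs at each distance
```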
primaryobjects / mmds-q3a.R
Created September 28, 2015 21:08
Mining Massive Datasets - adjacency matrix, degree matrix, laplacian matrix.
Q1
    C -- D -- E
  / |    |    | \
A   |    |    |   B
  \ |    |    | /
    F -- G -- H
Write the adjacency matrix A, the degree matrix D, and the Laplacian matrix L. For each, find the sum of all entries and the number of nonzero entries. Then identify the true statement from the list below.
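A sketch of the three matrices in base R. The edge list is an assumption read off the diagram (A-C, A-F, B-E, B-H, plus the 2x3 grid C-D-E over F-G-H with vertical edges C-F, D-G, E-H); adjust it if the intended graph differs. The names `nodes`, `edges`, and `stats` are illustrative.

```r
nodes <- c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H')

# Assumed undirected edges, one per row.
edges <- rbind(c('A', 'C'), c('A', 'F'), c('B', 'E'), c('B', 'H'),
               c('C', 'D'), c('D', 'E'), c('F', 'G'), c('G', 'H'),
               c('C', 'F'), c('D', 'G'), c('E', 'H'))

# Adjacency matrix: fill both directions so the matrix is symmetric.
A <- matrix(0, nrow = length(nodes), ncol = length(nodes),
            dimnames = list(nodes, nodes))
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Degree matrix: node degrees on the diagonal.
D <- diag(rowSums(A))
dimnames(D) <- list(nodes, nodes)

# Laplacian matrix.
L <- D - A

# Sum of entries and number of nonzero entries for each matrix.
stats <- sapply(list(A = A, D = D, L = L),
                function(m) c(sum = sum(m), nonzero = sum(m != 0)))
print(stats)
```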
primaryobjects / mmds-q4a.R
Created October 7, 2015 20:32
Mining Massive Datasets Q4a - normalizing ratings, cosine distance, recommender systems, collaborative filtering
# Q1
# Here is a table of 1-5 star ratings for five movies (M, N, P, Q, R) by three raters (A, B, C).
#   M N P Q R
# A 1 2 3 4 5
# B 2 3 2 5 3
# C 5 5 5 3 2
# Normalize the ratings by subtracting the average for each row and then subtracting the average for each column in the resulting table. Then, identify the true statement about the normalized table.
# First, set up the data.
ratings <- data.frame(M = c(1, 2, 5), N = c(2, 3, 5), P = c(3, 2, 5), Q = c(4, 5, 3), R = c(5, 3, 2))
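One possible way to carry out the two normalization steps, sketched with base R's `sweep`. The data frame is repeated so the block runs on its own; the names `m`, `row_centered`, and `normalized` are illustrative and not from the original gist.

```r
ratings <- data.frame(M = c(1, 2, 5), N = c(2, 3, 5), P = c(3, 2, 5),
                      Q = c(4, 5, 3), R = c(5, 3, 2))
rownames(ratings) <- c('A', 'B', 'C')
m <- as.matrix(ratings)

# Step 1: subtract each rater's (row) average.
row_centered <- sweep(m, 1, rowMeans(m))

# Step 2: subtract each movie's (column) average of the row-centered table.
normalized <- sweep(row_centered, 2, colMeans(row_centered))

print(round(normalized, 3))
```

After both steps, every row and every column of `normalized` sums to zero (up to floating-point error), which is a quick sanity check on the result.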
primaryobjects / similar-sentences.R
Last active October 12, 2015 20:42
Mining Massive Datasets Programming Assignment 2: Finding Similar Sentences
#
# MMDS Programming Assignment 2: Finding Similar Sentences
# This assignment is an optional challenge and it won't count in your final grade.
# Your task is to quickly find the number of pairs of sentences that are at a word-level edit distance of at most 1. Two sentences S1 and S2 are at edit distance 1 if S1 can be transformed into S2 by adding, removing, or substituting a single word.
# For example, consider the following sentences where each letter represents a word:
# S1: A B C D
# S2: A B X D
# S3: A B C
# S4: A B X C
# Then the following pairs of sentences are at word edit distance 1 or less: (S1, S2), (S1, S3), (S2, S4), (S3, S4).
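A sketch of the pairwise check on the four example sentences. It only tests whether two word vectors are within one word-level edit; it compares every pair by brute force and does not attempt the speed the assignment asks for on a large input. The function name `within_one_edit` is hypothetical.

```r
# TRUE if word vectors a and b differ by at most one added, removed, or substituted word.
within_one_edit <- function(a, b) {
  if (length(a) > length(b)) { tmp <- a; a <- b; b <- tmp }   # make a the shorter one
  if (length(b) - length(a) > 1) return(FALSE)
  if (length(a) == length(b)) return(sum(a != b) <= 1)        # at most one substitution
  # b has exactly one extra word: skip it at the first mismatch and compare the rest.
  i <- which(a != b[seq_along(a)])[1]
  if (is.na(i)) return(TRUE)                                  # the extra word is the last one
  identical(a[i:length(a)], b[(i + 1):length(b)])
}

sentences <- list(S1 = c('A', 'B', 'C', 'D'),
                  S2 = c('A', 'B', 'X', 'D'),
                  S3 = c('A', 'B', 'C'),
                  S4 = c('A', 'B', 'X', 'C'))

idx <- combn(names(sentences), 2)
is_close <- apply(idx, 2, function(p) within_one_edit(sentences[[p[1]]], sentences[[p[2]]]))

print(t(idx[, is_close]))   # (S1, S2), (S1, S3), (S2, S4), (S3, S4)
```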
primaryobjects / mmds-q5b.R
Created October 13, 2015 14:59
Mining Massive Datasets: Clustering, k-means, balance algorithm, bipartite graph.
#
# Mining Massive Datasets Quiz 5B
#
# Q1
# We wish to cluster the following set of points into 10 clusters. We initially choose each of the green points (25,125), (44,105), (29,97), (35,63), (55,63), (42,57), (23,40), (64,37), (33,22), and (55,20) as a centroid. Assign each of the gold points to their nearest centroid. (Note: the scales of the horizontal and vertical axes differ, so you really need to apply the formula for distance of points; you can't just "eyeball" it.) Then, recompute the centroids of each of the clusters. Do any of the points then get reassigned to a new cluster on the next round? Identify the true statement in the list below. Each statement refers either to a centroid AFTER recomputation of centroids (precise to one decimal place) or to a point that gets reclassified.
# Set up the data.
centroids <- t(data.frame(c(25,125), c(44,105), c(29,97), c(35,63), c(55,63), c(42,57), c(23,40), c(64,37), c(33,22), c(55,20)))
points <- t(data.frame(c(28,145), c(65,140), c(50,130), c(55,118), c(38,115
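A sketch of the assignment step only, in base R. The list of gold points is cut off in the preview above, so this block uses just the five points that are visible and does not guess the rest; the resulting clusters are therefore not the quiz answer. The names `pts` and `nearest` are illustrative.

```r
# The ten initial centroids from the question, one per row.
centroids <- rbind(c(25, 125), c(44, 105), c(29, 97), c(35, 63), c(55, 63),
                   c(42, 57), c(23, 40), c(64, 37), c(33, 22), c(55, 20))

# Only the five gold points visible in the truncated preview.
pts <- rbind(c(28, 145), c(65, 140), c(50, 130), c(55, 118), c(38, 115))

# For each point, the index of the centroid with the smallest squared Euclidean distance.
nearest <- apply(pts, 1, function(p) {
  which.min(colSums((t(centroids) - p)^2))
})

print(cbind(pts, centroid = nearest))
```

With the full point set, the recomputed centroid of each cluster would be the column means of the points assigned to it, for example `colMeans(pts[nearest == 1, , drop = FALSE])`.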
primaryobjects / nchoosek.txt
Created October 13, 2015 15:44
n choose k
n choose k
n! / (k! * (n - k)!)
How many pairs in: x,y,a,b,c,d,e,f
8C2 = 8! / (2! * (8 - 2)!) = 40320 / (2 * 6!) = 40320 / (2 * 720) = 40320 / 1440 = 28
xy,xa,xb,xc,xd,xe,xf
ya,yb,yc,yd,ye,yf
ab,ac,ad,ae,af
bc,bd,be,bf
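The same count can be checked in base R, as a small sketch: `choose` gives the number of combinations, the factorial formula gives the same value, and `combn` enumerates the pairs. The variable names are illustrative.

```r
items <- c('x', 'y', 'a', 'b', 'c', 'd', 'e', 'f')

# Number of ways to pick 2 of the 8 items.
n_pairs <- choose(length(items), 2)

# The same value from the factorial formula n! / (k! * (n - k)!).
by_formula <- factorial(8) / (factorial(2) * factorial(8 - 2))

# Enumerate every pair as a two-letter string.
pairs <- combn(items, 2, FUN = paste, collapse = '')

print(n_pairs)   # 28
print(pairs)
```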
primaryobjects / mmds-q6a.R
Created October 19, 2015 18:08
Mining Massive Datasets - Map Reduce 6a
# mmds Week6A
# Q1
# Using the matrix-vector multiplication described in Section 2.3.1, applied to the matrix and vector:
#  1  2  3  4
#  5  6  7  8
#  9 10 11 12
# 13 14 15 16
#
# 1
# 2
primaryobjects / mmds-q7a.R
Created October 26, 2015 15:35
Mining Massive Datasets - 7a LSH Family, Hash Functions
#
# Q1
# Suppose we have an LSH family h of (d1,d2,.6,.4) hash functions. We can use three functions from h and the AND-construction to form a (d1,d2,w,x) family, and we can use two functions from h and the OR-construction to form a (d1,d2,y,z) family. Calculate w, x, y, and z, and then identify the correct value of one of these in the list below.
#
val1 <- .6
val2 <- .4
# AND construction
w <- val1 ^ 3
x <- val2 ^ 3
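The preview stops after the AND-construction. A sketch of both constructions follows, assuming the standard rules: an AND of k functions turns probability p into p^k, and an OR of k functions turns it into 1 - (1 - p)^k. The helper names `and_construction` and `or_construction` are hypothetical.

```r
val1 <- .6
val2 <- .4

# AND-construction: all k hash functions must agree.
and_construction <- function(p, k) p ^ k

# OR-construction: at least one of the k hash functions must agree.
or_construction <- function(p, k) 1 - (1 - p) ^ k

w <- and_construction(val1, 3)   # 0.216
x <- and_construction(val2, 3)   # 0.064
y <- or_construction(val1, 2)    # 0.84
z <- or_construction(val2, 2)    # 0.64

print(c(w = w, x = x, y = y, z = z))
```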
primaryobjects / friendlyUrl.R
Last active November 14, 2015 00:02
R method to convert a sentence into a friendly url.
#
# Converts text into a friendly url.
# Examples:
# friendlyUrl('She sells seashells by the seashore.') -> "she-sells-seashells-by-the-seashore"
# friendlyUrl('Learn To Program: 1,000 "Languages" in 2015.') -> "learn-to-program-1-000-languages-in-2015"
#
friendlyUrl <- function(text, sep = '-', max = 80) {
  # Replace non-alphanumeric characters.
  url <- gsub('[^A-Za-z0-9]', sep, text)
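The rest of the function is cut off in the preview. The following is a separate sketch, not the author's implementation, that reproduces the two documented examples. It is given a different name, `friendly_url_sketch`, and assumes that `max` is a maximum length in characters and that runs of non-alphanumeric characters collapse to a single separator.

```r
friendly_url_sketch <- function(text, sep = '-', max = 80) {
  # Replace each run of non-alphanumeric characters with one separator.
  url <- gsub('[^A-Za-z0-9]+', sep, text)
  url <- tolower(url)

  # Drop a leading separator, truncate (assumed meaning of max), then drop a trailing one.
  if (nchar(sep) > 0 && startsWith(url, sep)) {
    url <- substring(url, nchar(sep) + 1)
  }
  url <- substr(url, 1, max)
  if (nchar(sep) > 0 && endsWith(url, sep)) {
    url <- substr(url, 1, nchar(url) - nchar(sep))
  }
  url
}

print(friendly_url_sketch('She sells seashells by the seashore.'))
print(friendly_url_sketch('Learn To Program: 1,000 "Languages" in 2015.'))
```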