@selfboot
Created November 8, 2015 07:50
Mining Massive Datasets Quiz Week7A Advanced.
# First, we index a string of length L on the symbols appearing in its prefix of length floor(0.2L+1). Thus, strings of length 5 and 6 are indexed on their first two symbols, while strings of length 4 are indexed on their first symbol only. The index for a therefore consists of {s1, s2, s4, s8}; the index for b consists of {s1, s3, s5, s6}; the index for c consists of {s2, s3, s5, s7}; and no other symbol is indexed at all.
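# The indexing rule above can be sketched as follows. This is only an illustration: the eight strings below are hypothetical, chosen so that their lengths and leading symbols reproduce the indexes listed above, and the helper names prefix_length and build_index are made up for this sketch.

```python
from collections import defaultdict


def prefix_length(length):
    # floor(0.2 * L + 1) equals L // 5 + 1 for a non-negative integer L;
    # integer arithmetic avoids floating-point rounding.
    return length // 5 + 1


def build_index(strings):
    # Map each prefix symbol to the set of string names indexed under it.
    index = defaultdict(set)
    for name, text in strings.items():
        for symbol in text[:prefix_length(len(text))]:
            index[symbol].add(name)
    return dict(index)


# Hypothetical strings (an assumption, not taken from the text above).
strings = {
    "s1": "abcef", "s2": "acdeg", "s3": "bcdefg", "s4": "adfg",
    "s5": "bcdfgh", "s6": "bceg", "s7": "cdfg", "s8": "abcd",
}

index = build_index(strings)
for symbol in sorted(index):
    print(symbol, sorted(index[symbol]))
```

# With these assumed strings, only a, b, and c end up as index keys, matching the description above.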
# For s1, we examine the indexes for a and b, which together contain all strings but s7. Thus, s1 is compared with six other strings.
# For s3, we examine the indexes for b and c, which together contain s1, s2, s3, s5, s6, and s7. Thus, s3 is compared with five other strings.
# For s6, we examine only the index for b. Thus, s6 is compared only with the three other strings s1, s3, and s5.
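# The three counts can be checked with a short sketch. The index contents and the probe symbols for s1, s3, and s6 are copied from the explanation; the names INDEX, PROBE_SYMBOLS, and candidates are illustrative only.

```python
# Index contents as stated in the explanation.
INDEX = {
    "a": {"s1", "s2", "s4", "s8"},
    "b": {"s1", "s3", "s5", "s6"},
    "c": {"s2", "s3", "s5", "s7"},
}

# Prefix symbols used to probe the index for each query string,
# as stated in the explanation.
PROBE_SYMBOLS = {"s1": "ab", "s3": "bc", "s6": "b"}


def candidates(name):
    # Union of the index buckets for the query's prefix symbols,
    # excluding the query string itself.
    found = set()
    for symbol in PROBE_SYMBOLS[name]:
        found |= INDEX.get(symbol, set())
    found.discard(name)
    return found


for name in ("s1", "s3", "s6"):
    others = candidates(name)
    print(name, len(others), sorted(others))
```

# This reproduces the counts of six, five, and three comparisons for s1, s3, and s6 respectively.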