@sumanthprabhu
sumanthprabhu / tfidf.sql
Last active March 29, 2023 14:03
A rough SQL implementation of tf-idf <http://en.wikipedia.org/wiki/Tf%E2%80%93idf>. It assumes you have run the snippet <https://gist.github.com/sumanthprabhu/8066438> to generate a count matrix and saved it as 'tfidf.csv'. The output is a table named "results" containing a normalized score for each term per tag.
-- Load the pipe-delimited count matrix (term|tag|count) into the tfidf table.
load data local infile 'tfidf.csv' into table tfidf
    fields terminated by "|" lines terminated by '\n' (term, tag, count);
DELIMITER //
CREATE PROCEDURE tfidf_applier()
begin
    -- res1 holds the number of distinct tags (documents).
    declare res1 INT;
    set res1 = (select count(distinct tag) from tfidf);
    drop table if exists log_term_table;
    create table log_term_table(term varchar(200), logval decimal(20,5));
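The preview above stops before the scoring logic, so the procedure's exact formula is not visible here. As a rough illustration of the general technique only, the following Python sketch computes tf-idf from (term, tag, count) rows. The function name `tfidf_scores`, the term-frequency normalization (count divided by the tag's total count), and the plain `log(N / df)` inverse document frequency are all assumptions, not the gist's confirmed method.

```python
import math
from collections import defaultdict


def tfidf_scores(rows):
    """Return {(term, tag): score} for an iterable of (term, tag, count) rows.

    Assumption: each (term, tag) pair appears at most once in rows.
    """
    rows = list(rows)
    # Number of distinct tags; this plays the role of res1 in the SQL preview.
    n_tags = len({tag for _, tag, _ in rows})

    tags_per_term = defaultdict(set)  # term -> set of tags that contain it
    total_per_tag = defaultdict(int)  # tag -> sum of all term counts in it
    for term, tag, count in rows:
        tags_per_term[term].add(tag)
        total_per_tag[tag] += count

    scores = {}
    for term, tag, count in rows:
        tf = count / total_per_tag[tag]  # assumed normalization
        idf = math.log(n_tags / len(tags_per_term[term]))
        scores[(term, tag)] = tf * idf
    return scores
```

With this definition, a term that occurs under every tag gets a score of zero, which is the usual tf-idf behaviour for terms that do not discriminate between documents.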
sumanthprabhu / count_matrix.py
Created December 21, 2013 07:22
A MapReduce-style Python script that generates the count matrix. Given a set of input statements and the tags/documents each statement is associated with, it counts the number of times each term in a statement occurs for each tag/document. Prerequisites: each input statement must be pre-processed into the form "term1;term…
'''
Count the number of occurrences of each term for each tag
(or in each document)
Arguments :
Input file where each input line is of the form :
term1;term2;term3.. , associated_tag1;associated_tag2..
Output:
Basically a count matrix with each line of the form
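The docstring is cut off before the output format, so the following is only a minimal in-memory sketch of the counting step, not the gist's map-reduce code. It assumes input lines of the form `term1;term2 , tag1;tag2` as described above, and it assumes a `term|tag|count` output line, inferred from the `load data` statement in tfidf.sql. The names `count_matrix` and `format_rows` are hypothetical.

```python
from collections import Counter


def count_matrix(lines):
    """Count occurrences of each term per tag.

    Each line is assumed to look like: term1;term2;term3 , tag1;tag2
    """
    counts = Counter()
    for line in lines:
        # Split once on the comma separating the terms from the tags.
        terms_part, sep, tags_part = line.strip().partition(",")
        if not sep:
            continue  # skip blank or malformed lines
        terms = [t.strip() for t in terms_part.split(";") if t.strip()]
        tags = [t.strip() for t in tags_part.split(";") if t.strip()]
        # Every term in the statement counts once towards every associated tag.
        for tag in tags:
            for term in terms:
                counts[(term, tag)] += 1
    return counts


def format_rows(counts):
    """Render counts as 'term|tag|count' lines (assumed output format)."""
    return ["{}|{}|{}".format(term, tag, n)
            for (term, tag), n in sorted(counts.items())]
```

A `Counter` keyed by `(term, tag)` stands in for the reduce step here; in a real map-reduce job the mapper would emit each `(term, tag)` pair and the reducer would sum them.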
sumanthprabhu / merge.py
Last active January 1, 2016 00:19
A Python utility script to merge files in a directory. Pass two arguments: 1) the path to the directory containing the files to be merged, and 2) the required number of files after merging.
'''
Combine all files in a directory into the required number of files
'''
import csv
import sys
import os
import time
def fetch(num):
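The preview ends at the signature of `fetch`, so its body is unknown. The sketch below shows one possible way to merge the files of a directory into a fixed number of output files; it is not the gist's implementation. The function name `merge_files`, the separate output directory, the `merged_<i>.txt` naming, and the round-robin grouping are all assumptions made for illustration.

```python
import os


def merge_files(src_dir, out_dir, n_out):
    """Concatenate the files in src_dir into at most n_out files in out_dir.

    Returns the list of output paths that were written.
    """
    if n_out < 1:
        raise ValueError("n_out must be at least 1")

    # Only regular files are merged; subdirectories are ignored.
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    os.makedirs(out_dir, exist_ok=True)

    written = []
    for i in range(n_out):
        group = names[i::n_out]  # round-robin assignment (assumed policy)
        if not group:
            continue  # fewer input files than requested outputs
        out_path = os.path.join(out_dir, "merged_{}.txt".format(i))
        with open(out_path, "w") as out:
            for name in group:
                with open(os.path.join(src_dir, name)) as src:
                    out.write(src.read())
        written.append(out_path)
    return written
```

Writing to a separate output directory keeps the merged files from being picked up as inputs on a second run.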