@sids
sids / html_parser.clj
Created May 6, 2010 05:44
HTML Parsing in Clojure using HtmlCleaner.
(ns in.grok.history.html-parser
  (:require [clojure.contrib.logging :as log])
  (:import [org.htmlcleaner HtmlCleaner]
           [org.apache.commons.lang StringEscapeUtils]))

(defn parse-page
  "Given the HTML source of a web page, parses it and returns the :title
  and the tag-stripped :content of the page. Does not do any encoding
  detection, it is expected that this has already been done."
  [page-src]
require 'whatever/classifier'
classifier = Classifier.new
# training:
classifier.add_document(:text => "blah", :class => :a)
classifier.add_document(:text => "bleh", :class => :b)
# get model; training is automatically finalised
model = classifier.get_model
Engineer by choice, Researcher at heart, Entrepreneur by nature.
Siddhartha has over 4 years of experience working in the broad areas of Information Engineering—Retrieval, Extraction & Management, Machine Learning, Scalability and Cloud Computing; with a focus on application to the World Wide Web.
He started working on Information Retrieval as a researcher while pursuing a Master's degree in Information Technology at IIIT-Bangalore in 2005-06. After graduating, he joined Ziva Software, a mobile search startup, in 2006. He has since worked on various aspects of developing and running Ziva's flagship mobile search engine, Zook, as part of the core 3-member tech team.
Siddhartha's contribution at Ziva has been manifold, including, but not limited to, system architecture for the various sub-systems (crawling, extraction, processing, indexing and search), user experience, API,
Siddhartha blogs on these topics at http://grok.in/ (The Art Of Information Engineering).
sids / domain.htaccess
Created July 11, 2009 09:53
Apache mod_rewrite tricks
### Forcing a particular domain ###
RewriteCond %{HTTP_HOST} !^www\.sids\.in$ [NC]
RewriteRule ^(.*)$ http://www.sids.in/$1 [R=301,L]
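An unescaped `.` in the `RewriteCond` pattern matches any character, so escaping the dots keeps the host check strict. The sketch below illustrates the difference using `grep -E` as a stand-in for Apache's regex engine. That is an assumption made for illustration only (Apache uses PCRE, though `.` and `\.` behave the same way in both). The helper `matches` and the test host names are hypothetical.

```shell
#!/bin/sh
# Stand-in for the RewriteCond test: does the host match the pattern?
# grep -E is used only for illustration; Apache itself uses PCRE.
matches() {
  # $1 = pattern, $2 = host name; -i mirrors the [NC] (no-case) flag
  printf '%s\n' "$2" | grep -Eiq "$1"
}

loose='^www.sids.in$'      # unescaped dots: "." matches any character
strict='^www\.sids\.in$'   # escaped dots: only a literal "." matches

# Compare both patterns against a few sample host names.
for host in www.sids.in WWW.SIDS.IN wwwxsids.in sids.in; do
  if matches "$loose" "$host"; then l=match; else l=no-match; fi
  if matches "$strict" "$host"; then s=match; else s=no-match; fi
  printf '%-12s loose=%-8s strict=%s\n' "$host" "$l" "$s"
done
```

The host `wwwxsids.in` satisfies the loose pattern but not the strict one, so with unescaped dots it would be treated as the canonical host and never redirected.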
sids / uri_escape.pl
Created October 29, 2008 09:10
Perl: URI encoding/decoding
#!/usr/bin/perl
use strict; use warnings;
use URI::Escape;
# http://www.perlhowto.com/encode_and_decode_url_strings
my $string = "Hello world!";
my $encode = uri_escape($string);
print "Original string: $string\n";
sids / Recipe: Tomato Pappu
Created October 21, 2008 14:52
Mom's recipe for Tomato Pappu :)
Recipe: Tomato Pappu
^^^^^^^^^^^^^^^^^^^^
Ingredients
===========
Toor Dal : 1 measure
Tomatoes : 4-5 measures
Turmeric Powder : a pinch
Oil : 1 tea spoon / measure
sids / gist:9850
Created September 10, 2008 08:01
git tips
## setup
git config user.name "Siddhartha Reddy"
git config user.email ""
## setup -- global
git config --global user.name "Siddhartha Reddy"
git config --global user.email ""
git config --global color.branch "auto"
git config --global color.status "auto"
git config --global color.diff "auto"
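A quick way to try the per-repository settings without touching the global configuration is to apply them in a scratch repository and read them back. This is a minimal sketch, assuming `git` and `mktemp` are available. The scratch directory is hypothetical and can be deleted afterwards.

```shell
#!/bin/sh
# Apply per-repository settings in a throwaway repository and read them back.
set -e

repo=$(mktemp -d)          # scratch directory (illustrative only)
git init -q "$repo"
cd "$repo"

# Without --global, these are written to .git/config of this repository only.
git config user.name "Siddhartha Reddy"
git config color.diff "auto"

# Read single values back from the repository-level configuration.
git config --local --get user.name
git config --local --get color.diff

# List everything stored at the repository level.
git config --local --list
```

Settings written without `--global` apply only to the repository in which the command is run; the `--global` variants are stored in the user's own configuration and apply to every repository.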
sids / sshd_config
Created August 13, 2008 09:22
Linux: sshd config tips
# sshd config tips
# add these lines at the end of /etc/ssh/sshd_config
# don't try reverse-dns lookup on connecting clients (this speeds up the initial connection establishment a lot):
UseDNS no
# make sure the client does not get disconnected for being idle (ping every 5 mins):
ClientAliveInterval 300
sids / gist:5209
Created August 13, 2008 07:06
Linux: Setting up auxiliary swap
# setting up auxiliary swap in linux
dd if=/dev/zero of=/auxswap bs=1M count=2048
mkswap /auxswap
swapon /auxswap
# check that it is on (prints a summary of active swap areas):
swapon -s
# turn it off:
sids / README
Created July 24, 2008 12:20
Perl: database handling examples
http://gist.github.com/2125
Examples of database handling in Perl.