@shriphani
Last active January 25, 2016 08:00
;; (:require [clj-xpath.core :refer :all]
;;           [net.cgrand.enlive-html :as html]
;;           [org.bovinegenius.exploding-fish :as uri]
;;           [pegasus.core :refer [crawl]])

(defn crawl-sp-blog-xpaths
  []
  (crawl {:seeds ["http://blog.shriphani.com/feeds/all.rss.xml"]
          :user-agent "Pegasus web crawler"
          :extractor
          (fn [obj]
            ;; ensure that we only extract in domain
            (when (= "blog.shriphani.com"
                     (-> obj :url uri/host))
              (let [url (:url obj)
                    resource (try (-> obj
                                      :body
                                      xml->doc)
                                  (catch Exception e nil))
                    ;; extract the articles
                    articles (map
                              :text
                              (try ($x "//item/link" resource)
                                   (catch Exception e nil)))]
                ;; add extracted links to the supplied object
                (merge obj
                       {:extracted articles}))))
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))