@shriphani
Last active January 25, 2016 08:01
Crawl My Blog Using Enlive Selectors
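Dependency sketch (not part of the original gist): hypothetical Leiningen :dependencies entries for the libraries used below; the version strings are placeholders, so check Clojars for the current releases of each artifact.

;; Hypothetical project.clj :dependencies entries (versions are placeholders):
;;   [pegasus "x.y.z"]
;;   [enlive "x.y.z"]
;;   [org.bovinegenius/exploding-fish "x.y.z"]
;;   [com.github.kyleburton/clj-xpath "x.y.z"]  ;; only needed if the xpath require is restored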
;; NOTE: the namespace name below is arbitrary (not part of the original gist);
;; adjust it to match your project layout.
(ns sp-blog-crawler.core
  (:require ;; [clj-xpath.core :refer :all]  ;; listed in the original gist but unused here
            [net.cgrand.enlive-html :as html]
            [org.bovinegenius.exploding-fish :as uri]
            [pegasus.core :refer [crawl]])
  (:import (java.io StringReader)))

(defn crawl-sp-blog
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor
          (fn [obj]
            ;; obj is the map pegasus hands to the extractor; it carries
            ;; at least the fetched page's :url and :body.
            ;; ensure that we only extract in-domain links
            (when (= "blog.shriphani.com"
                     (-> obj :url uri/host))
              (let [url      (:url obj)
                    resource (-> obj
                                 :body
                                 (StringReader.)
                                 html/html-resource)
                    ;; extract the articles
                    articles (html/select resource
                                          [:article :header :h2 :a])
                    ;; the pagination links
                    pagination (html/select resource
                                            [:ul.pagination :a])
                    a-tags (concat articles pagination)
                    ;; resolve the URLs and stay within the same domain
                    links (filter
                           #(= (uri/host %)
                               "blog.shriphani.com")
                           (map
                            #(->> %
                                  :attrs
                                  :href
                                  (uri/resolve-uri url))
                            a-tags))]
                ;; add extracted links to the supplied object
                (merge obj
                       {:extracted links}))))
          :corpus-size 20                     ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))   ;; store all crawl data in /tmp/sp-blog-corpus/
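A minimal usage sketch (not part of the original gist): assuming the dependencies above are on the classpath and this namespace is loaded, the crawl is started by calling the function; the behavior noted in the comments follows the configuration given in the code.

(comment
  ;; Start the crawl from a REPL. Pegasus begins at the seed URL,
  ;; follows the links returned by the :extractor, stops after
  ;; :corpus-size (20) documents, and writes its crawl data under
  ;; :job-dir (/tmp/sp-blog-corpus).
  (crawl-sp-blog))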