Created
May 7, 2010 08:36
;; in reply to http://www.sids.in/blog/2010/05/06/html-parsing-in-clojure-using-htmlcleaner/
(ns html-parser
  (:require [net.cgrand.enlive-html :as e]))

(defn parse-page
  "Given the HTML source of a web page, parses it and returns the :title
   and the tag-stripped :content of the page. Does not do any encoding
   detection, it is expected that this has already been done."
  [page-src]
  (-> page-src java.io.StringReader. e/html-resource
    (e/at [#{:script :style}] nil)
    (e/let-select [[title] [:title], [body] [:body]]
      {:title (e/text title), :content (e/text body)})))
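
;; Usage sketch, not part of the original gist. The sample HTML string and
;; the name sample-src are made up for illustration. The exact whitespace in
;; :content depends on how the underlying parser treats the input, so no
;; expected output is shown.
(comment
  (def sample-src ; hypothetical input
    "<html><head><title>Hello</title><style>p {color: red}</style></head>
     <body><p>Some <b>bold</b> text.</p><script>alert(1)</script></body></html>")

  ;; Should return a map with :title and :content keys. The <script> and
  ;; <style> elements are removed by e/at before the text is extracted.
  (parse-page sample-src))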
Thanks a ton for posting this, this is so much nicer than using HtmlCleaner. I've put off going through the Enlive tutorial for far too long now; after seeing this little snippet, I'm unwilling to put it off any more.