@arr-ee
Created July 4, 2014 22:16
Tiny Instapaper scraper
(ns instagroup.core
  (:require [net.cgrand.enlive-html :as html]
            [clj-http.client :as http]
            [clj-http.cookies :as cookies]
            [clojure.edn :as edn]
            [clojure.java.io :as io]
            [clojure.string :as s]
            [clojure.pprint :as pp])
  (:import java.io.ByteArrayInputStream)
  (:gen-class))

(def urls {:login "https://www.instapaper.com/user/login"
           :pages "https://www.instapaper.com/u/%d"})

;; creds.edn holds the login form parameters as an EDN map.
;; edn/read-string only reads data; clojure.core/read-string can evaluate code.
(def creds (-> "creds.edn" slurp edn/read-string))
(def cookie-store (cookies/cookie-store))
(defn string-input-stream
  "Returns a ByteArrayInputStream for the given String."
  [^String s]
  (ByteArrayInputStream. (.getBytes s "UTF-8")))
(def articles [:article])
(def article-title [:.article_inner_item :.title_row :.article_title])
(def article-link [:.article_inner_item :.title_meta :.host :a])
(def paginator [:.main_content :.paginate_older])
(defn get-article-id [article] (->> article :attrs :data-article-id Integer/parseInt))
(defn get-article-title [article] (-> article (html/select article-title) first html/text s/trim))
(defn get-article-link [article] (-> article (html/select article-link) first :attrs :href))
(defn process-article [article]
  {:id    (get-article-id article)
   :title (get-article-title article)
   :link  (get-article-link article)})
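(defn articles-seq
  "Returns a lazy seq of article nodes from page-number (default 1) onward."
  ([cookie-store] (articles-seq cookie-store 1))
  ([cookie-store page-number]
   (lazy-seq
     (let [page-articles (-> (format (:pages urls) page-number)
                             (http/get {:cookie-store cookie-store})
                             :body
                             string-input-stream
                             html/html-resource
                             (html/select articles))]
       ;; Assumption: a page past the last one contains no article elements,
       ;; so an empty page ends the sequence instead of fetching forever.
       (when (seq page-articles)
         (concat page-articles
                 (articles-seq cookie-store (inc page-number))))))))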
(defn -main
  "Logs in to Instapaper with creds, keeping the session in cookie-store."
  [& args]
  (http/post (:login urls) {:form-params creds :cookie-store cookie-store}))