Created
October 5, 2020 08:21
Extract HTML tables with babashka and bootleg
(ns scrape
  (:require [babashka.pods :as pods]
            [clojure.walk :as walk]))

(pods/load-pod "bootleg") ;; installed on path, use "./bootleg" for local binary

(require '[babashka.curl :as curl])

;; Fetch the raw HTML of the page as a string.
(def clojure-html (:body (curl/get "https://en.wikipedia.org/wiki/Clojure")))

;; The pod's namespaces only become available after load-pod has run.
(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]])

(def hiccup (convert-to clojure-html :hiccup))

;; Collect every [:table ...] vector found while walking the hiccup tree.
(def tables (atom []))

(walk/postwalk (fn [node]
                 (when (and (vector? node)
                            (= :table (first node)))
                   (swap! tables conj node))
                 node)
               hiccup)

(count @tables) ;; 15
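
The walk above gathers tables through a mutable atom. As a rough sketch, not part of the original gist, the same collection step can be written without side effects using `tree-seq` from `clojure.core`, followed by a small helper that turns a table into rows of cell text. The names `sample-hiccup`, `element?`, `find-elements`, `text` and `table-rows` are hypothetical, and the sample data stands in for the converted Wikipedia page so the snippet runs without the pod or network access.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample standing in for the hiccup produced by convert-to.
(def sample-hiccup
  [:html {}
   [:body {}
    [:p {} "intro"]
    [:table {:class "infobox"}
     [:tbody {}
      [:tr {} [:th {} "Paradigm"] [:td {} "functional"]]
      [:tr {} [:th {} "Family"] [:td {} "Lisp"]]]]
    [:div {}
     [:table {}
      [:tr {} [:td {} "a"] [:td {} "b"]]]]]])

(defn element?
  "True when node is a hiccup vector whose tag is `tag`."
  [tag node]
  (and (vector? node) (= tag (first node))))

(defn find-elements
  "Lazily returns every element tagged `tag`, in document order.
  Only vectors are descended into, so attribute maps are treated as leaves."
  [tag hiccup]
  (filter #(element? tag %) (tree-seq vector? seq hiccup)))

(defn text
  "Joins all string descendants of a node with single spaces."
  [node]
  (->> (tree-seq vector? seq node)
       (filter string?)
       (str/join " ")
       str/trim))

(defn table-rows
  "Returns one vector of cell texts per :tr found anywhere inside `table`.
  Only direct :th/:td children of each row are treated as cells."
  [table]
  (for [tr (find-elements :tr table)]
    (mapv text (filter #(or (element? :th %) (element? :td %)) (rest tr)))))

(def found-tables (find-elements :table sample-hiccup))
```

Calling `(table-rows (first found-tables))` on the sample yields one vector of strings per row. Note that rows of a nested table would also be included in the outer table's result, which may or may not be what a given page needs.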
Enlive is more for transforming, but we can hack it.
Hickory is elegant at selection and extraction: