@bcambel
Created November 3, 2014 14:41
(defn extract-from-urls
  "Takes a directory of tabbed files where URLs are the second field
  (after UUID), fetches xhtml either from dcache or the local cache
  (depending on doc/*use-local-cache*), runs all extractors on the xhtml,
  and writes JSON strings with extractor name/values and URL to json-dir.
  URLs with parse errors are written to parse-error-dir.
  URLs not in dcache are written to cache-miss-dir.
  Other errors are written to trap-dir.
  If out-prefix is present it is prepended to the output paths."
  [url-dir json-dir parse-error-dir cache-miss-dir trap-dir & [out-prefix]]
  (cascalog.io/with-fs-tmp [_ tmp-dir]
    (let [extr-tap        (hfs-seqfile tmp-dir)
          json-tap        (hfs-textline (str out-prefix json-dir))
          parse-error-tap (hfs-textline (str out-prefix parse-error-dir))
          cache-miss-tap  (hfs-textline (str out-prefix cache-miss-dir))]
      (let [extr-query (make-extractor-query url-dir (str out-prefix trap-dir))]
        (?<- extr-tap [?uuid ?url !json !parse-error !cache-miss]
             (extr-query ?uuid ?url !json !parse-error !cache-miss)))
      (?- json-tap
          (<- [?uuid ?url ?json] (extr-tap ?uuid ?url ?json _ _))
          parse-error-tap
          (<- [?uuid ?url ?parse-error] (extr-tap ?uuid ?url _ ?parse-error _))
          cache-miss-tap
          (<- [?uuid ?url] (extr-tap ?uuid ?url _ _ ?cache-miss))))))
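
;; A minimal plain-Clojure sketch (no Cascalog or Hadoop needed) of the routing
;; that the three final queries above appear to perform. In Cascalog, variables
;; prefixed with ? are non-nullable, so a tuple whose matching field is nil is
;; dropped from that query, while !-prefixed variables may be nil.
;; Assumptions: each intermediate tuple is [uuid url json parse-error cache-miss]
;; with nil in the fields that do not apply. `route-tuples` and the sample data
;; are hypothetical and are not part of the original code.
(defn route-tuples
  [tuples]
  {:json        (for [[uuid url json _ _] tuples :when (some? json)]
                  [uuid url json])
   :parse-error (for [[uuid url _ perr _] tuples :when (some? perr)]
                  [uuid url perr])
   :cache-miss  (for [[uuid url _ _ miss] tuples :when (some? miss)]
                  [uuid url])})

(comment
  ;; Each sample tuple should land in exactly one of the three outputs.
  (route-tuples [["u1" "http://a.example" "{\"title\":\"A\"}" nil nil]
                 ["u2" "http://b.example" nil "bad markup" nil]
                 ["u3" "http://c.example" nil nil true]]))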