Created
November 3, 2014 14:41
bcambel/b7058d7ea917d320f81a
(defn extract-from-urls
  "Takes a directory of tabbed files where URLs are the second field
  (after UUID), fetches xhtml either from
  dcache or the local cache (depending on doc/*use-local-cache*),
  runs all extractors on the xhtml,
  and writes JSON strings with extractor name/values and URL to json-dir.
  URLs with parse errors are written to parse-error-dir.
  URLs not in dcache are written to cache-miss-dir.
  Other errors are written to trap-dir.
  If out-prefix is present it is prepended to the output paths."
  [url-dir json-dir parse-error-dir cache-miss-dir trap-dir & [out-prefix]]
  (cascalog.io/with-fs-tmp [_ tmp-dir]
    (let [extr-tap (hfs-seqfile tmp-dir)
          json-tap (hfs-textline (str out-prefix json-dir))
          parse-error-tap (hfs-textline (str out-prefix parse-error-dir))
          cache-miss-tap (hfs-textline (str out-prefix cache-miss-dir))]
      (let [extr-query (make-extractor-query url-dir (str out-prefix trap-dir))]
        (?<- extr-tap [?uuid ?url !json !parse-error !cache-miss]
             (extr-query ?uuid ?url !json !parse-error !cache-miss)))
      (?- json-tap
          (<- [?uuid ?url ?json] (extr-tap ?uuid ?url ?json _ _))
          parse-error-tap
          (<- [?uuid ?url ?parse-error] (extr-tap ?uuid ?url _ ?parse-error _))
          cache-miss-tap
          (<- [?uuid ?url] (extr-tap ?uuid ?url _ _ ?cache-miss))))))
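
The docstring says that out-prefix, when present, is prepended to the output paths. A minimal sketch of just that path handling follows, using plain Clojure with no Cascalog dependency. The helper name `output-paths` and the example directory names are hypothetical and not part of the gist. It relies on `(str nil "x")` returning `"x"`, which is why the optional argument can be omitted safely.

```clojure
;; Hypothetical helper, not part of the gist: it mirrors only the way
;; extract-from-urls builds its output paths.
(defn output-paths
  [json-dir parse-error-dir cache-miss-dir trap-dir & [out-prefix]]
  ;; out-prefix is nil when omitted; str treats nil as the empty string,
  ;; so the directory names pass through unchanged in that case.
  {:json        (str out-prefix json-dir)
   :parse-error (str out-prefix parse-error-dir)
   :cache-miss  (str out-prefix cache-miss-dir)
   :trap        (str out-prefix trap-dir)})

;; Without a prefix the paths are used as given.
(output-paths "json" "parse-errors" "cache-misses" "traps")

;; With a prefix, plain string concatenation is applied.
(output-paths "json" "parse-errors" "cache-misses" "traps" "/tmp/run1/")
```

Because the prefix is concatenated as a string rather than joined as a path, the caller presumably needs to include the trailing separator in out-prefix.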