Created
February 28, 2012 04:11
sunilnandihalli/1929345
problem with lazy join of two large sorted csv files...
(ns pythia.soj.join
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clojure.java.shell :as sh]))
(defn write-csv-record [wrtr record]
  (binding [*out* wrtr]
    (println (apply str (interpose \, record)))))
(defn map-seq-to-csv-with-out-using-write-csv [seq-of-maps output-file]
  (let [[all-keys] (with-open [wrtr (io/writer (str output-file ".tmp"))]
                     (reduce (fn [[c-keys keys-to-id-map] m]
                               (let [[n-keys n-keys-to-id-map :as w] (reduce (fn [[c-keys keys-to-id-map] k]
                                                                               (if (contains? keys-to-id-map k)
                                                                                 [c-keys keys-to-id-map]
                                                                                 [(conj c-keys k)
                                                                                  (assoc keys-to-id-map k (count c-keys))]))
                                                                             [c-keys keys-to-id-map] (keys m))
                                     rec ((apply juxt n-keys) m)]
                                 (write-csv-record wrtr rec) w)) [[] {}] seq-of-maps))]
    (sh/sh "sh" :in (str "echo " (apply str (interpose \, (map name all-keys))) " | cat - " output-file ".tmp >" output-file "; rm -vf " output-file ".tmp;"))))
(defn csv-to-map-seq [fname & {:keys [with-header key-map] :or {with-header false}}]
  (let [row-seq (csv/read-csv (io/reader fname))
        all-keys (let [premapped-keys (if with-header (map keyword (first row-seq)) (range 1 1000))]
                   (if-not key-map premapped-keys
                     (map (fn [k] (if (contains? key-map k) (key-map k) k)) premapped-keys)))]
    (map #(zipmap all-keys %)
         (if with-header (rest row-seq) row-seq))))
(defn lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields
  ([s1 s2 f output-generator]
   (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields s1 s2 f f output-generator))
  ([s1 s2 f1 f2 output-generator]
   (lazy-seq
     (loop [[x & xs :as wx] s1 [y & [yn :as ys] :as wy] s2]
       (let [[xk yk] [(f1 x) (f2 y)]
             ck (compare xk yk)]
         (cond
           (= ck 0) (cons (output-generator x y)
                          (let [nyk (f2 yn)]
                            (if-not (= nyk xk)
                              (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields xs ys f1 f2 output-generator)
                              (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields wx ys f1 f2 output-generator))))
           (< ck 0) (recur xs wy)
           (> ck 0) (recur wx ys)))))))
(defn join-csv-based-on-field-with-only-second-file-allowed-to-have-duplicate-fields [f1 f2 field-key]
  (let [[s1 s2] (map #(csv-to-map-seq % :with-header true) [f1 f2])]
    (lazy-join-sorted-map-seqs-with-only-second-map-seq-allowed-to-have-duplicate-fields s1 s2 field-key merge)))
In my code, the return value of the above function is written directly to a file using map-seq-to-csv-with-out-using-write-csv. I just realized that I had forgotten to mention that.
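For comparison, here is a minimal sketch of writing a lazy sequence of maps to a file one row at a time, without the temporary file and shell step. It is not the gist's approach: `write-rows!` is a hypothetical helper, and it assumes the column keys are known before writing starts (for example, taken from the two CSV headers) and that no field contains a comma or quote that would need escaping.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn write-rows!
  "Hypothetical helper (not part of the gist). Writes a header line built
  from `columns`, then one comma-separated line per map in `rows`.
  Assumes the columns are known up front and no field needs quoting."
  [rows columns out-file]
  (with-open [^java.io.Writer w (io/writer out-file)]
    ;; Header first, so no second pass over the file is needed.
    (.write w (str (str/join "," (map name columns)) "\n"))
    ;; doseq walks the seq one row at a time and does not retain earlier rows.
    (doseq [row rows]
      (.write w (str (str/join "," (map #(get row % "") columns)) "\n")))))
```

Because `rows` is only walked once inside `doseq`, a caller that passes the joined sequence straight in, without binding it to a name first, avoids keeping the whole result in memory.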
May I be the first to say "Holy long function names, Batman!"? Seriously, this is what namespaces are for.
join-csv-based-on-field-with-only-second-file-allowed-to-have-duplicate-fields is the main entry point. f1 and f2 are CSV files with headers, both sorted on field-key.
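The join described here is a merge join over two sorted inputs. As a point of comparison only, the following is a minimal sketch of that idea with an explicit stop condition for when either input runs out. `merge-join` is a hypothetical name, not a function from the gist, and the sketch assumes that keys in the first sequence are unique, that the second may repeat a key, and that both are sorted in the order `compare` uses.

```clojure
(defn merge-join
  "Hypothetical helper (not from the gist). Lazily joins two seqs of maps
  that are both sorted on `key-fn`. Keys in s1 are assumed unique; s2 may
  contain several consecutive rows with the same key."
  [s1 s2 key-fn combine]
  (lazy-seq
    (loop [xs (seq s1) ys (seq s2)]
      ;; Stop as soon as either input is exhausted.
      (when (and xs ys)
        (let [x (first xs)
              y (first ys)
              c (compare (key-fn x) (key-fn y))]
          (cond
            (neg? c) (recur (next xs) ys)   ; x sorts before y: skip x
            (pos? c) (recur xs (next ys))   ; y sorts before x: skip y
            ;; Match: emit one row and keep x, since the next y may share its key.
            :else (cons (combine x y)
                        (merge-join xs (rest ys) key-fn combine))))))))

;; Small in-memory example; the values are strings, as they would be when read from CSV.
(def left  [{:id "1" :a "x"} {:id "2" :a "y"} {:id "4" :a "z"}])
(def right [{:id "2" :b "p"} {:id "2" :b "q"} {:id "3" :b "r"} {:id "4" :b "s"}])
(def joined (merge-join left right :id merge))
```

Since fields read from a CSV file are strings, `compare` orders them lexicographically, so the files would need to be sorted as text rather than numerically for this to line up.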