Last active
December 19, 2015 14:19
Lazy-seqs fail for sequential IO
| ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | |
| ;; | |
| ;; !!!! WARNING !!!! | |
| ;; | |
| ;; This code is intended to be analysed for its flaws. | |
| ;; It does NOT claim to demonstrate good coding practice. | |
| ;; | |
| ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | |
| (require '[clojure.java.io :as io]) | |
| (def in-file "/path/to/datafile-4gb.tsv") | |
| ;; issue 1: line-seq does not close the file it reads from. | |
| ;; You will want to bind the io/reader to a name, | |
| ;; so that it can be closed explicitly after use. | |
| ;; | |
| ;; issue 2: def'ing the lazy line-seq to a var keeps hold of its head, | |
| ;; hence the realized part won't be garbage collected. | |
| ;; | |
| ;; remark: I can count a line-seq'ed file without blowing memory, | |
| ;; with very good performance (16GB with 0.0008 ms/line), | |
| ;; i.e. I can use a lazy-seq for sequential IO on a file > main memory. | |
| ;; Note also that count relies on Java interop, see (source count). | |
| ;; | |
| (def in-seq (line-seq (io/reader in-file))) | |
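| ;; | |
| ;; sketch (not part of the original analysis): one way to avoid issues 1 | |
| ;; and 2 is to open the reader with with-open and consume the line-seq | |
| ;; inside that scope, so the file is closed on exit and no var holds the | |
| ;; head. count-lines is a hypothetical name used only for illustration; | |
| ;; it relies on the io alias required at the top of this file. | |
| ;; | |
| (defn count-lines [path] | |
|   (with-open [rdr (io/reader path)] | |
|     ;; count walks the seq eagerly, so all reading happens before close | |
|     (count (line-seq rdr)))) | |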
| (def out-file "/path/to/datafile-16gb.tsv") | |
| ;; issue 3: really bad mistake: | |
| ;; recursion without recur! | |
| ;; | |
| (defn multiply-string [s n] | |
|   (if (= n 1) | |
|     s | |
|     (str s \newline (multiply-string s (dec n))))) ; bails out for n ~ 10000 | |
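| ;; | |
| ;; sketch (an assumption, not the original approach): the same string can | |
| ;; be built without stack-consuming recursion by joining n copies of s. | |
| ;; repeat-lines is a hypothetical name. For n >= 1 it should match | |
| ;; multiply-string; for n = 0 it returns "" instead of recursing forever. | |
| ;; | |
| (defn repeat-lines [s n] | |
|   (apply str (interpose \newline (repeat n s)))) | |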
| ;; issue 4: doall forces the realization of the entire sequence in memory. | |
| ;; I did not think about dorun, which is exactly for this case. | |
| ;; | |
| ;; issue 5: I found no reasonable way to consume the lazy-seq line-by-line. | |
| ;; Note that I did not take the mechanism of Malcom Sparks' | |
| ;; state machine into account because of its performance penalty. | |
| ;; (BTW: does this imply that Clojure's structural sharing only works | |
| ;; for very small seqs < 250 items?) | |
| ;; If you introduce refs to counter this effect, I'd argue that you'd give | |
| ;; up on lazy-seq. Hence, I turned to explicit iteration. | |
| ;; | |
| (defn multiply-copy [in-seq wrtr n] | |
|   (doall (map #(.write wrtr (str % \newline)) | |
|               (map (fn [x] (multiply-string x n)) in-seq)))) | |
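| ;; | |
| ;; sketch (an assumption, not the original approach): doseq walks a seq | |
| ;; purely for side effects and retains no results, which is one way to | |
| ;; consume a lazy-seq line by line (issues 4 and 5). write-repeated is a | |
| ;; hypothetical name; wrtr is assumed to be a java.io.Writer. This only | |
| ;; helps if the caller does not hold the head of lines elsewhere, e.g. | |
| ;; in a def'ed var. | |
| ;; | |
| (defn write-repeated [lines ^java.io.Writer wrtr n] | |
|   (doseq [line lines | |
|           _ (range n)] ; write each line n times | |
|     (.write wrtr (str line \newline)))) | |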
| ;; issue 6: I did not find a way to use with-open for the input and output file. | |
| ;; If I use with-open on the input file, I can only consume the first | |
| ;; line, after which the input file was closed. | |
| ;; This is why I tried to wrap the lazy-seq into a function, | |
| ;; which inherently led to issue 3. | |
| ;; | |
| ;; issue 7: Originally, I wanted to have the input file name as a parameter. | |
| ;; Passing an in-seq crept in while I tried to use with-open. | |
| ;; | |
| (defn multiply-file [in-seq out-file n] | |
|   (let [wrtr (io/writer out-file)] | |
|     (multiply-copy in-seq wrtr n) | |
|     (.close wrtr))) | |
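| ;; | |
| ;; sketch (an assumption, not the original approach): both files can share | |
| ;; one with-open as long as all the work happens inside its body, so the | |
| ;; lazy seq never escapes the scope in which the reader is open (issues 6 | |
| ;; and 7). multiply-file-by-name is a hypothetical name; it takes file | |
| ;; names instead of a seq and relies on the io alias required at the top. | |
| ;; | |
| (defn multiply-file-by-name [in-path out-path n] | |
|   (with-open [rdr (io/reader in-path) | |
|               ^java.io.Writer wrtr (io/writer out-path)] | |
|     (doseq [line (line-seq rdr) | |
|             _ (range n)] ; write each input line n times | |
|       (.write wrtr (str line \newline))))) | |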
| (time (multiply-file in-seq out-file 4)) | |
| ;; | |
| user> (time (multiply-file in-seq out-file 4)) | |
| OutOfMemoryError GC overhead limit exceeded java.lang.reflect.Method.copy (Method.java:150) |