@fbmnds
Last active December 19, 2015 14:19
Lazy-seqs fail for sequential IO
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; !!!! WARNING !!!!
;;
;; This code is intended to be analysed for its flaws.
;; It does NOT claim to demonstrate good coding practice.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(require '[clojure.java.io :as io])
(def in-file "/path/to/datafile-4gb.tsv")
;; issue 1: line-seq does not close the file it reads from.
;;          Bind the io/reader to a name so that it can be closed
;;          explicitly after use.
;;
;; issue 2: def'ing the lazy line-seq to a var holds on to its head,
;;          so the realized lines can never be garbage collected.
;;
;; remark: I can count a line-seq'ed file without blowing memory and
;;         with very good performance (16GB at 0.0008 ms/line),
;;         i.e. a lazy-seq does work for sequential IO on a file larger
;;         than main memory. Note, however, that count relies on Java
;;         interop, see (source count).
;;
(def in-seq (line-seq (io/reader in-file)))
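;; A minimal sketch for issues 1 and 2, assuming the goal is only to count
;; lines. count-lines is a hypothetical helper, not part of the original code.
;; with-open closes the reader when its body exits, and since the line-seq is
;; never bound to a var, nothing should hold on to its head while count
;; walks it.
(require '[clojure.java.io :as io])

(defn count-lines
  "Counts the lines of the file at path; the reader is closed on exit."
  [path]
  (with-open [rdr (io/reader path)]
    ;; count fully consumes the seq before with-open closes the reader
    (count (line-seq rdr))))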
(def out-file "/path/to/datafile-16gb.tsv")
;; issue 3: really bad mistake:
;; recursion without recur!
;;
(defn multiply-string [s n]
  (if (= n 1)
    s
    (str s \newline (multiply-string s (dec n))))) ; bails out for n ~ 10000
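;; A minimal sketch for issue 3, assuming the same contract as multiply-string
;; (n >= 1, copies of s separated by newlines). multiply-string-iter is a
;; hypothetical name. Instead of rewriting the recursion with loop/recur, it
;; swaps in repeat and clojure.string/join, which do not grow the JVM call
;; stack with n.
(require '[clojure.string :as string])

(defn multiply-string-iter [s n]
  ;; (repeat n s) is a lazy seq of n copies; join concatenates them
  ;; with a newline between adjacent copies.
  (string/join \newline (repeat n s)))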
;; issue 4: doall forces the realization of the entire sequence in memory.
;;          I did not think of dorun, which is meant for exactly this case.
;;
;; issue 5: I found no reasonable way to consume the lazy-seq line by line.
;;          Note that I did not take the mechanism of Malcolm Sparks'
;;          state machine into account because of its performance penalty.
;;          (BTW: does this imply that Clojure's structural sharing only
;;          works for very small seqs < 250 items?)
;;          If you introduce refs to counter this effect, I'd argue that
;;          you'd give up on lazy-seq. Hence, I turned to explicit iteration.
;;
(defn multiply-copy [in-seq wrtr n]
  (doall (map #(.write wrtr (str % \newline))
              (map (fn [x] (multiply-string x n)) in-seq))))
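;; A minimal sketch for issues 4 and 5, assuming wrtr is an open
;; java.io.Writer. multiply-copy-doseq is a hypothetical name. doseq runs the
;; write for each line as a side effect and returns nil, so no result
;; sequence is retained. The inner (range n) binding writes each line n
;; times without first building the n-fold string.
;; Caveat: if the caller still holds the head of lines (e.g. via def),
;; the realized lines stay reachable regardless.
(defn multiply-copy-doseq [lines ^java.io.Writer wrtr n]
  (doseq [line lines
          _    (range n)]
    (.write wrtr (str line "\n"))))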
;; issue 6: I did not find a way to use with-open for the input and output file.
;;          If I use with-open on the input file, I can only consume the first
;;          line, after which the input file is already closed.
;;          This is why I tried to wrap the lazy-seq in a function,
;;          which inherently led to issue 3.
;;
;; issue 7: Originally, I wanted to pass the input file name as a parameter.
;;          Passing an in-seq crept in while I tried to use with-open.
;;
(defn multiply-file [in-seq out-file n]
  (let [wrtr (io/writer out-file)]
    (multiply-copy in-seq wrtr n)
    (.close wrtr)))
(time (multiply-file in-seq out-file 4))
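;; A minimal sketch for issues 6 and 7, assuming both arguments are file
;; paths (or anything io/reader and io/writer accept). multiply-file-by-path
;; is a hypothetical name. The line-seq is created and fully consumed inside
;; with-open, so the reader is still open while lines are read, and both
;; files are closed on exit, even if an exception is thrown.
(require '[clojure.java.io :as io])

(defn multiply-file-by-path [in-path out-path n]
  (with-open [rdr (io/reader in-path)
              ^java.io.Writer wrtr (io/writer out-path)]
    ;; doseq does not retain the head of the line-seq, so lines that have
    ;; already been written can be garbage collected.
    (doseq [line (line-seq rdr)
            _    (range n)]
      (.write wrtr (str line "\n")))))

;; Possible usage with the names defined above:
;; (time (multiply-file-by-path in-file out-file 4))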
;;
user> (time (multiply-file in-seq out-file 4))
OutOfMemoryError GC overhead limit exceeded java.lang.reflect.Method.copy (Method.java:150)