@mountain
Created February 25, 2014 06:47
corpus utilities
(use 'clojure.java.io)
(use 'com.guokr.nlp.seg)
(def corpus-root "/path/to/directory")
(def corpus-dir (clojure.java.io/file corpus-root))
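;; Keep regular files only: file-seq also yields directories (including
;; corpus-dir itself), and the output file must not be read back as input.
(def corpus-files
  (filter #(and (.isFile %) (not= % (clojure.java.io/file corpus-root "corpus.txt")))
          (file-seq corpus-dir)))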
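;; Opens the output file only when called; a writer created at load time
;; would truncate corpus.txt and never be closed.
(defn corpus-writer [] (writer (str corpus-root "/corpus.txt")))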
(defn line-by-line
  "Returns a file handler that writes each input line, transformed by lhdl,
  as its own output line."
  [wrtr lhdl]
  (fn [file]
    (with-open [rdr (reader file)]
      (doseq [line (line-seq rdr)]
        (.write wrtr (str (lhdl line) "\n"))))))
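(defn line-by-file
  "Returns a file handler that writes all lines of a file, transformed by lhdl,
  onto one output line (one output line per input file)."
  [wrtr lhdl]
  (fn [file]
    (with-open [rdr (reader file)]
      (doseq [line (line-seq rdr)]
        ;; Separate lines with a space so adjacent tokens do not run together.
        (.write wrtr (str (lhdl line) " ")))
      ;; End the output line so the next file starts on a new one.
      (.write wrtr "\n"))))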
(defn for-corpus [fhdl] (doseq [file corpus-files] (fhdl file)))
;; Segment every corpus file with seg and write the result to corpus.txt.
(with-open [wrtr (writer (str corpus-root "/corpus.txt"))]
  (for-corpus (line-by-file wrtr seg)))