Created March 1, 2013 11:58
Save leonelag/5064195 to your computer and use it in GitHub Desktop.
;;
;; Utility script to convert files between character encodings.
;;
;; I've used this to fix the character encoding in a messy project where files
;; were encoded with different encodings.
;;
(require '[clojure.java.io :as io])

;; True when the name of file f ends with any of the given suffixes.
(defn has-suffix [f suffixes]
  (some #(.endsWith (.getName f) %)
        suffixes))

;; All .java and .ui.xml files under the project source tree.
(def files
  (filter #(has-suffix % [".java" ".ui.xml"])
          (file-seq (io/file "C:/MyProject/src"))))

;;
;; Basic Latin and Control characters
;; http://en.wikipedia.org/wiki/C0_Controls_and_Basic_Latin
;;
;; Latin1 Supplement
;; http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement
;;
;; Latin characters in Unicode
;; http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
;;
(defn valid-char? [character]
  (let [ch (int character)]
    (or (#{\newline \tab \return} character)
        (<= 0x20 ch 0x3F) ; Space, punctuation marks, digits.
        (<= 0x40 ch 0x7D) ; @, letters, brackets and braces (stops before ~ at 0x7E).
        (#{\à \á \â \ã \ç \è \é \ê \í ; Non-exhaustive list of letters with accents.
           \À \Á \Â \Ã \Ç \È \É \Ê \Í
           \õ \ó \ô \ú \û \ü
           \Õ \Ó \Ô \Ú \Û} character))))

;; Reads f as UTF-8 and returns every character that valid-char? rejects.
(defn invalid-chars [f]
  (let [contents (slurp f :encoding "utf-8")]
    (filter (complement valid-char?)
            contents)))

;; Not used by the report below. Rewrites f in place, reading it with
;; from-encoding and writing it back with to-encoding.
(defn convert-encoding [f from-encoding to-encoding]
  (let [contents (slurp f :encoding from-encoding)]
    (spit f contents :encoding to-encoding)))

;;
;; Prints the codes of invalid characters in a file tree
;;
(doseq [f files]
  (let [invalid (invalid-chars f)]
    (when (seq invalid)
      (println (.getAbsolutePath f)
               (map (fn [ch]
                      [(Integer/toHexString (int ch)) ch])
                    invalid)))))
Clojure script to convert files to a different character encoding.
I've used this in a project whose files were written in Brazilian Portuguese, so the allowed characters are the ones most common in Portuguese, with special regard to accented characters.
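As an illustration of the allowed-character idea, here is a minimal stand-alone sketch. The names `accented-letters`, `allowed-char?` and `suspicious-chars` are hypothetical and not part of the script; the sketch merges the script's two ASCII ranges into one and works on strings rather than files. It relies on the fact that Java's readers substitute U+FFFD for bytes that are not valid in the requested encoding, so a file read with the wrong encoding typically shows up with that character.

```clojure
;; Hypothetical names, for illustration only.
;; Accented letters common in Portuguese (non-exhaustive, as in the script).
(def accented-letters
  (set "àáâãçèéêíõóôúûüÀÁÂÃÇÈÉÊÍÕÓÔÚÛ"))

(defn allowed-char?
  "True for newline, tab, carriage return, the code points 0x20-0x7D,
  and the accented letters listed above."
  [c]
  (let [code (int c)]
    (boolean
      (or (#{\newline \tab \return} c)
          (<= 0x20 code 0x7D)
          (accented-letters c)))))

(defn suspicious-chars
  "Returns the characters of string s that are not allowed."
  [s]
  (remove allowed-char? s))

(suspicious-chars "ação")            ; => ()
(suspicious-chars "a\uFFFD\uFFFDo")  ; => the two U+FFFD characters
```

A set of characters works as the membership test here because Clojure sets are functions of their elements; extending the allowed alphabet only means adding letters to the string.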
When writing this script, the following pages were useful:
Basic Latin and Control characters
http://en.wikipedia.org/wiki/C0_Controls_and_Basic_Latin
Latin1 Supplement
http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement
Latin characters in Unicode
http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
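The conversion step itself can be tried safely on a temporary file before touching a real project. The sketch below is an assumption-laden example, not part of the script: `convert-encoding!` mirrors the script's conversion function under a hypothetical name, and `round-trip-demo` is a hypothetical helper that writes a Portuguese word as ISO-8859-1, converts it to UTF-8, and reports the file size before and after.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical copy of the script's conversion function: reads the whole
;; file with one encoding and overwrites it using another.
(defn convert-encoding! [f from-encoding to-encoding]
  (let [contents (slurp f :encoding from-encoding)]
    (spit f contents :encoding to-encoding)))

;; Hypothetical demo helper. The word is "coração", written with \u escapes
;; so that the encoding of this source file does not matter.
(defn round-trip-demo []
  (let [tmp (java.io.File/createTempFile "encoding-demo" ".txt")]
    (try
      (spit tmp "cora\u00e7\u00e3o" :encoding "ISO-8859-1")
      (let [bytes-before (.length tmp)]
        (convert-encoding! tmp "ISO-8859-1" "UTF-8")
        {:bytes-before bytes-before
         :bytes-after  (.length tmp)
         :text         (slurp tmp :encoding "UTF-8")})
      (finally
        (.delete tmp)))))

(round-trip-demo)
```

The file grows from 7 to 9 bytes because the two accented letters take one byte each in ISO-8859-1 and two bytes each in UTF-8. Since the conversion overwrites the file in place, running it against a copy or a version-controlled tree is the safer choice.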