I found a link to this now-vanished Clojure wiki page, in which Rich gives a pretty good explanation of Clojure's support for fast math. I then found it again at http://clojure.org/old_news and in the original thread on the ML. The document illustrates the rationale behind some early Clojure optimisations that are still an integral part of Clojure today. The changes Rich describes here mainly enable the JVM JIT to inline primitive operations and bring Clojure's performance on primitive types (including arrays) on par with Java.
Below I have also added the related IRC chat, which provides some additional explanation.
You can consider this an extended commentary on the work spanning (roughly) the commits from https://github.com/clojure/clojure/commit/5d9e87e8a30dfc99fe41c3be0a8ddc5ddd4e1d8d to https://github.com/clojure/clojure/commit/6739ae8284307b899d8589c70a7f3b7c7f75cae7.
Dismayed by a recent request to embed Java code in Clojure code (yuck), I've tried over the past week to address the only area in which such an endeavor might be reasonable: to attain the arithmetic performance of the Java primitives, and I'm happy to report much success in making that possible directly in Clojure.
This has traditionally been a difficult area for dynamic languages, with their emphasis on treating objects, including numbers, uniformly. On the JVM, that means 'boxing' numbers into some Object (for Clojure, these Objects are the derivatives of java.lang.Number - Integer, Long etc.). The JVM doesn't have support for fast arithmetic with such Numbers, only for 'primitives', the special types int, long, double, float in Java.
Lisp (and Clojure) imposes an additional constraint, that the math be correct - i.e. integer arithmetic should not silently produce corrupt values due to overflow.
Clojure already supported auto-boxing/unboxing in calls to Java methods that take/return primitives, and type inference that tracks the types of calls in order to determine the types of subsequent calls. The trick is to get to the high-performance primitives without turning Clojure into Java, specifically, avoiding lots of type declarations, and lots of additional complexity in the compiler for generating bytecode for primitives.
All of this leans on a tremendous capability of the JVM's HotSpot runtime - given a call to a function whose body contains a call to a primitive operation, and frequent use in a running program, HotSpot will, at runtime, inline those calls by substituting the primitive operation. This means that Clojure's bytecode generator need not know about, nor generate, the bytecodes for arithmetic primitive operations (it doesn't). Note that this inlining capability is limited to within a single fn definition - across fn calls all args/returns are Objects.
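A quick sketch of mine (not part of Rich's original text) of what the inference and interop support buy you: the #^String hint lets the compiler resolve the Java call directly, the primitive int it returns feeds a primitive + inside the single fn body, and the result is boxed only when it leaves the fn.
(defn padded-len [#^String s]
  ;; .length returns a primitive int and (int 2) is a primitive
  ;; literal, so the + here is the primitive overload that HotSpot
  ;; can inline; the result is boxed only on return from the fn.
  (+ (.length s) (int 2)))

(padded-len "hello")   ; => 7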
In order to facilitate this I've:
- Added support for direct passing of primitives between Java method calls with no intervening boxing.
- Added support for conditional test of boolean method returns without conversion to Boolean.
- Added support for compiler inlining - a specification to the compiler that a function may be inlined, by providing a macro-like expander to be used to replace a direct call to the function. The difference vs a macro is that the resulting function is a proper fn, and can be used as a value, passed to map/reduce/filter etc.
- Added support for let/loop-bound locals of primitive types, and the inference to support that. They now have the inferred, possibly primitive type of their init-form.
- Added support for recurs that rebind primitive locals without boxing, and type-checking for same.
- Overloaded arithmetic (+,-,*,/,inc,dec,<,<=,>,>= etc) for primitive types where semantics are same.
- Overloaded aget/aset for arrays of primitives (see the sketch after this list)
- Added overloaded aclone, alength for arrays of primitives
- New constructor fns for primitive arrays: float-array, int-array, etc
- Added special type hints for primitive arrays - #^ints, #^floats, #^longs, #^doubles
- Coercion ops int/long/float/double now produce primitives
- Added a num coercion which boxes primitives
- Added cast ops ints/longs/floats/doubles which produce int[]/long[]/float[]/double[] types
- Made seq much faster for arrays of primitives
- Added a set of "unchecked" operations for utmost performance, but potentially unsafe, integer (int/long) ops: unchecked-add/subtract/multiply/divide/inc/dec
- Added amap and areduce macros for functionally (i.e. non-destructively) processing one or more arrays in order to produce a new array or aggregate value respectively.
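To make a few of these concrete, here is a small sketch of mine (not part of the original post) exercising the primitive-array constructors, the overloaded aget/aset/alength, the #^ints hint, unchecked-inc, and seq over a primitive array:
;; a primitive int[] via the new constructor fn
(def xs (int-array 5))
(aset xs 0 (int 42))   ; overloaded aset, no boxing of the value
(aget xs 0)            ; => 42
(alength xs)           ; => 5

;; #^ints hints the argument as int[], so aget/aset and the loop
;; counter stay primitive; unchecked-inc skips the overflow check
(defn fill-squares [#^ints a]
  (loop [i (int 0)]
    (if (< i (alength a))
      (do (aset a i (* i i))
          (recur (unchecked-inc i)))
      a)))

(seq (fill-squares (int-array 6)))   ; => (0 1 4 9 16 25)
Note that the top-level var xs isn't hinted, so those first few calls go through the slower generic path; inside fill-squares the hint keeps everything primitive.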
The bottom line - rather than write this Java:
static public float asum(float[] xs){
  float ret = 0;
  for(int i = 0; i < xs.length; i++)
    ret += xs[i];
  return ret;
}
you can now write this Clojure:
(defn asum [#^floats xs]
  (areduce xs i ret (float 0)
    (+ ret (aget xs i))))
and the resulting code is exactly the same speed (when run with java -server). Similarly:
static public float[] vmul(float[] x, float[] ys){
  final float[] xs = x.clone();
  for(int i = 0; i < xs.length; i++)
    xs[i] *= ys[i];
  return xs;
}
becomes:
(defn amul [#^floats xs #^floats ys]
  (amap xs i ret
    (* (aget ret i) (aget ys i))))
and the resulting code is exactly the same speed. One notable difference is the behavior of integer primitives (int/long) when overflowing:
(.MAX_VALUE Integer)
2147483647
(+ 2147483647 2147483647) ;generic
4294967294
(+ (.MAX_VALUE Integer) (.MAX_VALUE Integer)) ;primitive
java.lang.ArithmeticException: integer overflow
(+ (num (.MAX_VALUE Integer)) (.MAX_VALUE Integer)) ;generic
4294967294
That is, all numeric literals are boxed, and all boxed arithmetic is generic (e.g. promotes type on overflow), and generic arithmetic is contagious, but unboxed primitives throw exceptions on overflow (except the unchecked- ones, which do as Java does):
(unchecked-add (.MAX_VALUE Integer) (.MAX_VALUE Integer)) ;unchecked, unsafe
-2
The best aspect of this is that you need not do anything special in your initial coding. Quite often these optimizations are unneeded. Then, should a bit of code be the bottleneck, you can speed it up with minor adornment:
(defn foo [n]
  (loop [i 1]
    (if (< i n)
      (recur (inc i))
      i)))
(time (foo 100000))
"Elapsed time: 1.428 msecs"
100000
(defn foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (unchecked-inc i))
        i))))
(time (foo2 100000))
"Elapsed time: 0.032 msecs"
100000
Feedback welcome as always,
Rich
Jun 2, 2008 3:33 PM
IRC Chat Transcript (http://clojure-log.n01se.net/date/2008-06-02.html)
rhickey: http://clojure.org/news/primitive_support.html
16:41 Chouse1: very cool.
The final example on that page is very important.
rhickey: yeah, I don't want people to go crazy over perf
Chouse1: sure, and as justification of the whole solution, as an answer to, "What's wrong with writing little bits of inner-loop code in Java (whether inlined or not)?"
rhickey: given macros, you'll be able to write much clearer code in Clojure than Java
Chouse1: compiler inlining is interesting -- do you generate that for all clojure functions?
rhickey: No, you have to specify the expansion, and it is only worth it for calls to Java that involve primitives
Chouse1: so is there new clojure syntax for that?
rhickey: Plus, once inlined they are bound into the calling fn, no fixes, no dynamic rebinding
Chouse1: huh! I guess that makes sense.
rhickey: The compiler uses the :inline and :inline-arities metadata
plus there's a definline macro
should only be used for one-liners
you can see it in action in boot.clj, math stuff
Chouse1: so that gets inlined when a calling function is defined? that's roughly the same time as a called macro would be expanded, right?
rhickey: right, like a macroexpansion
completely at the option of the compiler, the code should be correct if the compiler chooses not to inline
Chouse1: the way lisp mixes compile/eval steps with each other still occasionally blows my mind.
seq is much faster for Java arrays, but fast enough to leave out areduce?
er, "not fast enough..?"
rhickey: (def fa (float-array (range 1000000)))
user=> (time (reduce + fa))
"Elapsed time: 41.666 msecs"
4.9994036E11
user=> (time (asum fa))
"Elapsed time: 1.446 msecs"
4.9994036E11
fast and faster
you choose
if you were doing serious graphics/audio stuff, you'd go with the latter
otherwise, use whatever is clearest
the vectors are no slouches either:
(def va (vec (map float (range 1000000))))
user=> (time (reduce + va))
"Elapsed time: 128.676 msecs"
4.9994036E11
That's a persistent collection and all boxed numbers
Chouse1: is primmath.clj not loaded by default like xml.clj and the others are?
rhickey: primmath is history
Chouse1: oh. no wonder it doesn't work. ;-)
rhickey: will be gone soon
Chouse1: where's asum?
rhickey: I haven't decided on what pre-defined array fns I'll supply, given I've just given everyone the tools to make their own
primmath.clj is now gone
Chouse1: ok, I was just trying to reproduce your timing results.
rhickey: the amap code is in the news item
Chouse1: ah. :-)
rhickey: asum
amap and areduce are in boot.clj, asum and amul are in the news item
Chouse1: Would this approximate the behavior prior to your recent changes? (reduce + (into-array (range 1000000)))
rhickey: you should make the array outside of the time call to be fair
Chouse1: oh, of course.
rhickey: also note that if you are using integers, only the generic math will work
for that number
also into-array will make an array of boxed, you could have done (make-array (. Integer TYPE) 1000000)
but the dominant difference there will be primitive array seq, which used reflection. The new code is very much faster, before you get into the math ops
manipulation of arrays of primitives is hundreds of times faster than it was
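To round out the :inline/definline discussion above, here is a minimal sketch of mine (square is a made-up example, not one of the boot.clj math ops Rich mentions), assuming the definline shape that ended up in boot.clj: the body is an expansion template, as with a macro, but the result is also a proper fn that can be passed around as a value.
(definline square [x]
  ;; the body is an expansion template: calls to square may be
  ;; expanded inline at the call site, yet square is still a fn
  `(* ~x ~x))

(square 3)             ; => 9, a candidate for inline expansion
(map square [1 2 3])   ; => (1 4 9), used as a value like any fn
As Rich notes in the chat, inlining is entirely at the compiler's option, so the defn'd fn must be correct on its own even when the call is not inlined.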