transducer-vs-unrolled
; I was curious as to whether unrolling
; this use of transducers would be faster
;
; in fact it's slower
;
; we're given an array which we
; process using fns f, g and h
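;
; f, g, h and the test array tb1 aren't shown
; in the gist - the stubs below are assumed
; placeholders so the examples compile;
; the originals may differ
(defn f [^bytes data i j]
  (+ (aget data i) (aget data j)))

(defn g [a b c] (+ a b c))

(defn h [a b] (* a b))

(def tb1 (byte-array 64))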
(defn test1-xf [^bytes data]
  (comp
    (map (fn [i] (f data i (+ 1 i))))
    (partition-all 3)
    (map (fn [p] (apply g p)))
    (partition-all 2)
    (map (fn [p] (apply h p)))))

; using transducers
; Execution time mean : 5.355734 µs
(defn test1 [^bytes data]
  (sequence (test1-xf data)
    (range 12 (* 12 4) 2)))
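; (range 12 (* 12 4) 2) yields 18 indices, so
; the pipeline makes 18 calls to f, 6 to g and
; 3 to h - matching the three h-forms in the
; expansion below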
; unrolled
; Execution time mean : 8.817436 µs
(defmacro make-test2 []
  `(defn ~'test2 [~'data]
     ~(vec
        (sequence
          (comp
            (map (fn [i] `(f ~'data ~i ~(+ 1 i))))
            (partition-all 3)
            (map (fn [p] `(g ~@p)))
            (partition-all 2)
            (map (fn [[a g]] `(h ~a ~g))))
          (range 12 (* 12 4) 2)))))
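; (the ~' trick: inside syntax-quote a bare
; symbol is namespace-qualified, so ~'test2
; and ~'data splice in plain, unqualified
; symbols instead)
(comment
  `data    ;=> user/data
  `~'data  ;=> data
  )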
(macroexpand '(make-test2))
=>
(def test2
  (clojure.core/fn
    ([data]
     [(h
       (g
        (f data 12 13)
        (f data 14 15)
        (f data 16 17))
       (g
        (f data 18 19)
        (f data 20 21)
        (f data 22 23)))
      (h
       (g
        (f data 24 25)
        (f data 26 27)
        (f data 28 29))
       (g
        (f data 30 31)
        (f data 32 33)
        (f data 34 35)))
      (h
       (g
        (f data 36 37)
        (f data 38 39)
        (f data 40 41))
       (g
        (f data 42 43)
        (f data 44 45)
        (f data 46 47)))])))
...or not
Changing the unrolling macro to:
(defmacro make-test2 []
`(defn ~'test2 [~'data]
~(cons 'list
(sequence
(comp
(map (fn [i] `(f ~'data ~i ~(+ 1 i))))
(partition-all 3)
(map (fn [p] `(g ~@p)))
(partition-all 2)
(map (fn [[a g]] `(h ~a ~g))))
(range 12 (* 12 4) 2)))))
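so that the unrolled calls are assembled with list instead of into a vector literal. The expansion (abbreviated here, derived from the macro rather than re-run) now begins:
(def test2
  (clojure.core/fn
    ([data]
     (list
      (h (g (f data 12 13) (f data 14 15) (f data 16 17))
         (g (f data 18 19) (f data 20 21) (f data 22 23)))
      ;; ...two more h forms as before
      ))))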
(quick-bench (test1 tb1))
Evaluation count : 130500 in 6 samples of 21750 calls.
Execution time mean : 5.213332 µs
Execution time std-deviation : 193.164539 ns
Execution time lower quantile : 5.041071 µs ( 2.5%)
Execution time upper quantile : 5.489332 µs (97.5%)
Overhead used : 9.090463 ns
=> nil
(quick-bench (test2 tb1))
Evaluation count : 59136 in 6 samples of 9856 calls.
Execution time mean : 8.770583 µs
Execution time std-deviation : 80.080528 ns
Execution time lower quantile : 8.657538 µs ( 2.5%)
Execution time upper quantile : 8.861252 µs (97.5%)
Overhead used : 9.090463 ns
Still slower! It seems that transducers are as fast as possible when composing sequences of functional transformations, and unrolling such operations into ugly code actually results in slightly slower code.
full benchmarks:
(bench (test2 tb1))
Evaluation count : 5622060 in 60 samples of 93701 calls.
Execution time mean : 10.728191 µs
Execution time std-deviation : 68.035173 ns
Execution time lower quantile : 10.620603 µs ( 2.5%)
Execution time upper quantile : 10.874791 µs (97.5%)
Overhead used : 9.090463 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
=> nil
(bench (test1 tb1))
Evaluation count : 10959660 in 60 samples of 182661 calls.
Execution time mean : 5.583302 µs
Execution time std-deviation : 51.027292 ns
Execution time lower quantile : 5.499769 µs ( 2.5%)
Execution time upper quantile : 5.699451 µs (97.5%)
Overhead used : 9.090463 ns
Found 5 outliers in 60 samples (8.3333 %)
low-severe 4 (6.6667 %)
low-mild 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Tufte profiling gives these times for partition-all and map:
pId    nCalls        Min     50% ≤    90% ≤    95% ≤    99% ≤      Max     Mean    MAD      Clock  Total

:map  360,000   363.00ns   2.24μs   6.16μs   6.95μs  11.76μs   4.71ms   2.99μs   ±80%      1.07s   129%
:pa   320,000   116.00ns 153.00ns   5.85μs   6.27μs  10.94μs   4.71ms   1.99μs  ±115%   638.06ms    76%

Accounted                                                                                  1.71s   205%
Clock                                                                                   834.28ms   100%
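For reference, a sketch of how such per-step timings might be captured with Tufte (the ids and the exact instrumentation points are assumptions, not the original code):
(require '[taoensso.tufte :as tufte :refer [p profile]])

(tufte/add-basic-println-handler! {})

;; wrap each mapping step in a p block so tufte
;; can attribute time to it (ids are hypothetical)
(defn test1-profiled [^bytes data]
  (sequence
    (comp
      (map (fn [i] (p :f (f data i (+ 1 i)))))
      (partition-all 3)
      (map (fn [q] (p :g (apply g q))))
      (partition-all 2)
      (map (fn [q] (p :h (apply h q)))))
    (range 12 (* 12 4) 2)))

;; run it enough times for stable numbers
(profile {} (dotimes [_ 10000] (doall (test1-profiled tb1))))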
update: it seems the difference is accounted for by the vec in the macro-generated function. sequence is lazy, so (test1 tb1) alone doesn't pay for realizing its results; if I instead profile with (quick-bench (vec (test1 test-data))) then I get that the transducer code is a few microseconds slower than the unrolled version.
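A minimal sketch of that fairer comparison, forcing realization on both sides (assuming criterium is loaded and tb1 is the test array from above):
(require '[criterium.core :refer [quick-bench]])

;; test1 returns a lazy sequence - vec forces it;
;; test2 already builds its result eagerly
(quick-bench (vec (test1 tb1)))
(quick-bench (test2 tb1))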