@Hendekagon
Last active December 8, 2020 18:09
transducer-vs-unrolled
; I was curious as to whether unrolling
; this use of transducers would be faster
;
; in fact it's slower
; we're given an array which we
; process using fns f, g and h
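; f, g and h are not defined in the gist; here are hypothetical
; stand-ins, just so the snippets below compile. They are shaped to
; fit the pipeline (f reads two bytes, g takes 3 results, h takes 2):

```clojure
;; Hypothetical stand-ins for the gist's undefined f, g and h.
;; The real fns presumably do meaningful byte-level work.
(defn f [^bytes data i j]
  (+ (aget data (int i)) (aget data (int j))))

(defn g [a b c] (+ a b c))

(defn h [a b] (* a b))
```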
(defn test1-xf [^bytes data]
  (comp
    (map (fn [i] (f data i (+ 1 i))))
    (partition-all 3)
    (map (fn [p] (apply g p)))
    (partition-all 2)
    (map (fn [p] (apply h p)))))
; using transducers
; Execution time mean : 5.355734 µs
(defn test1 [^bytes data]
  (sequence (test1-xf data)
            (range 12 (* 12 4) 2)))
; unrolled
; Execution time mean : 8.817436 µs
(defmacro make-test2 []
  `(defn ~'test2 [~'data]
     ~(vec
        (sequence
          (comp
            (map (fn [i] `(f ~'data ~i ~(+ 1 i))))
            (partition-all 3)
            (map (fn [p] `(g ~@p)))
            (partition-all 2)
            (map (fn [[a g]] `(h ~a ~g))))
          (range 12 (* 12 4) 2)))))
(macroexpand '(make-test2))
=>
(def test2
(clojure.core/fn
([data]
[(h
(g
(f data 12 13)
(f data 14 15)
(f data 16 17))
(g
(f data 18 19)
(f data 20 21)
(f data 22 23)))
(h
(g
(f data 24 25)
(f data 26 27)
(f data 28 29))
(g
(f data 30 31)
(f data 32 33)
(f data 34 35)))
(h
(g
(f data 36 37)
(f data 38 39)
(f data 40 41))
(g
(f data 42 43)
(f data 44 45)
(f data 46 47)))])))

update: it seems the difference is accounted for by the `vec` in the macro-generated function - if I profile with `(quick-bench (vec (test1 test-data)))` then I get that the transducer code is a few microseconds slower than the unrolled version
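The asymmetry being corrected here: `test1`'s `sequence` returns a lazy seq, while the macro baked a vector literal into `test2`, so wrapping `test1` in `vec` makes both calls pay for building a realized collection. A sketch of the like-for-like calls, assuming criterium is on the classpath and a 48-byte test array:

```clojure
;; Like-for-like profiling: realize test1's lazy seq into a vector
;; so both versions return a concrete collection.
(require '[criterium.core :refer [quick-bench]])

(def test-data (byte-array 48))

(quick-bench (vec (test1 test-data)))  ; transducers, then vec
(quick-bench (test2 test-data))        ; unrolled, already a vector
```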


Hendekagon commented Feb 24, 2020

...or not

Changing the unrolling macro to:

(defmacro make-test2 []
  `(defn ~'test2 [~'data]
    ~(cons 'list
      (sequence
        (comp
          (map (fn [i] `(f ~'data ~i ~(+ 1 i))))
          (partition-all 3)
          (map (fn [p] `(g ~@p)))
          (partition-all 2)
          (map (fn [[a g]] `(h ~a ~g))))
        (range 12 (* 12 4) 2)))))

(quick-bench (test1 tb1))
Evaluation count : 130500 in 6 samples of 21750 calls.
             Execution time mean : 5.213332 µs
    Execution time std-deviation : 193.164539 ns
   Execution time lower quantile : 5.041071 µs ( 2.5%)
   Execution time upper quantile : 5.489332 µs (97.5%)
                   Overhead used : 9.090463 ns
=> nil
(quick-bench (test2 tb1))
Evaluation count : 59136 in 6 samples of 9856 calls.
             Execution time mean : 8.770583 µs
    Execution time std-deviation : 80.080528 ns
   Execution time lower quantile : 8.657538 µs ( 2.5%)
   Execution time upper quantile : 8.861252 µs (97.5%)
                   Overhead used : 9.090463 ns

Still slower! It seems that transducers are about as fast as is possible when composing sequences of functional transformations, and unrolling such operations into ugly code actually results in slightly slower code.
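One more variant that might be worth measuring (not benchmarked in the gist): driving the same transducer eagerly with `into`, which goes through `transduce` and skips the lazy-seq machinery that `sequence` sets up:

```clojure
;; Eager counterpart of test1: into drives the transducer via
;; transduce, producing a vector with no intermediate lazy seq.
(defn test1-eager [^bytes data]
  (into [] (test1-xf data) (range 12 (* 12 4) 2)))
```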


full benchmarks:

(bench (test2 tb1))
Evaluation count : 5622060 in 60 samples of 93701 calls.
             Execution time mean : 10.728191 µs
    Execution time std-deviation : 68.035173 ns
   Execution time lower quantile : 10.620603 µs ( 2.5%)
   Execution time upper quantile : 10.874791 µs (97.5%)
                   Overhead used : 9.090463 ns

Found 1 outliers in 60 samples (1.6667 %)
	low-severe	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
=> nil
(bench (test1 tb1))
Evaluation count : 10959660 in 60 samples of 182661 calls.
             Execution time mean : 5.583302 µs
    Execution time std-deviation : 51.027292 ns
   Execution time lower quantile : 5.499769 µs ( 2.5%)
   Execution time upper quantile : 5.699451 µs (97.5%)
                   Overhead used : 9.090463 ns

Found 5 outliers in 60 samples (8.3333 %)
	low-severe	 4 (6.6667 %)
	low-mild	 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers


Hendekagon commented Feb 25, 2020

Tufte profiling gives the times for the partition-all and map steps:

  pId     nCalls        Min      50% ≤      90% ≤      95% ≤      99% ≤        Max       Mean   MAD       Total   Clock

  :map    360,000   363.00ns     2.24μs     6.16μs     6.95μs    11.76μs     4.71ms     2.99μs  ±80%      1.07s     129%
   :pa    320,000   116.00ns   153.00ns     5.85μs     6.27μs    10.94μs     4.71ms     1.99μs ±115%    638.06ms     76%

Accounted 1.71s 205%
Clock 834.28ms 100%
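For reference, the `:map` row above could be collected by wrapping each mapped step in Tufte's `p` macro; a hedged sketch, assuming `taoensso.tufte` is a dependency (timing `partition-all` itself, the `:pa` row, would need a custom wrapping transducer, elided here):

```clojure
;; Hedged sketch of Tufte instrumentation for the map steps.
;; Each mapped fn is wrapped in tufte/p so its per-call time is
;; recorded under the :map id.
(require '[taoensso.tufte :as tufte])

(tufte/add-basic-println-handler! {})

(defn test1-profiled [^bytes data]
  (tufte/profile {}
    (doall
      (sequence
        (comp
          (map (fn [i] (tufte/p :map (f data i (+ 1 i)))))
          (partition-all 3)
          (map (fn [xs] (tufte/p :map (apply g xs))))
          (partition-all 2)
          (map (fn [xs] (tufte/p :map (apply h xs)))))
        (range 12 (* 12 4) 2)))))
```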
