@simonster
Last active August 29, 2015 14:11
missingness sum benchmark
# Bool wrapper whose * is a branch-free select, avoiding a Bool -> Float conversion
immutable CheapMulBool
    x::Bool
end
# Lazily negated missingness mask (true = missing, so multiply by the complement)
immutable Negated{T}
    x::T
end
# Raw access to a BitArray's Uint64 chunks
immutable BitWrapper
    x::Vector{Uint64}
end

import Base: *  # needed so the method below extends Base.* instead of shadowing it
Base.getindex(x::BitWrapper, i) = CheapMulBool(!Base.unsafe_bitgetindex(x.x, i))
Base.getindex(x::Negated, i) = CheapMulBool(unsafe_load(convert(Ptr{Uint8}, pointer(x.x)), i) == 0)
*(x::CheapMulBool, y::Number) = ifelse(x.x, y, zero(y))

function f(x, y)
    z = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds z += y[i]*x[i]
    end
    z
end
x = randn(10000000)
ybool = rand(Bool, length(x))
ybit = bitpack(ybool)
yfloat = convert(Vector{Float64}, !ybool)
f(x, Negated(ybool))
f(x, BitWrapper(ybit.chunks))
f(x, yfloat)
@time for i = 1:50 f(x, Negated(ybool)) end
@time for i = 1:50 f(x, BitWrapper(ybit.chunks)) end
@time for i = 1:50 f(x, yfloat) end
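For readers without a 2015-era Julia at hand, the three masking strategies being compared (byte mask, packed bit mask, precomputed float mask) can be sketched as a NumPy analogue. This is illustrative only, not the benchmark itself; all names here are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
missing = rng.integers(0, 2, size=x.size).astype(bool)  # True = missing

# 1. Byte mask (the Negated analogue): multiply by the complement of the mask
s_byte = np.sum(x * ~missing)

# 2. Bit mask (the BitWrapper analogue): pack to one bit per element,
#    unpack chunk-wise, then select
packed = np.packbits(~missing)
unpacked = np.unpackbits(packed, count=x.size).astype(bool)
s_bit = np.sum(x[unpacked])

# 3. Float mask (the yfloat analogue): premultiply the mask into Float64
fmask = (~missing).astype(np.float64)
s_float = np.sum(x * fmask)

assert abs(s_byte - s_bit) < 1e-6
assert abs(s_byte - s_float) < 1e-6
```

All three compute the same masked sum; the gist's question is which representation the compiler can turn into the fastest vectorized loop.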
@simonster

julia> @time for i = 1:50 f(x, Negated(ybool)) end
elapsed time: 0.378913358 seconds (1600 bytes allocated)

julia> @time for i = 1:50 f(x, BitWrapper(ybit.chunks)) end
elapsed time: 1.023556274 seconds (1600 bytes allocated)

julia> @time for i = 1:50 f(x, yfloat) end
elapsed time: 0.507922349 seconds (800 bytes allocated)

@simonster

Or with LLVM SVN:

julia> @time for i = 1:50 f(x, Negated(ybool)) end
elapsed time: 0.235930301 seconds (1600 bytes allocated)

julia> @time for i = 1:50 f(x, BitWrapper(ybit.chunks)) end
elapsed time: 1.765737635 seconds (1600 bytes allocated)

julia> @time for i = 1:50 f(x, yfloat) end
elapsed time: 0.494715477 seconds (800 bytes allocated)

@simonster

If the BitArray indexing is unrolled, we can do quite well with BitArrays in LLVM SVN and julia -O:

@eval begin
function f(x, y::BitArray)
    chunks = y.chunks
    # 16 independent partial accumulators to break the add dependency chain
    $([:($(symbol("z_$i")) = zero(eltype(x))) for i = 1:16]...)
    j = 0
    @inbounds @simd for i = 1:length(chunks)
        bits = chunks[i]
        # fully unrolled: test each of the 64 bits of the chunk
        $([:($(symbol("z_$((j-1)%16+1)")) += ifelse(bits & (uint64(1) << $(j-1)) != 0, x[j+1], zero(eltype(x))); j += 1) for j = 1:64]...)
    end
    z = zero(eltype(x))
    $([:(z += $(symbol("z_$i"))) for i = 1:16]...)
    z
end
end
julia> @time for i = 1:50 f(x, ybit) end
elapsed time: 0.264525551 seconds (800 bytes allocated)

Even this appears to be suboptimal, since LLVM isn't vectorizing the loads from x and isn't using AVX2 instructions on my Haswell CPU.
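The shape of that unrolled kernel (walk the packed mask one 64-bit chunk at a time, test each bit, and spread the adds across several independent accumulators) can be sketched in plain Python. This is a readability aid, not a performance claim; the function name and the choice of 4 accumulators are mine (the gist uses 16).

```python
NACC = 4  # number of partial accumulators (the gist uses 16)

def masked_sum_chunks(x, chunks):
    """Sum x[j] where bit j of the packed mask `chunks` is set.

    Chunks are 64-bit integers holding the mask least-significant-bit
    first, as Julia's BitArray stores its chunks.
    """
    acc = [0.0] * NACC
    j = 0
    for bits in chunks:
        for b in range(64):
            if j >= len(x):
                break
            if bits & (1 << b):
                # rotate through the accumulators so consecutive adds
                # form independent dependency chains
                acc[b % NACC] += x[j]
            j += 1
    return sum(acc)
```

For example, `masked_sum_chunks([1.0, 2.0, 3.0, 4.0], [0b1010])` selects elements 1 and 3 (bits 1 and 3 of the chunk), giving 6.0.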

@simonster

And for comparison, the NaN-based approach, again on LLVM SVN:

function g(x)
    z = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds z += ifelse(isnan(x[i]), zero(eltype(x[i])), x[i])
    end
    z
end
julia> @time for i = 1:50; g(x); end
elapsed time: 0.206984494 seconds (800 bytes allocated)

(This runs at essentially the speed of an ordinary sum.)
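The NaN-sentinel strategy has a direct NumPy analogue: represent missing values as NaN in the data itself, then do a branch-free select before summing, mirroring the `ifelse(isnan(x[i]), 0, x[i])` kernel above. Again an illustrative sketch, not the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
x[rng.integers(0, 2, size=x.size).astype(bool)] = np.nan  # mark missing as NaN

# Branch-free masked sum: replace NaN with 0.0, then sum everything
s = np.where(np.isnan(x), 0.0, x).sum()

assert np.isclose(s, np.nansum(x))
```

Because the mask is implicit in the data, no second array needs to be streamed through memory, which is why this approach matches a plain sum.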

@nalimilan

So the moral seems to be that the BitArray approach can potentially be made quite fast in the future. Any idea why you observe a slowdown with LLVM SVN compared to 3.4 (or 3.5?) when not unrolling?

@simonster

The comparison is to LLVM 3.3, which partially vectorizes the loop whereas LLVM SVN doesn't vectorize it at all. I can't explain why.
