implement fast isascii for String #30400
KristofferC commented Dec 15, 2018
Why does btime show 1 allocation? |
Probably some allocation with passing the closure wrapping |
The loop version does seem somewhat faster though, at least on v1.0.0:
function isascii_loop(s::String)
    for i = 1:sizeof(s)
        if @inbounds(codeunit(s, i)) >= 0x80; return false; end
    end
    return true
end
Has the timing
julia> @btime isascii_loop($s)
  4.857 μs (0 allocations: 0 bytes)
true
vs. the
julia> @btime isascii2($s)
  5.917 μs (1 allocation: 16 bytes)
true |
I'll rewrite this with the loop then. Might as well squeeze out everything we can from such a simple function. |
For more fun on a rainy day:
julia> function isascii_faster(s::String)
           stride = 32
           l = sizeof(s)
           l2 = l - (l % stride)  # largest multiple of stride that fits in s
           for i = 1:stride:l2
               # load bytes i, i+1, ..., i+stride-1 into a tuple
               n = ntuple(ii -> @inbounds(codeunit(s, i + ii - 1)), Val(stride))
               # OR all of them together; the high bit survives if any byte has it set
               foldl_fast(|, n...) >= 0x80 && return false
           end
           for i = (l2 + 1):l  # remaining bytes, one at a time
               @inbounds(codeunit(s, i)) >= 0x80 && return false
           end
           return true
       end;
julia> @inline foldl_fast(op, a, rest...) = isempty(rest) ? a : op(a, foldl_fast(op, rest...));
Note that the compiler breaks down at N=33 and stops optimizing for us, so I can't test a larger stride. Also, we get almost all of the benefit with a stride of 3 (1.902 μs), and would get better results on small strings, but that's not the point :P |
just a note that in Jameson's code
is a typo, use |
Wouldn't it be even faster to load 8 bytes at a time into a |
In particular, this is 3–4× faster than
function isascii_word(s::String)
    len = sizeof(s)
    nwords = len >> @static(Sys.WORD_SIZE == 64 ? 3 : 2)
    p = Ptr{UInt}(pointer(s))
    GC.@preserve s for i = 1:nwords
        iszero(unsafe_load(p, i) & (0x8080808080808080 % UInt)) || return false
    end
    for i = nwords*sizeof(UInt)+1:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end |
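The mask test at the heart of the word-at-a-time loop can be tried without any pointer loads. The following is a minimal sketch in plain Julia; `pack_word`, `HIGH_BITS`, and `word_isascii` are hypothetical names introduced here for illustration and do not appear in the thread.

```julia
# Hypothetical helper (not from the thread): pack up to 8 bytes into a UInt64,
# least-significant byte first.
function pack_word(bytes::AbstractVector{UInt8})
    w = UInt64(0)
    for (k, b) in enumerate(bytes)
        w |= UInt64(b) << (8 * (k - 1))
    end
    return w
end

# Bit 7 of every byte in a 64-bit word.
const HIGH_BITS = 0x8080808080808080

# A word holds only ASCII bytes exactly when no byte has its high bit set.
word_isascii(w::UInt64) = iszero(w & HIGH_BITS)

ascii_word = pack_word(codeunits("abcdefgh"))
mixed_word = pack_word(codeunits("abcdéfg"))  # 'é' encodes as two bytes: 0xc3 0xa9

@show word_isascii(ascii_word)
@show word_isascii(mixed_word)
```

One AND and one comparison cover eight bytes, which is where the speedup over the byte loop comes from.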
I've often thought that we should use this kind of technique to optimize other |
For example, here is a
function length_fast(s::String) # assumes isvalid(s)
    len = sizeof(s)
    nchars = 0
    nwords = len >> @static(Sys.WORD_SIZE == 64 ? 3 : 2)
    p = Ptr{UInt}(pointer(s))
    GC.@preserve s for i = 1:nwords
        word = unsafe_load(p, i)
        mask = word & (0x8080808080808080 % UInt)
        nchars += sizeof(UInt)
        if !iszero(mask)
            nchars += count_ones(word & (mask >> 1)) - count_ones(mask)
        end
    end
    for i = nwords*sizeof(UInt)+1:len
        byte = codeunit(s, i)
        nchars += iszero(byte & 0x80) | !iszero(byte & 0x40)
    end
    return nchars
end |
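The per-word arithmetic in `length_fast` can be checked on a single 8-byte word. This is a sketch assuming valid UTF-8 and a 64-bit word; `chars_in_word` and `count_starts` are hypothetical names used only for this illustration.

```julia
# Number of characters that start inside one 8-byte word of valid UTF-8.
function chars_in_word(word::UInt64)
    mask = word & 0x8080808080808080        # bit 7 of every non-ASCII byte
    leads = count_ones(word & (mask >> 1))  # non-ASCII bytes with bit 6 set: lead bytes
    return 8 + leads - count_ones(mask)     # 8 bytes minus the continuation bytes
end

# Hypothetical reference: count the bytes that are not continuation bytes (10xxxxxx).
count_starts(bytes) = count(b -> (b & 0xc0) != 0x80, bytes)

s = "añb€c"  # 1 + 2 + 1 + 3 + 1 = 8 bytes, 5 characters
word = reinterpret(UInt64, collect(codeunits(s)))[1]

@show chars_in_word(word)
@show count_starts(codeunits(s))
@show length(s)
```

The bit counts do not depend on byte order, so the same arithmetic holds on little-endian and big-endian machines.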
Yes, I was actually hoping to convince it to do 32 bytes at a time (with AVX), but I guess it couldn't match it to the appropriate SIMD kernel. |
Note that the fast code should also apply to
We might want to loop over a few bytes in the beginning if the pointer is unaligned. |
Here's a version of my
function isascii_fast(s::Union{String,SubString{String}})
    len = sizeof(s)
    p = Ptr{UInt}(pointer(s))
    misalignment = (UInt(p) & (sizeof(UInt) - 1)) % Int
    if misalignment > 0
        # check bytes one at a time until p is word-aligned (or s ends)
        misalignment = min(sizeof(UInt) - misalignment, len)
        for i = 1:misalignment
            @inbounds(codeunit(s, i)) >= 0x80 && return false
        end
        p += misalignment
    end
    # count whole words only in the bytes that remain after the unaligned prefix
    nwords = (len - misalignment) >> @static(Sys.WORD_SIZE == 64 ? 3 : 2)
    GC.@preserve s for i = 1:nwords
        iszero(unsafe_load(p, i) & (0x8080808080808080 % UInt)) || return false
    end
    for i = nwords*sizeof(UInt)+1+misalignment:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end
I agree that we could potentially do even better with AVX instructions, but it doesn't seem worth the trouble to try to get LLVM to emit these here. |
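The alignment handling amounts to splitting the byte range into an unaligned prefix, a run of whole aligned words, and a byte tail. A minimal sketch of that bookkeeping, using only integer arithmetic; `split_aligned` is a hypothetical helper, not something from the thread.

```julia
# Hypothetical helper: split `len` bytes starting at address `addr` into an
# unaligned prefix, a run of whole aligned words, and a byte tail.
function split_aligned(addr::UInt, len::Int, wordsize::Int = sizeof(UInt))
    off = Int(addr % UInt(wordsize))                  # bytes past the previous word boundary
    prefix = off == 0 ? 0 : min(wordsize - off, len)  # bytes needed to reach alignment
    nwords = (len - prefix) ÷ wordsize                # whole words after the prefix
    tail = len - prefix - nwords * wordsize           # leftover bytes at the end
    return (prefix = prefix, nwords = nwords, tail = tail)
end

@show split_aligned(UInt(0x1003), 16, 8)
@show split_aligned(UInt(0x1000), 16, 8)
```

The point to watch is that the word count is taken from the bytes left after the prefix. Taking it from the full length would let the word loop read past the end of the string whenever the prefix is non-empty.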
Yeah, I've been experimenting with unrolling / reinterpreting but not gaining SIMD makes it a bit sad. Using something like https://github.com/KristofferC/SIMDIntrinsics.jl we can of course just write the SIMD manually (but probably need to deal with alignment) (edit: pasted wrong code below before):
using SIMDIntrinsics.LLVM
function isascii_simd(s::String)
    len = sizeof(s)
    nwords = len >> 5
    _0x80 = LLVM.constantvector(0x80, LLVM.LVec{32, UInt8})
    p = pointer(s)
    i = 0
    GC.@preserve s for _ in 1:nwords
        v = LLVM.load(LLVM.LVec{32, UInt8}, p + i)
        comp = LLVM.and(v, _0x80)
        # Want the next 3 expressions to turn into the `vptest` assembly instruction, this is one way of doing so, might exist more convenient ones.
        u = LLVM.bitcast(LLVM.LVec{4, UInt64}, comp)
        u1, u2, u3, u4 = LLVM.extractelement(u, 0), LLVM.extractelement(u, 1),
                         LLVM.extractelement(u, 2), LLVM.extractelement(u, 3)
        iszero(u1 | u2 | u3 | u4) || return false
        i += 32
    end
    for i = nwords*32+1:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end
julia> @btime isascii($s); # generic on master
10.370 μs (0 allocations: 0 bytes)
julia> @btime isascii_PR($s); # 1 byte per iteration (PR)
4.877 μs (0 allocations: 0 bytes)
julia> @btime isascii_word($s); # 8 bytes per iteration by stevengj
619.540 ns (0 allocations: 0 bytes)
julia> @btime isascii_simd($s); # 32 bytes per iteration
218.646 ns (0 allocations: 0 bytes) |
For higher throughput we could unroll by 4:
using SIMDIntrinsics.LLVM
function isascii_simd(s::String)
    len = sizeof(s)
    nwords = len >> 7
    _0x80 = LLVM.constantvector(0x80, LLVM.LVec{32, UInt8})
    p = pointer(s)
    i = 0
    GC.@preserve s for _ in 1:nwords
        comp = LLVM.constantvector(0x00, LLVM.LVec{32, UInt8})
        for _ in 1:4
            v = LLVM.load(LLVM.LVec{32, UInt8}, p + i)
            comp_i = LLVM.and(v, _0x80)
            # caution: UInt8 addition wraps (0x80 + 0x80 == 0x00), so two non-ASCII
            # bytes in the same lane cancel out; a bitwise OR would not
            comp = LLVM.add(comp, comp_i)
            i += 32
        end
        u = LLVM.bitcast(LLVM.LVec{4, UInt64}, comp)
        u1, u2, u3, u4 = LLVM.extractelement(u, 0), LLVM.extractelement(u, 1),
                         LLVM.extractelement(u, 2), LLVM.extractelement(u, 3)
        iszero(u1 | u2 | u3 | u4) || return false
    end
    for i = nwords*32*4+1:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end
julia> @btime isascii_simd($s)
  121.291 ns (0 allocations: 0 bytes)
The SIMD loop turns into something like:
|
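When several loads are merged before a single zero test, the choice of merge operator matters for `UInt8` lanes. A small sketch in plain Julia, with no SIMD library involved; `merge_lanes` is a hypothetical helper that stands in for one lane of the vector accumulator.

```julia
# Hypothetical helper: fold the values seen by one UInt8 lane across several loads.
function merge_lanes(op, lanes::Vector{UInt8})
    acc = 0x00
    for v in lanes
        acc = op(acc, v)  # UInt8 op UInt8 stays UInt8, so + wraps modulo 256
    end
    return acc
end

# The same lane across four loads, after masking with 0x80:
# two of the four bytes were non-ASCII.
lanes = UInt8[0x80, 0x00, 0x80, 0x00]

@show merge_lanes(+, lanes)
@show merge_lanes(|, lanes)
```

With addition the two high bits cancel and the lane looks clean. With OR the high bit survives any number of merges.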
What's the status of the SIMD experiments you posted above? Do you think we should use one of them instead of the (much simpler) method from this PR? |
There was some discussion on the triage call of what would be required so that LLVM can figure out how to optimize this on its own. The main missing feature seems to be the ability to communicate to LLVM that it can assume that |
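One shape of loop that may be easier for LLVM to vectorize without intrinsics is to OR a fixed-size block of bytes together with no early exit inside the block, and to test the high bit once per block. The following is a sketch in plain Julia. `isascii_chunked` and the default block size of 64 are assumptions made here, not something proposed in the thread, and whether the inner loop is actually vectorized depends on the Julia and LLVM versions and on the target CPU; no timings are claimed.

```julia
# Hypothetical sketch: OR together a block of bytes with no branch inside the
# block, then check the high bit once per block.
function isascii_chunked(s::String, chunk::Int = 64)
    bytes = codeunits(s)
    n = length(bytes)
    i = 1
    while i + chunk - 1 <= n
        acc = 0x00
        @inbounds for j = i:(i + chunk - 1)
            acc |= bytes[j]
        end
        acc >= 0x80 && return false
        i += chunk
    end
    @inbounds while i <= n  # remaining bytes, one at a time
        bytes[i] >= 0x80 && return false
        i += 1
    end
    return true
end

@show isascii_chunked("a"^1000)
@show isascii_chunked("a"^1000 * "é")
```

The trade-off is that a non-ASCII byte at the start of a block is only noticed after the whole block has been read, in exchange for a branch-free inner loop.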