implement fast isascii for String #30400

Merged
merged 3 commits into from
Feb 14, 2019

Conversation

KristofferC
Member

julia> s = randstring(10^4);

julia> @btime isascii($s)
  10.364 μs (0 allocations: 0 bytes)
true

julia> @btime isascii2($s) # PR version
  4.448 μs (1 allocation: 16 bytes)
true

@KristofferC KristofferC added performance Must go faster strings "Strings!" labels Dec 15, 2018
@ararslan ararslan requested a review from stevengj December 16, 2018 23:11
@stevengj
Member

Why does btime show 1 allocation?

@KristofferC
Member Author

Probably an allocation from passing the closure that wraps s into another function. I can rewrite it as a loop to get rid of it, but I don't think it really affects performance.
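For readers following along, here is a minimal sketch of what a closure-based check could look like. The PR's original code is not shown in this thread, so the name `isascii_all` and its exact form are assumptions made for illustration only. The anonymous function captures `s`, which is the kind of closure the comment above refers to.

```julia
# Hypothetical reconstruction, not the PR's actual code.
# The anonymous function below closes over `s`.
isascii_all(s::String) = all(i -> codeunit(s, i) < 0x80, 1:ncodeunits(s))

# Sanity check against Base's definition on a few inputs.
for str in ("hello", "héllo", "")
    @assert isascii_all(str) == isascii(str)
end
```

Whether the closure actually allocates depends on the Julia version and on inlining decisions, so it is worth measuring rather than assuming.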

@thchr
Contributor

thchr commented Dec 17, 2018

The loop version does seem somewhat faster though, at least on v1.0.0:

function isascii_loop(s::String)
    for i = 1:sizeof(s)
        if @inbounds(codeunit(s, i)) >= 0x80; return false; end
    end
    return true
end

Has the timing

julia> @btime isascii_loop($s)
  4.857 μs (0 allocations: 0 bytes)
true

vs. the all(...) version's

julia> @btime isascii2($s)
  5.917 μs (1 allocation: 16 bytes)
true

@KristofferC
Member Author

I'll rewrite this with the loop then. Might as well squeeze out everything we can from such a simple function.

@vtjnash
Member

vtjnash commented Dec 20, 2018

For more fun on a rainy day:

julia> function isascii_faster(s::String)
           stride = 32
           l = sizeof(s)
           l2 = l - (l % (stride - 1))
           for i = 1:stride:l2
               n = ntuple(ii -> @inbounds(codeunit(s, i + ii)), Val(stride))
               foldl_fast(|, n...) > 0x80 && return false
           end
           for i = (l2 + 1):l
               @inbounds(codeunit(s, i += 1)) >= 0x80 && return false
           end
           return true
       end;
julia> @inline foldl_fast(op, a, rest...) = isempty(rest) ? a : op(a, foldl_fast1(op, rest...));
julia> @btime isascii(s)
  11.014 μs (0 allocations: 0 bytes)
true

julia> @btime isascii_PR(s)
  4.716 μs (1 allocation: 16 bytes)
true

julia> @btime isascii_faster(s)
  1.740 μs (0 allocations: 0 bytes)

Note that the compiler breaks down at N=33 and stops optimizing for us, so I can't test a larger stride. Also, we get almost all of the benefit with a stride of 3 (1.902 μs), and would get better results on small strings, but that's not the point :P

@smldis

smldis commented Dec 21, 2018

Just a note that in Jameson's code

@inline foldl_fast(op, a, rest...) = isempty(rest) ? a : op(a, foldl_fast1(op, rest...));

foldl_fast1 is a typo; use foldl_fast. I can confirm the benefit on my Kaby Lake laptop with Julia 1.0.0.
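Beyond that typo, a close reading of the strided snippet suggests a few more details to double-check: `codeunit(s, i + ii)` starts one byte late, `l % (stride - 1)` does not round down to a multiple of the stride, `> 0x80` lets a lone `0x80` byte through, and `i += 1` in the tail loop shifts the index. The following is a hedged sketch of the same idea with those points adjusted. The names `isascii_strided` and `fold_or` are hypothetical, and this variant has not been benchmarked, so the timings quoted in the thread do not apply to it.

```julia
# Hypothetical helper: OR all arguments together via recursion on varargs.
@inline fold_or(a) = a
@inline fold_or(a, rest...) = a | fold_or(rest...)

function isascii_strided(s::String)
    stride = 32                      # must match the Val(32) below
    l = sizeof(s)
    l2 = l - (l % stride)            # largest multiple of `stride` not exceeding l
    for i = 1:stride:l2
        # gather bytes i, i+1, ..., i+stride-1 and OR them together
        n = ntuple(ii -> @inbounds(codeunit(s, i + ii - 1)), Val(32))
        fold_or(n...) >= 0x80 && return false
    end
    for i = (l2 + 1):l               # remaining bytes, one at a time
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end
```

A quick comparison against `isascii` on strings with a non-ASCII byte at the very start, at a block boundary, and in the tail is a cheap way to catch off-by-one mistakes in this kind of loop.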

@stevengj
Member

stevengj commented Dec 21, 2018

Wouldn't it be even faster to load 8 bytes at a time into a w::UInt64 and check iszero(w & 0x8080808080808080)?

@stevengj
Member

stevengj commented Dec 21, 2018

In particular, this is 3–4× faster than isascii_faster on my machine:

function isascii_word(s::String)
    len = sizeof(s)
    nwords = len >> @static(Sys.WORD_SIZE == 64 ? 3 : 2)
    p = Ptr{UInt}(pointer(s))
    GC.@preserve s for i = 1:nwords
        iszero(unsafe_load(p, i) & (0x8080808080808080 % UInt)) || return false
    end
    for i = nwords*sizeof(UInt)+1:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end

@stevengj
Member

stevengj commented Dec 21, 2018

I've often thought that we should use this kind of technique to optimize other String operations for the common case of mostly-ASCII strings: load in 8 bytes at a time (on a 64-bit CPU) and switch to a fast path if the bytes are ASCII.

@stevengj
Member

stevengj commented Dec 21, 2018

For example, here is a length function that reads in sizeof(UInt) bytes at a time, and seems to be 2–10x faster than our current length(s::String) for both ASCII and non-ASCII strings, assuming valid UTF-8 data:

function length_fast(s::String) # assumes isvalid(s)
    len = sizeof(s)
    nchars = 0
    nwords = len >> @static(Sys.WORD_SIZE == 64 ? 3 : 2)
    p = Ptr{UInt}(pointer(s))
    GC.@preserve s for i = 1:nwords
        word = unsafe_load(p, i)
        mask = word & (0x8080808080808080 % UInt)
        nchars += sizeof(UInt)
        if !iszero(mask)
            nchars += count_ones(word & (mask >> 1)) - count_ones(mask)
        end
    end
    for i = nwords*sizeof(UInt)+1:len
        byte = codeunit(s, i)
        nchars += iszero(byte & 0x80) | !iszero(byte & 0x40)
    end
    return nchars
end
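The popcount arithmetic in the word loop can be checked by hand on a single 8-byte word. The sketch below is only an illustration: the sample string is an arbitrary choice, and it uses `reinterpret` on a byte vector instead of raw pointers. In valid UTF-8 every character has exactly one non-continuation byte, so the character count is the number of ASCII bytes plus the number of lead bytes (`11xxxxxx`).

```julia
s = "aé€bc"                              # 1 + 2 + 3 + 1 + 1 = 8 bytes
bytes = collect(codeunits(s))            # Vector{UInt8} of length 8
word = reinterpret(UInt64, bytes)[1]     # the same 8 bytes as one word

mask = word & 0x8080808080808080         # high bit of every non-ASCII byte
nonascii = count_ones(mask)              # lead bytes plus continuation bytes
leads = count_ones(word & (mask >> 1))   # bytes with both top bits set

# 8 bytes, minus the non-ASCII ones, plus the lead bytes added back
nchars = 8 + leads - nonascii
@assert nchars == length(s)
```

Byte order does not matter here because only bit counts are used.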

@vtjnash
Member

vtjnash commented Dec 21, 2018

Yes, I was actually hoping to convince it to do 32-bytes at a time (with AVX), but I guess it couldn't match it to the appropriate SIMD kernel.

@stevengj
Member

Note that the fast code should also apply to SubString{String}.

We might want to loop over a few bytes in the beginning if the pointer is unaligned.

@stevengj
Member

Here's a version of my isascii code above which handles substrings and misaligned pointers:

function isascii_fast(s::Union{String,SubString{String}})
    len = sizeof(s)
    p = Ptr{UInt}(pointer(s))
    # bytes to check one at a time before p reaches a word boundary
    misalignment = (UInt(p) & (sizeof(UInt) - 1)) % Int
    if misalignment > 0
        # never step past the end of a short string
        misalignment = min(sizeof(UInt) - misalignment, len)
        for i = 1:misalignment
            @inbounds(codeunit(s, i)) >= 0x80 && return false
        end
        p += misalignment
    end
    # count whole words only in what remains after the unaligned prefix
    nwords = (len - misalignment) >> @static(Sys.WORD_SIZE == 64 ? 3 : 2)
    GC.@preserve s for i = 1:nwords
        iszero(unsafe_load(p, i) & (0x8080808080808080 % UInt)) || return false
    end
    for i = nwords*sizeof(UInt)+1+misalignment:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end

I agree that we could potentially do even better with AVX instructions, but it doesn't seem worth the trouble to try to get LLVM to emit these here.

@KristofferC
Member Author

KristofferC commented Dec 21, 2018

Yeah, I've been experimenting with unrolling / reinterpreting but not gaining SIMD makes it a bit sad. Using something like https://github.com/KristofferC/SIMDIntrinsics.jl we can of course just write the SIMD manually (but probably need to deal with alignment) (edit: pasted wrong code below before):

using SIMDIntrinsics.LLVM
function isascii_simd(s::String)
    len = sizeof(s)
    nwords = len >> 5
    _0x80 = LLVM.constantvector(0x80, LLVM.LVec{32, UInt8})
    p = pointer(s)
    i = 0
    GC.@preserve s for _ in 1:nwords
        v = LLVM.load(LLVM.LVec{32, UInt8}, p + i)
        comp = LLVM.and(v, _0x80)
        # Want the next 3 expressions to turn into the `vptest` assembly instruction, this is one way of doing so, might exist more convenient ones.
        u = LLVM.bitcast(LLVM.LVec{4, UInt64}, comp)
        u1, u2, u3, u4 = LLVM.extractelement(u, 0), LLVM.extractelement(u, 1), 
                         LLVM.extractelement(u, 2), LLVM.extractelement(u, 3)
        iszero(u1 | u2 | u3 | u4) || return false
        i += 32
    end
    for i = nwords*32+1:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end

julia> @btime isascii($s); # generic on master
  10.370 μs (0 allocations: 0 bytes)

julia> @btime isascii_PR($s); # 1 byte per iteration (PR)
  4.877 μs (0 allocations: 0 bytes)

julia> @btime isascii_word($s); # 8 bytes per iteration by stevengj
  619.540 ns (0 allocations: 0 bytes)

julia> @btime isascii_simd($s); # 32 bytes per iteration
  218.646 ns (0 allocations: 0 bytes)

@KristofferC
Member Author

KristofferC commented Dec 22, 2018

For higher throughput we could unroll by 4:

using SIMDIntrinsics.LLVM
function isascii_simd(s::String)
    len = sizeof(s)
    nwords = len >> 7
    _0x80 = LLVM.constantvector(0x80, LLVM.LVec{32, UInt8})
    p = pointer(s)

    i = 0
    GC.@preserve s for _ in 1:nwords
        comp = LLVM.constantvector(0x00, LLVM.LVec{32, UInt8})
        for _ in 1:4
            v = LLVM.load(LLVM.LVec{32, UInt8}, p + i)
            comp_i = LLVM.and(v, _0x80)
            comp = LLVM.add(comp, comp_i)
            i += 32
        end
        u = LLVM.bitcast(LLVM.LVec{4, UInt64}, comp)
        u1, u2, u3, u4 = LLVM.extractelement(u, 0), LLVM.extractelement(u, 1), 
                         LLVM.extractelement(u, 2), LLVM.extractelement(u, 3)
        iszero(u1 | u2 | u3 | u4) || return false
    end
    for i = nwords*32*4+1:len
        @inbounds(codeunit(s, i)) >= 0x80 && return false
    end
    return true
end

julia> @btime isascii_simd($s)
  121.291 ns (0 allocations: 0 bytes)

The SIMD loop turns into something like:

        vpand   -96(%edx), %ymm0, %ymm1
        vpand   -64(%edx), %ymm0, %ymm2
        vpaddb  %ymm1, %ymm2, %ymm1
        vpand   -32(%edx), %ymm0, %ymm2
        vpand   (%edx), %ymm0, %ymm3
        vpaddb  %ymm3, %ymm2, %ymm2
        vpaddb  %ymm2, %ymm1, %ymm1
        vptest  %ymm1, %ymm1
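One detail worth checking in the unrolled variant, offered as an observation and not something raised in the thread: after masking with `0x80`, each byte lane is either `0x00` or `0x80`, and byte-wise addition (`vpaddb`) wraps around. If two of the four vectors carry a high bit in the same lane, the sum in that lane is `0x00` again and the final test would not see it. The plain-Julia illustration below uses `UInt8` scalars to stand in for one lane; the tuple `lanes` is a made-up example. An OR-style accumulation avoids the problem, assuming the intrinsics package exposes one.

```julia
# One byte lane taken from four consecutive vectors, already masked with 0x80.
lanes = (0x80, 0x00, 0x80, 0x00)

# Accumulating with wrapping UInt8 addition: 0x80 + 0x80 overflows to 0x00.
acc_add = lanes[1] + lanes[2] + lanes[3] + lanes[4]

# Accumulating with bitwise OR keeps the high bit.
acc_or = lanes[1] | lanes[2] | lanes[3] | lanes[4]

@assert iszero(acc_add)      # the non-ASCII bytes would go unnoticed
@assert acc_or == 0x80       # the non-ASCII bytes are still visible
```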

@StefanKarpinski StefanKarpinski added backport 1.1 triage This should be discussed on a triage call and removed backport 1.1 labels Jan 31, 2019
@JeffBezanson JeffBezanson removed backport 1.1 triage This should be discussed on a triage call labels Feb 14, 2019
@JeffBezanson JeffBezanson merged commit d8e4ce4 into master Feb 14, 2019
@JeffBezanson JeffBezanson deleted the KristofferC-patch-8 branch February 14, 2019 20:45
@nalimilan
Member

What's the status of the SIMD experiments you posted above? Do you think we should use one of them instead of the (much simpler) method from this PR?

@StefanKarpinski
Member

There was some discussion on the triage call of what would be required so that LLVM can figure out how to optimize this on its own. The main missing feature seems to be the ability to communicate to LLVM that it can assume that codeunit(s, i) is a valid memory access and thus side-effect free.
