Skip to content

Improve allunique performance for strings, avoid computing length several times for iterators #50644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

bvdmitri
Copy link
Contributor

This PR fixes #50360 .

master:

julia> VERSION
v"1.11.0-DEV.142"

julia> using Random

julia> using BenchmarkTools

julia> s=randstring(100000);

julia> @btime allunique($s)
  171.250 μs (4 allocations: 336 bytes)
false

julia> @btime allunique($s[1:100]) && allunique($s)
  258.552 ns (5 allocations: 456 bytes)
false

This PR:

julia> VERSION
v"1.11.0-DEV.142"

julia> using Random

julia> using BenchmarkTools

julia> s=randstring(100000);

julia> @btime allunique($s)
  72.362 ns (4 allocations: 336 bytes)
false

julia> @btime allunique($s[1:100]) && allunique($s)
  91.118 ns (5 allocations: 456 bytes)
false

@Jay-sanjay
Copy link
Contributor

ok I guess then the issue is resolved

@bvdmitri
Copy link
Contributor Author

@oscardssmith There is failing test on CI:

Error in testset abstractarray:
--
  | Error During Test at /cache/build/default-amdci4-6/julialang/julia-master/julia-a96fdcc811/share/julia/test/abstractarray.jl:762
  | Test threw exception
  | Expression: #= /cache/build/default-amdci4-6/julialang/julia-master/julia-a96fdcc811/share/julia/test/abstractarray.jl:762 =# @inferred(cat3v(As)) == zeros(2, 2, 2)
  | return type Array{Float64, 3} does not match inferred return type Array{_A, 3} where _A

I'm not sure how is it related to the changes made in this PR. I tried to execute the test locally and it passed just in REPL.

julia> using Test

julia> As = [zeros(2, 2) for _ = 1:2]
2-element Vector{Matrix{Float64}}:
 [0.0 0.0; 0.0 0.0]
 [0.0 0.0; 0.0 0.0]

julia> @test @inferred(cat(As...; dims=Val(3))) == zeros(2, 2, 2)
Test Passed

julia> cat3v(As) = cat(As...; dims=Val(3))
cat3v (generic function with 1 method)

julia> @test @inferred(cat3v(As)) == zeros(2, 2, 2)
Test Passed

julia> @test @inferred(cat(As...; dims=Val((1,2)))) == zeros(4, 4)
Test Passed

I couldn't run the whole testset locally with make testall due to some problems with the fetching some data from the internet?

There is also a whitecheck failing, says there is a trailing whitespace. Is there a command to automatically fix all code-style related problems?

allunique(A::StridedArray) = length(A) < 32 ? _indexed_allunique(A) : _hashed_allunique(A)
function allunique(S::String)
nunits = ncodeunits(S) # computing `length` for a string might be expensive
nunits < 32 ? _indexed_allunique(S, nunits) : _hashed_allunique(S, nunits)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance this looks dangerous, but _indexed_allunique doesn't actually index. I wonder if it should be called _iterate_allunique or _quadratic_allunique or something?

Comment on lines +561 to +562
# strings
@test !allunique("aaaa")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At present we have:

julia> allunique("óá")
true

julia> allunique("óá")
false

julia> collect("óá")
2-element Vector{Char}:
 'ó': Unicode U+00F3 (category Ll: Letter, lowercase)
 'á': Unicode U+00E1 (category Ll: Letter, lowercase)

julia> collect("óá")
4-element Vector{Char}:
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

What answers do we want? These ones match unique.

If they are right, they should probably be tested.

(I know this isn't what this PR wanted to do, but it does add a String method.)

@bvdmitri
Copy link
Contributor Author

bvdmitri commented Aug 1, 2023

ping @mcabbott

@LilithHafner
Copy link
Member

 return type Array{Float64, 3} does not match inferred return type Array{_A, 3} where _A

This is not your fault

There is also a whitecheck failing, says there is a trailing whitespace. Is there a command to automatically fix all code-style related problems?

Not that I know of, but the check should also list the line number.

@bvdmitri
Copy link
Contributor Author

bvdmitri commented Aug 7, 2023

What is the current status of this PR? I think the interesting example brought by @mcabbott with collect("óá") should really be addressed by a different PR (if that is an issue). Current PR aims to improve the general performance of allunique and (presumably) does not alter any existing behaviour.

@LilithHafner
Copy link
Member

What is the current status of this PR?

My two biggest concerns are

  • CI is failing (you can click on "details" next to the failed tests to see what exactly failed). It's true that we have pretty unreliable CI and oftentimes failures are not the fault of the PR author. However, we still need to look into test failures every time to confirm that they are not related. In this case, I believe this PR introduces some new failures.
  • I am unaware of any code that calls allunique(::AbstractString) and can't think of any use case that warrants calling that method, and am hesitant to add code to optimize the performance of a method that I don't know is ever used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allunique(::String) performance anomaly
5 participants