Skip to content

bugfixes for rsplit (for unicode and empty strings) #26238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 1, 2018

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Feb 27, 2018

Fixes #26225 (rsplit failure for strings with multi-byte Unicode characters).

Also, I noticed that rsplit was broken in both 0.6 and 0.7 for empty splitter strings: split("abc", "") gives ["a","b","c"] (correct, I think — and there was a test for this so it was clearly the intended behavior), but rsplit("abc", "") on 0.6 incorrectly gives ["ab","c"]. In this PR, they both give ["a","b","c"], and the case with limit=2 is also tested.

The new rsplit implementation is both more correct (I think) and shorter.

I also noticed that both rsplit and split were incorrectly(?) using start(string) to get firstindex. Or is this always guaranteed to return 1?

Finally, I simplified the split/rsplit tests by replacing isequal with ==, since the two should be equivalent for strings. == is more readable, allowed me to use chained comparisons, and gives better diagnostic output for @test failures.

@stevengj stevengj added strings "Strings!" bugfix This change fixes an existing bug labels Feb 27, 2018
@stevengj
Copy link
Member Author

stevengj commented Feb 27, 2018

@StefanKarpinski, is firstindex(string) guaranteed to always be 1 for all AbstractString types? If so, the code could be simplified further (and the docs should be clarified). If not, the _split implementation (and possibly a lot of other Base string code) is wrong.

Update: From this comment, "We want indices to always be 1 through ncodeunits(s)." I will update the PR accordingly.

@stevengj
Copy link
Member Author

stevengj commented Mar 1, 2018

CI failures seem unrelated. Timeouts on FreeBSD and Win64, and rmdir: Permission denied on Win32.

Okay to merge?

@@ -275,9 +275,9 @@ split(str::T, splitter::Char;
_split(str, equalto(splitter), limit, keep, T <: SubString ? T[] : SubString{T}[])

function _split(str::AbstractString, splitter, limit::Integer, keep_empty::Bool, strs::Array)
i = start(str)
@assert (i = firstindex(str)) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to assert this? Why do it here and not elsewhere? I think implementations which get this wrong will realize it quite soon even without assertions given that may things will fail.

Copy link
Member Author

@stevengj stevengj Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asserting it because the subsequent code implicitly relies on it. Basically, I think of this as a comment with some added value. Assertions are cheap so we shouldn't hesitate to add them, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i = firstindex(str)) though goes against the recently added documention that use of @assert might not work in the future (#25610)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, will fix.

Copy link
Member

@StefanKarpinski StefanKarpinski Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really since the argument here is to delete it which is exactly what the compiler would do. On the contrary, this is exactly the way that @assert should be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StefanKarpinski, the point is that you can't completely delete this code because it contains an assignment. Hence this is not correct usage of @assert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. That's true. I missed that bit.

@StefanKarpinski StefanKarpinski merged commit cde00cf into JuliaLang:master Mar 1, 2018
@stevengj stevengj deleted the rsplitfix branch March 1, 2018 17:40
stevengj added a commit that referenced this pull request Mar 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants