Skip to content

Limit custom character class ranges to single scalars #422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Handle case insensitivity properly in CCC ranges
The prior implementation didn't make a lot of sense, and couldn't
handle cases like `/(?i)[X-c]/`. This new approach uses simple case
matching to test if the character is within the range, then tests if
the uppercase or lowercase mappings are within the range.

Fixes #395
  • Loading branch information
natecook1000 committed May 19, 2022
commit b716d50c504fb1c37888db230584afee1f9f0180
62 changes: 28 additions & 34 deletions Sources/_StringProcessing/ConsumerInterface.swift
Original file line number Diff line number Diff line change
Expand Up @@ -276,54 +276,48 @@ extension DSLTree.CustomCharacterClass.Member {
}
return c
case let .range(low, high):
// TODO:
guard let lhs = low.literalCharacterValue, lhs.hasExactlyOneScalar else {
throw Unsupported("\(low) in range")
}
guard let rhs = high.literalCharacterValue, rhs.hasExactlyOneScalar else {
throw Unsupported("\(high) in range")
}
guard lhs <= rhs else {
throw Unsupported("Invalid range \(low)-\(high)")
}

let isCaseInsensitive = opts.isCaseInsensitive
let isCharacterSemantic = opts.semanticLevel == .graphemeCluster

if opts.isCaseInsensitive {
let lhsLower = lhs.lowercased()
let rhsLower = rhs.lowercased()
guard lhsLower <= rhsLower else { throw Unsupported("Invalid range \(lhs)-\(rhs)") }
return { input, bounds in
// TODO: check for out of bounds?
let curIdx = bounds.lowerBound
if isCharacterSemantic {
guard input[curIdx].hasExactlyOneScalar else { return nil }
if (lhsLower...rhsLower).contains(input[curIdx].lowercased()) {
return input.index(after: curIdx)
}
} else {
if (lhsLower...rhsLower).contains(input.unicodeScalars[curIdx].properties.lowercaseMapping) {
return input.unicodeScalars.index(after: curIdx)
}
}
return { input, bounds in
// TODO: check for out of bounds?
let curIdx = bounds.lowerBound
let nextIndex = isCharacterSemantic
? input.index(after: curIdx)
: input.unicodeScalars.index(after: curIdx)
Comment on lines +295 to +297
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really feel like this should be a helper function or fully refactored. It seems ripe for bugs if vigilance is required to remember to support scalar semantics

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it a helper function, but the real issue is that the two views share the same index type, and we can't enforce that you call the helper function. Maybe we could look at designing a wrapper type that would distinguish between the two? Seems like it add a lot of friction…

struct ViewLockedIndex<View> {
    var index: String.Index
}

extension String {
    func index(after i: ViewLockedIndex<String>) -> ViewLockedIndex<String> {
        .init(index: index(after: i.index))
    }

    func index(after i: ViewLockedIndex<String.UnicodeScalarView>) -> ViewLockedIndex<String.UnicodeScalarView>
        .init(index: self.unicodeScalars.index(after: i.index))
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is rather than have a bunch of inner-most ifs, what would the code look like to have an outer-if? That is a more bottoms-up refactoring

if isCharacterSemantic && !input[curIdx].hasExactlyOneScalar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do to non-NFC characters?

return nil
}
} else {
guard lhs <= rhs else { throw Unsupported("Invalid range \(lhs)-\(rhs)") }
return { input, bounds in
// TODO: check for out of bounds?
let curIdx = bounds.lowerBound
if isCharacterSemantic {
guard input[curIdx].hasExactlyOneScalar else { return nil }
if (lhs...rhs).contains(input[curIdx]) {
return input.index(after: curIdx)
}
} else {
if (lhs...rhs).contains(Character(input.unicodeScalars[curIdx])) {
return input.unicodeScalars.index(after: curIdx)
}
}
let scalar = input.unicodeScalars[curIdx]
let scalarRange = lhs.unicodeScalars.first! ... rhs.unicodeScalars.first!
if scalarRange.contains(scalar) {
return nextIndex
}
if !isCaseInsensitive {
return nil
}

let stringRange = String(lhs)...String(rhs)
if (scalar.properties.changesWhenLowercased
&& stringRange.contains(scalar.properties.lowercaseMapping))
|| (scalar.properties.changesWhenUppercased
&& stringRange.contains(scalar.properties.uppercaseMapping)) {
return nextIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going back to lexicographical contains?

}

return nil
}

case let .custom(ccc):
return try ccc.generateConsumer(opts)

Expand Down
27 changes: 27 additions & 0 deletions Tests/RegexTests/MatchTests.swift
Original file line number Diff line number Diff line change
Expand Up @@ -731,6 +731,33 @@ extension RegexTests {
firstMatchTest(#"["abc"]+"#, input: #""abc""#, match: "abc",
syntax: .experimental)
firstMatchTest(#"["abc"]+"#, input: #""abc""#, match: #""abc""#)

// Case sensitivity and ranges.
for ch in "abcD" {
firstMatchTest("[a-cD]", input: String(ch), match: String(ch))
}
for ch in "ABCd" {
firstMatchTest("[a-cD]", input: String(ch), match: nil)
}

for ch in "abcABCdD" {
firstMatchTest("(?i)[a-cd]", input: String(ch), match: String(ch))
firstMatchTest("(?i)[A-CD]", input: String(ch), match: String(ch))
firstMatchTest("(?iu)[a-cd]", input: String(ch), match: String(ch))
firstMatchTest("(?iu)[A-CD]", input: String(ch), match: String(ch))
}

for ch in "XYZ[\\]^_`abcd" {
firstMatchTest("[X-cd]", input: String(ch), match: String(ch))
firstMatchTest("[X-cd]", input: String(ch), match: String(ch))
firstMatchTest("(?u)[X-cd]", input: String(ch), match: String(ch))
firstMatchTest("(?u)[X-cd]", input: String(ch), match: String(ch))
}

for ch in "XYZ[\\]^_`abcxyzABCdD" {
firstMatchTest("(?i)[X-cd]", input: String(ch), match: String(ch))
firstMatchTest("(?iu)[X-cD]", input: String(ch), match: String(ch))
}
}

func testCharacterProperties() {
Expand Down