Skip to content

[swift/main] Merge performance improvements into swift/main #705

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 50 commits into from
Dec 15, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
0b38ca9
Atomically load the lowered program (#610)
natecook1000 Oct 6, 2022
335a0c2
Add tests for line start/end word boundary diffs (#616)
natecook1000 Dec 2, 2022
54ff516
Add tweaks for Android
finagolfin Dec 5, 2022
eb7f801
Fix documentation typo (#615)
ole Dec 6, 2022
c51e8f2
Fix abstract for Regex.dotMatchesNewlines(_:). (#614)
amartini51 Dec 6, 2022
45f752a
Remove `RegexConsumer` and fix its dependencies (#617)
natecook1000 Dec 14, 2022
ed95066
Improve StringProcessing and RegexBuilder documentation (#611)
natecook1000 Dec 14, 2022
c34cea5
Set availability for inverted character class test (#621)
natecook1000 Dec 16, 2022
3ca8b13
Merge pull request #618 from buttaface/droid
Azoy Dec 18, 2022
3a3dc7a
Add type annotations in RegexBuilder tests
natecook1000 Feb 1, 2023
6c4f291
Workaround for fileprivate array issue
natecook1000 Feb 1, 2023
7e059b7
Merge pull request #628 from apple/result_builder_changes_workaround
natecook1000 Feb 1, 2023
6a4077f
Fix an issue where named character classes weren't getting converted …
DaveEwing Feb 1, 2023
8184fc0
Merge pull request #629 from apple/dewing/CharacterClassDSLConversion
DaveEwing Feb 2, 2023
2a78475
Stop at end of search string in TwoWaySearcher (#631)
natecook1000 Feb 8, 2023
d5a6cec
Correct misspelling in DSL renderer (#627)
natecook1000 Feb 8, 2023
7756942
Fix output type mismatch with RegexBuilder (#626)
natecook1000 Feb 9, 2023
070e0ec
Revert "Merge pull request #628 from apple/result_builder_changes_wor…
natecook1000 Feb 15, 2023
1358fc0
Use `some` syntax in variadics
natecook1000 Feb 15, 2023
083d32a
Type checker workaround: adjust test
milseman Apr 2, 2023
ca92db7
Further refactor to work around type checker regression
milseman Apr 3, 2023
336f9c5
Merge pull request #643 from milseman/typechecker_workaround
milseman Apr 3, 2023
852b890
Align availability macro with OS versions (#641)
milseman Apr 4, 2023
236b47c
Speed up general character class matching (#642)
milseman Apr 4, 2023
348e6c3
Test for \s matching CRLF when scalar matching (#648)
natecook1000 Apr 4, 2023
a7ba701
General ascii fast paths for character classes (#644)
milseman Apr 4, 2023
e01e43d
Remove the unsupported `anyScalar` case (#650)
natecook1000 Apr 4, 2023
e0352a2
Fix range-based quantification fast path (#653)
natecook1000 Apr 11, 2023
923cf5e
Add in ASCII fast-path for anyNonNewline (#654)
milseman Apr 11, 2023
9ea9936
Avoid long expression type checks (#657)
natecook1000 Apr 11, 2023
58626cc
Processor cleanup (#655)
milseman Apr 14, 2023
4418183
Fix `firstRange(of:)` search (#656)
natecook1000 Apr 14, 2023
57b343d
Bug fix and hot path for quantified `.` (#658)
milseman Apr 18, 2023
6695027
Run scalar-semantic benchmark variants (#659)
milseman Apr 18, 2023
8eafd55
Refactor operations to be on String (#664)
milseman Apr 19, 2023
0354667
Provide unique generic method parameter names (#669)
natecook1000 May 16, 2023
98d5ddc
Enable quantification optimizations for scalar semantics (#671)
milseman May 22, 2023
b4b4315
Fix doc comment for trimPrefix and trimmingPrefix funcs (#673)
valeriyvan May 23, 2023
d61ba4c
Update availability for the 5.8 release (#680)
natecook1000 Jul 6, 2023
4c9b28e
Optimize search for start-anchored regexes (#682)
natecook1000 Jul 19, 2023
ba6e49d
Fix misuse of `XCTSkip()` (#685)
grynspan Jul 20, 2023
f5b0b5e
Handle boundaries when matching in substrings (#675)
natecook1000 Aug 11, 2023
185ebd6
Overhaul quantification fast-path (#689)
milseman Aug 21, 2023
d45027b
adopt the stdlib’s pattern for atomic lazy references
glessard Aug 31, 2023
74637cc
pass a pointer instead of inout conversion
glessard Aug 31, 2023
bc70423
Update Sources/_StringProcessing/Regex/Core.swift
glessard Aug 31, 2023
45fd8ec
Merge pull request #691 from glessard/pointer-conversion-restriction
glessard Sep 1, 2023
69f406c
Adds SPI for a NSRE compatibility mode option (#698)
natecook1000 Dec 1, 2023
4e742b4
Add ASCII fast-path ASCII character class matching (#690)
milseman Dec 11, 2023
4760abe
Merge remote-tracking branch 'origin/main' into swift/main
milseman Dec 11, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Handle boundaries when matching in substrings (#675)
* Handle boundaries when matching in substrings

Some of our existing matching routines use the start/endIndex
of the input, which is basically never the right thing to do.

This change revises those checks to use the search bounds, by
either moving the boundary check out of the matching method, or
if the boundary is a part of what needs to be matched (e.g.
word boundaries have different behavior at the start/end than
in the middle of a string) the search bounds are passed into
the matching method.

Testing is currently handled by piggy-backing on the existing
match tests; we should add more tests to handle substring-
specific edge cases.

* Handle sub-character substring boundaries

This change passes the end boundary down into matching methods, and
uses it to find the actual character that is part of the input
substring, even if the substring's end boundary is in the middle of
a grapheme cluster.

Substrings cannot have sub-Unicode scalar boundaries as of Swift
5.7; we can remove a check for this when matching an individual
scalar.
  • Loading branch information
natecook1000 authored Aug 11, 2023
commit f5b0b5ee7654be5e7383b3e3bcc1daa5557751a9
16 changes: 9 additions & 7 deletions Sources/_StringProcessing/ConsumerInterface.swift
Original file line number Diff line number Diff line change
Expand Up @@ -68,14 +68,16 @@ extension Character {
switch opts.semanticLevel {
case .graphemeCluster:
return { input, bounds in
let low = bounds.lowerBound
guard let (char, next) = input.characterAndEnd(
at: bounds.lowerBound, limitedBy: bounds.upperBound)
else { return nil }
if isCaseInsensitive && isCased {
return input[low].lowercased() == lowercased()
? input.index(after: low)
return char.lowercased() == self.lowercased()
? next
: nil
} else {
return input[low] == self
? input.index(after: low)
return char == self
? next
: nil
}
}
Expand Down Expand Up @@ -188,7 +190,7 @@ extension DSLTree.Atom.CharacterClass {
func generateConsumer(_ opts: MatchingOptions) -> MEProgram.ConsumeFunction {
let model = asRuntimeModel(opts)
return { input, bounds in
model.matches(in: input, at: bounds.lowerBound)
model.matches(in: input, at: bounds.lowerBound, limitedBy: bounds.upperBound)
}
}
}
Expand Down Expand Up @@ -517,7 +519,7 @@ extension DSLTree.CustomCharacterClass {
}
if isInverted {
return opts.semanticLevel == .graphemeCluster
? input.index(after: bounds.lowerBound)
? Swift.min(input.index(after: bounds.lowerBound), bounds.upperBound)
: input.unicodeScalars.index(after: bounds.lowerBound)
}
return nil
Expand Down
109 changes: 77 additions & 32 deletions Sources/_StringProcessing/Engine/MEBuiltins.swift
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,10 @@ extension Processor {
isStrictASCII: Bool,
isScalarSemantics: Bool
) -> Bool {
guard let next = input.matchBuiltinCC(
guard currentPosition < end, let next = input.matchBuiltinCC(
cc,
at: currentPosition,
limitedBy: end,
isInverted: isInverted,
isStrictASCII: isStrictASCII,
isScalarSemantics: isScalarSemantics
Expand Down Expand Up @@ -102,56 +103,96 @@ extension Processor {

case .wordBoundary:
if payload.usesSimpleUnicodeBoundaries {
// TODO: How should we handle bounds?
return atSimpleBoundary(payload.usesASCIIWord, payload.semanticLevel)
} else {
return input.isOnWordBoundary(at: currentPosition, using: &wordIndexCache, &wordIndexMaxIndex)
return input.isOnWordBoundary(at: currentPosition, in: searchBounds, using: &wordIndexCache, &wordIndexMaxIndex)
}

case .notWordBoundary:
if payload.usesSimpleUnicodeBoundaries {
// TODO: How should we handle bounds?
return !atSimpleBoundary(payload.usesASCIIWord, payload.semanticLevel)
} else {
return !input.isOnWordBoundary(at: currentPosition, using: &wordIndexCache, &wordIndexMaxIndex)
return !input.isOnWordBoundary(at: currentPosition, in: searchBounds, using: &wordIndexCache, &wordIndexMaxIndex)
}
}
}
}

// MARK: Matching `.`
extension String {
// TODO: Should the below have a `limitedBy` parameter?
/// Returns the character at `pos`, bounded by `end`, as well as the upper
/// boundary of the returned character.
///
/// This function handles loading a character from a string while respecting
/// an end boundary, even if that end boundary is sub-character or sub-scalar.
///
/// - If `pos` is at or past `end`, this function returns `nil`.
/// - If `end` is between `pos` and the next grapheme cluster boundary (i.e.,
/// `end` is before `self.index(after: pos)`, then the returned character
/// is smaller than the one that would be produced by `self[pos]` and the
/// returned index is at the end of that character.
/// - If `end` is between `pos` and the next grapheme cluster boundary, and
/// is not on a Unicode scalar boundary, the partial scalar is dropped. This
/// can result in a `nil` return or a character that includes only part of
/// the `self[pos]` character.
///
/// - Parameters:
/// - pos: The position to load a character from.
/// - end: The limit for the character at `pos`.
/// - Returns: The character at `pos`, bounded by `end`, if it exists, along
/// with the upper bound of that character. The upper bound is always
/// scalar-aligned.
func characterAndEnd(at pos: String.Index, limitedBy end: String.Index) -> (Character, String.Index)? {
// FIXME: Sink into the stdlib to avoid multiple boundary calculations
guard pos < end else { return nil }
let next = index(after: pos)
if next <= end {
return (self[pos], next)
}

// `end` must be a sub-character position that is between `pos` and the
// next grapheme boundary. This is okay if `end` is on a Unicode scalar
// boundary, but if it's in the middle of a scalar's code units, there
// may not be a character to return at all after rounding down. Use
// `Substring`'s rounding to determine what we can return.
let substr = self[pos..<end]
return substr.isEmpty
? nil
: (substr.first!, substr.endIndex)
}

func matchAnyNonNewline(
at currentPosition: String.Index,
limitedBy end: String.Index,
isScalarSemantics: Bool
) -> String.Index? {
guard currentPosition < endIndex else {
return nil
}
guard currentPosition < end else { return nil }
if case .definite(let result) = _quickMatchAnyNonNewline(
at: currentPosition,
limitedBy: end,
isScalarSemantics: isScalarSemantics
) {
assert(result == _thoroughMatchAnyNonNewline(
at: currentPosition,
limitedBy: end,
isScalarSemantics: isScalarSemantics))
return result
}
return _thoroughMatchAnyNonNewline(
at: currentPosition,
limitedBy: end,
isScalarSemantics: isScalarSemantics)
}

@inline(__always)
private func _quickMatchAnyNonNewline(
at currentPosition: String.Index,
limitedBy end: String.Index,
isScalarSemantics: Bool
) -> QuickResult<String.Index?> {
assert(currentPosition < endIndex)
assert(currentPosition < end)
guard let (asciiValue, next, isCRLF) = _quickASCIICharacter(
at: currentPosition
at: currentPosition, limitedBy: end
) else {
return .unknown
}
Expand All @@ -167,46 +208,47 @@ extension String {
@inline(never)
private func _thoroughMatchAnyNonNewline(
at currentPosition: String.Index,
limitedBy end: String.Index,
isScalarSemantics: Bool
) -> String.Index? {
assert(currentPosition < endIndex)
if isScalarSemantics {
guard currentPosition < end else { return nil }
let scalar = unicodeScalars[currentPosition]
guard !scalar.isNewline else { return nil }
return unicodeScalars.index(after: currentPosition)
}

let char = self[currentPosition]
guard !char.isNewline else { return nil }
return index(after: currentPosition)
guard let (char, next) = characterAndEnd(at: currentPosition, limitedBy: end),
!char.isNewline
else { return nil }
return next
}
}

// MARK: - Built-in character class matching
extension String {
// TODO: Should the below have a `limitedBy` parameter?

// Mentioned in ProgrammersManual.md, update docs if redesigned
func matchBuiltinCC(
_ cc: _CharacterClassModel.Representation,
at currentPosition: String.Index,
limitedBy end: String.Index,
isInverted: Bool,
isStrictASCII: Bool,
isScalarSemantics: Bool
) -> String.Index? {
guard currentPosition < endIndex else {
return nil
}
guard currentPosition < end else { return nil }
if case .definite(let result) = _quickMatchBuiltinCC(
cc,
at: currentPosition,
limitedBy: end,
isInverted: isInverted,
isStrictASCII: isStrictASCII,
isScalarSemantics: isScalarSemantics
) {
assert(result == _thoroughMatchBuiltinCC(
cc,
at: currentPosition,
limitedBy: end,
isInverted: isInverted,
isStrictASCII: isStrictASCII,
isScalarSemantics: isScalarSemantics))
Expand All @@ -215,6 +257,7 @@ extension String {
return _thoroughMatchBuiltinCC(
cc,
at: currentPosition,
limitedBy: end,
isInverted: isInverted,
isStrictASCII: isStrictASCII,
isScalarSemantics: isScalarSemantics)
Expand All @@ -225,13 +268,17 @@ extension String {
private func _quickMatchBuiltinCC(
_ cc: _CharacterClassModel.Representation,
at currentPosition: String.Index,
limitedBy end: String.Index,
isInverted: Bool,
isStrictASCII: Bool,
isScalarSemantics: Bool
) -> QuickResult<String.Index?> {
assert(currentPosition < endIndex)
assert(currentPosition < end)
guard let (next, result) = _quickMatch(
cc, at: currentPosition, isScalarSemantics: isScalarSemantics
cc,
at: currentPosition,
limitedBy: end,
isScalarSemantics: isScalarSemantics
) else {
return .unknown
}
Expand All @@ -243,27 +290,25 @@ extension String {
private func _thoroughMatchBuiltinCC(
_ cc: _CharacterClassModel.Representation,
at currentPosition: String.Index,
limitedBy end: String.Index,
isInverted: Bool,
isStrictASCII: Bool,
isScalarSemantics: Bool
) -> String.Index? {
assert(currentPosition < endIndex)
let char = self[currentPosition]
// TODO: Branch here on scalar semantics
// Don't want to pay character cost if unnecessary
guard var (char, next) =
characterAndEnd(at: currentPosition, limitedBy: end)
else { return nil }
let scalar = unicodeScalars[currentPosition]

let asciiCheck = !isStrictASCII
|| (scalar.isASCII && isScalarSemantics)
|| char.isASCII

var matched: Bool
var next: String.Index
switch (isScalarSemantics, cc) {
case (_, .anyGrapheme):
next = index(after: currentPosition)
case (true, _):
if isScalarSemantics && cc != .anyGrapheme {
next = unicodeScalars.index(after: currentPosition)
case (false, _):
next = index(after: currentPosition)
}

switch cc {
Expand Down Expand Up @@ -291,7 +336,7 @@ extension String {
if isScalarSemantics {
matched = scalar.isNewline && asciiCheck
if matched && scalar == "\r"
&& next != endIndex && unicodeScalars[next] == "\n" {
&& next < end && unicodeScalars[next] == "\n" {
// Match a full CR-LF sequence even in scalar semantics
unicodeScalars.formIndex(after: &next)
}
Expand Down
10 changes: 6 additions & 4 deletions Sources/_StringProcessing/Engine/MEQuantify.swift
Original file line number Diff line number Diff line change
Expand Up @@ -17,18 +17,18 @@ extension Processor {
boundaryCheck: !isScalarSemantics,
isCaseInsensitive: false)
case .builtin:
// FIXME: bounds check? endIndex or end?
guard currentPosition < end else { return nil }

// We only emit .quantify if it consumes a single character
return input.matchBuiltinCC(
payload.builtin,
at: currentPosition,
limitedBy: end,
isInverted: payload.builtinIsInverted,
isStrictASCII: payload.builtinIsStrict,
isScalarSemantics: isScalarSemantics)
case .any:
// FIXME: endIndex or end?
guard currentPosition < input.endIndex else { return nil }
guard currentPosition < end else { return nil }

if payload.anyMatchesNewline {
if isScalarSemantics {
Expand All @@ -38,7 +38,9 @@ extension Processor {
}

return input.matchAnyNonNewline(
at: currentPosition, isScalarSemantics: isScalarSemantics)
at: currentPosition,
limitedBy: end,
isScalarSemantics: isScalarSemantics)
}
}

Expand Down
Loading