Use a bitset for ascii-only character classes #511

Merged
merged 35 commits into from
Jun 29, 2022
5fd8840
[benchmark] Add no-capture version of grapheme breaking exercise
milseman Jun 19, 2022
03fe8d6
[benchmark] Add cross-engine benchmark helpers
milseman Jun 19, 2022
5667705
[benchmark] Hangul Syllable finding benchmark
milseman Jun 19, 2022
bde259b
Add debug mode
rctcwyvrn Jun 20, 2022
bf95e81
Fix typo in css regex
rctcwyvrn Jun 20, 2022
243ec7b
Add HTML benchmark
rctcwyvrn Jun 20, 2022
eeb0852
Add email regex benchmarks
rctcwyvrn Jun 20, 2022
49efd67
Add save/compare functionality to the benchmarker
rctcwyvrn Jun 20, 2022
b3a61a7
Clean up compare and add cli flags
rctcwyvrn Jun 20, 2022
926d208
Merge branch 'main' into more_more_benchmarks
milseman Jun 21, 2022
752ea76
Make fixes
rctcwyvrn Jun 21, 2022
7327e74
Merge branch 'more_more_benchmarks' of github.com:rctcwyvrn/swift-exp…
rctcwyvrn Jun 21, 2022
7a900b6
oops, remove some leftover code
rctcwyvrn Jun 21, 2022
50e8e6d
Fix linux build issue + add cli option for specifying compare file
rctcwyvrn Jun 21, 2022
3c7f62c
First ver of bitset character classes
rctcwyvrn Jun 22, 2022
b71b177
Did a dumb and didn't use the new api I had added...
rctcwyvrn Jun 22, 2022
e2a011c
Fix bug in inverted character sets
rctcwyvrn Jun 22, 2022
f7900e5
Remove nested chararcter class cases
rctcwyvrn Jun 22, 2022
e9d1902
Remove comment
rctcwyvrn Jun 22, 2022
cf59091
Merge branch 'main' into many-closures-vs-one-bitset-boi
rctcwyvrn Jun 22, 2022
f4019d4
Cleanup handling of isInverted
rctcwyvrn Jun 23, 2022
ed82cb0
Cleanup
rctcwyvrn Jun 23, 2022
cc1ac9d
Remove isCaseInsensitive property
rctcwyvrn Jun 23, 2022
ccf6ade
Add tests for special cases
rctcwyvrn Jun 23, 2022
7b83e0c
Use switch on ranges instead of if
rctcwyvrn Jun 24, 2022
5121076
Rename asciivalue to singleScalarAsciiValue
rctcwyvrn Jun 27, 2022
3607b65
Properly handle unicode scalars mode in custom character classes
rctcwyvrn Jun 27, 2022
291a974
I most definitely did not forget to commit the tests
rctcwyvrn Jun 27, 2022
ddcf40f
Cleanup
rctcwyvrn Jun 27, 2022
f87b325
Add support for testing if compilation contains certain opcodes
rctcwyvrn Jun 27, 2022
2d8ac2d
Forgot the tests again, twice in one day...
rctcwyvrn Jun 27, 2022
fd66693
Spelling mistakes
rctcwyvrn Jun 27, 2022
22c8213
Make expectProgram take sets of opcodes
rctcwyvrn Jun 27, 2022
0781b93
Add compiler options + validation testing against unoptimized regexes
rctcwyvrn Jun 28, 2022
ffff944
Cleanup, clear cache of Regex.Program when setting new compile options
rctcwyvrn Jun 28, 2022
Cleanup
rctcwyvrn committed Jun 23, 2022
commit ed82cb00a3e74b3f0fb89edf9f86b4bda0679773
3 changes: 1 addition & 2 deletions Sources/_StringProcessing/ByteCodeGen.swift
@@ -643,8 +643,7 @@ fileprivate extension Compiler.ByteCodeGen {
   mutating func emitCustomCharacterClass(
     _ ccc: DSLTree.CustomCharacterClass
   ) throws {
-    if ccc.isAscii() {
-      let asciiBitset = ccc.asAsciiBitset(options)
+    if let asciiBitset = ccc.asAsciiBitset(options) {
       builder.buildMatchAsciiBitset(asciiBitset)
     } else {
       let consumer = try ccc.generateConsumer(options)
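The hunk above replaces a separate `isAscii()` check followed by a conversion with a single optional-returning call, so the fast path and the fallback are chosen by one `if let`. A minimal sketch of that pattern follows. The names `Emitted`, `asciiValues(of:)`, and `emit(_:)` are hypothetical stand-ins for illustration and are not part of `_StringProcessing`.

```swift
// Hypothetical stand-ins for illustration; not the real compiler types.
enum Emitted: Equatable {
  case matchBitset([UInt8])  // fast path: every member has an ASCII value
  case consumer              // fallback: general consumer-based matching
}

/// Returns the ASCII values of `chars`, or nil as soon as one character
/// has no ASCII value. The check and the conversion happen in one step,
/// so there is no force-unwrap that relies on an earlier check.
func asciiValues(of chars: [Character]) -> [UInt8]? {
  var values: [UInt8] = []
  for c in chars {
    guard let v = c.asciiValue else { return nil }
    values.append(v)
  }
  return values
}

/// Picks the fast path only when the optional conversion succeeds.
func emit(_ chars: [Character]) -> Emitted {
  if let values = asciiValues(of: chars) {
    return .matchBitset(values)
  } else {
    return .consumer
  }
}
```

With this shape, a caller cannot reach the conversion without passing the check, which is the property the removed `fatalError("Should have been checked by isAscii first")` branches were guarding by hand.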
112 changes: 39 additions & 73 deletions Sources/_StringProcessing/ConsumerInterface.swift
@@ -52,39 +52,22 @@ extension DSLTree.Node {
 }

 extension DSLTree._AST.Atom {
-  func isAscii() -> Bool {
-    return ast.isAscii()
-  }
-
-  func toAsciiValue() -> UInt8 {
-    return ast.toAsciiValue()
+  var asciiValue: UInt8? {
+    return ast.asciiValue
   }
 }

 extension DSLTree.Atom {
-  func isAscii() -> Bool {
+  var asciiValue: UInt8? {
     switch self {
-    case let .char(c):
-      return c.isASCII
-    case let .scalar(s):
-      return s.isASCII
-    case let .unconverted(atom):
-      return atom.isAscii()
-    default:
-      return false
-    }
-  }
-
-  func toAsciiValue() -> UInt8 {
-    switch self {
-    case let .char(c):
-      return c.asciiValue!
-    case let .scalar(s):
+    case let .char(c) where c != "\r\n":
+      return c.asciiValue
+    case let .scalar(s) where s.isASCII:
       return UInt8(ascii: s)
     case let .unconverted(atom):
-      return atom.toAsciiValue()
+      return atom.asciiValue
     default:
-      fatalError("Should have been checked by isAscii first")
+      return nil
     }
   }

@@ -213,26 +196,15 @@ extension AST.Atom {
     default: return nil
     }
   }
-
-  func isAscii() -> Bool {
-    switch kind {
-    case let .char(c):
-      return c.isASCII
-    case let .scalar(s):
-      return s.value.isASCII
-    default:
-      return false
-    }
-  }

-  func toAsciiValue() -> UInt8 {
+  var asciiValue: UInt8? {
     switch kind {
-    case let .char(c):
-      return c.asciiValue!
-    case let .scalar(s):
+    case let .char(c) where c != "\r\n":
Member

Future cleanup: something like the below to consolidate logic

extension Character {
  var _singleScalarASCIIValue: UInt8? { ... }
}

+      return c.asciiValue
+    case let .scalar(s) where s.value.isASCII:
       return UInt8(ascii: s.value)
     default:
-      fatalError("Should have been checked by isAscii first")
+      return nil
     }
   }

@@ -293,40 +265,32 @@ extension AST.Atom {
 }

 extension DSLTree.CustomCharacterClass.Member {
-  func isAscii() -> Bool {
-    switch self {
-    case let .atom(a):
-      return a.isAscii()
-    case let .range(low, high):
-      return low.isAscii() && high.isAscii()
-    // The remaining cases have nested character classes with possibly different
-    // inversion so leave them out of this optimization
-    default:
-      return false
-    }
-  }
-
   func asAsciiBitset(
     _ opts: MatchingOptions,
     _ isInverted: Bool
-  ) -> DSLTree.CustomCharacterClass.AsciiBitset {
+  ) -> DSLTree.CustomCharacterClass.AsciiBitset? {
     switch self {
     case let .atom(a):
-      return DSLTree.CustomCharacterClass.AsciiBitset(
-        a.toAsciiValue(),
-        isInverted,
-        opts.isCaseInsensitive
-      )
+      if let val = a.asciiValue {
+        return DSLTree.CustomCharacterClass.AsciiBitset(
+          val,
+          isInverted,
+          opts.isCaseInsensitive
+        )
+      }
     case let .range(low, high):
-      return DSLTree.CustomCharacterClass.AsciiBitset(
-        low: low.toAsciiValue(),
-        high: high.toAsciiValue(),
-        isInverted: isInverted,
-        isCaseInsensitive: opts.isCaseInsensitive
-      )
+      if let lowVal = low.asciiValue, let highVal = high.asciiValue {
+        return DSLTree.CustomCharacterClass.AsciiBitset(
+          low: lowVal,
+          high: highVal,
+          isInverted: isInverted,
+          isCaseInsensitive: opts.isCaseInsensitive
+        )
+      }
     default:
-      fatalError("Should have been checked by isAscii first")
+      return nil
     }
+    return nil
   }

   func generateConsumer(
@@ -436,14 +400,16 @@ extension DSLTree.CustomCharacterClass.Member {
 }

 extension DSLTree.CustomCharacterClass {
-  func isAscii() -> Bool {
-    return members.allSatisfy { member in member.isAscii() }
-  }
-
-  func asAsciiBitset(_ opts: MatchingOptions) -> AsciiBitset {
+  func asAsciiBitset(_ opts: MatchingOptions) -> AsciiBitset? {
     return members.reduce(
       .init(isInverted: isInverted, isCaseInsensitive: opts.isCaseInsensitive),
-      {result, member in result.union(member.asAsciiBitset(opts, isInverted))}
+      {result, member in
+        if let next = member.asAsciiBitset(opts, isInverted) {
+          return result?.union(next)
+        } else {
+          return nil
+        }
+      }
     )
   }

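Two details of the `ConsumerInterface.swift` changes may be easier to see in isolation. First, the `where c != "\r\n"` guard matters because `"\r\n"` is a single `Character` whose standard-library `asciiValue` is 10 (LF), even though it consists of two scalars. Second, the `reduce` now threads an optional accumulator, so one non-ASCII member turns the whole result into `nil`. The sketch below assumes a hypothetical helper `singleScalarASCIIValueSketch`, loosely modelled on the `_singleScalarASCIIValue` idea from the review comment, and uses a `Set<UInt8>` in place of the real `AsciiBitset`.

```swift
extension Character {
  /// Hypothetical helper: the ASCII value of a character made of exactly
  /// one ASCII scalar, or nil otherwise.
  var singleScalarASCIIValueSketch: UInt8? {
    // "\r\n" reports asciiValue == 10 but is two scalars, so exclude it.
    guard self != "\r\n" else { return nil }
    return asciiValue
  }
}

/// Folds the members into one set. The accumulator is optional: once any
/// member has no single-scalar ASCII value, the result stays nil.
func combinedASCIISet(_ chars: [Character]) -> Set<UInt8>? {
  let initial: Set<UInt8>? = []
  return chars.reduce(initial) { result, c in
    guard let value = c.singleScalarASCIIValueSketch else { return nil }
    return result?.union([value])
  }
}
```

A design note on the fold: because `result?.union(...)` propagates `nil`, a later ASCII member cannot revive a result that an earlier non-ASCII member already invalidated.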
28 changes: 13 additions & 15 deletions Sources/_StringProcessing/Regex/DSLTree.swift
@@ -170,26 +170,26 @@ extension DSLTree {
       let isCaseInsensitive: Bool
       var a: UInt64 = 0
       var b: UInt64 = 0

       init(isInverted: Bool, isCaseInsensitive: Bool) {
         self.isInverted = isInverted
         self.isCaseInsensitive = isCaseInsensitive
       }

       init(_ val: UInt8, _ isInverted: Bool, _ isCaseInsensitive: Bool) {
         self.isInverted = isInverted
         self.isCaseInsensitive = isCaseInsensitive
         add(val)
       }

       init(low: UInt8, high: UInt8, isInverted: Bool, isCaseInsensitive: Bool) {
         self.isInverted = isInverted
         self.isCaseInsensitive = isCaseInsensitive
         for val in low...high {
           add(val)
         }
       }

       internal init(
         a: UInt64,
         b: UInt64,
@@ -201,29 +201,27 @@ extension DSLTree {
         self.a = a
         self.b = b
       }

       internal mutating func add(_ val: UInt8) {
         setBit(val)
         if isCaseInsensitive {
-          let c = Character(Unicode.Scalar.init(val))
-          let otherCase: String
-          if c.isUppercase {
-            otherCase = c.lowercased()
-          } else {
-            otherCase = c.uppercased()
+          if val >= 64 && val <= 90 {
+            setBit(val + 32)
+          }
+          if val >= 97 && val <= 122 {
+            setBit(val - 32)
           }
-          setBit(otherCase.first!.asciiValue!)
         }
       }

       internal mutating func setBit(_ val: UInt8) {
         if val < 64 {
           a = a | 1 << val
         } else {
           b = b | 1 << (val - 64)
         }
       }

       internal func matches(char: Character) -> Bool {
         let ret: Bool
         if let val = char.asciiValue {
@@ -242,7 +240,7 @@ extension DSLTree {

         return ret
       }

       /// Joins another bitset from a Member of the same CustomCharacterClass
       internal func union(_ other: AsciiBitset) -> AsciiBitset {
         precondition(self.isInverted == other.isInverted)
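The `DSLTree.swift` hunks store the 128 possible ASCII values in two `UInt64` words and replace the `Character`-based case conversion with byte arithmetic. The following is a self-contained sketch of that representation, not the actual `AsciiBitset`: the type name `ASCIIBitsetSketch`, its field names, and the `caseInsensitive:` parameter are assumptions made for illustration, and inversion is left out. One deliberate difference: the sketch starts the uppercase range at 65 (`"A"`), since 64 is `"@"` and adding 32 to it would also set the bit for the backtick.

```swift
/// A 128-bit set over ASCII values, stored as two 64-bit words.
/// Hypothetical sketch; names do not match the real AsciiBitset API.
struct ASCIIBitsetSketch {
  var low: UInt64 = 0   // bits for values 0...63
  var high: UInt64 = 0  // bits for values 64...127

  mutating func setBit(_ val: UInt8) {
    precondition(val < 128, "only ASCII values fit in the bitset")
    if val < 64 {
      low |= 1 << UInt64(val)
    } else {
      high |= 1 << UInt64(val - 64)
    }
  }

  /// Adds `val`; when `caseInsensitive` is true, also adds the other case
  /// of an ASCII letter. Upper and lower case differ by exactly 32.
  mutating func add(_ val: UInt8, caseInsensitive: Bool) {
    setBit(val)
    guard caseInsensitive else { return }
    switch val {
    case 65...90: setBit(val + 32)   // "A"..."Z" also sets "a"..."z"
    case 97...122: setBit(val - 32)  // "a"..."z" also sets "A"..."Z"
    default: break
    }
  }

  func contains(_ val: UInt8) -> Bool {
    guard val < 128 else { return false }
    if val < 64 {
      return (low >> UInt64(val)) & 1 == 1
    } else {
      return (high >> UInt64(val - 64)) & 1 == 1
    }
  }

  /// Combines two sets by OR-ing the corresponding words.
  func union(_ other: ASCIIBitsetSketch) -> ASCIIBitsetSketch {
    return ASCIIBitsetSketch(low: low | other.low, high: high | other.high)
  }
}
```

Membership is then a shift and a mask on one word, which is the reason a bitset can stand in for a chain of per-member closures when every member is ASCII.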