Skip to content

Commit ab104c3

Browse files
committed
Unicode correctness
Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: #62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in #76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
1 parent 1d0aaa1 commit ab104c3

File tree

11 files changed

+354
-94
lines changed

11 files changed

+354
-94
lines changed

.github/workflows/ci.yml

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -25,10 +25,10 @@ jobs:
2525
output
2626
2727
- name: Install dependencies
28-
run: spago install
28+
run: spago -x spago-dev.dhall install
2929

3030
- name: Build source
31-
run: spago build --no-install --purs-args '--censor-lib --strict --censor-codes='UserDefinedWarning''
31+
run: spago -x spago-dev.dhall build --no-install --purs-args '--censor-lib --strict --censor-codes='UserDefinedWarning''
3232

3333
- name: Run tests
34-
run: spago test --no-install
34+
run: spago -x spago-dev.dhall test --no-install

CHANGELOG.md

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,28 @@ Notable changes to this project are documented in this file. The format is based
66

77
Breaking changes:
88

9+
- `anyChar` will no longer always succeed. It will only succeed on a Basic
10+
Multilingual Plane character. The new parser `anyCodePoint` will always
11+
succeed. (#119 by @jamesdbrock)
12+
- Delete the `StringLike` typeclass. Users must delete all `StringLike`
13+
constraints. (#119 by @jamesdbrock)
14+
- Move `updatePosString` to the `String` module and don’t
15+
export it. (#119 by @jamesdbrock)
16+
- Change the definition of `whiteSpace` and `skipSpaces` to
17+
`Data.CodePoint.Unicode.isSpace`. (#119 by @jamesdbrock)
18+
919
New features:
1020

21+
- Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
22+
`CodePoint`s. (#119 by @jamesdbrock)
23+
- Add `match` combinator (#119 by @jamesdbrock)
24+
- Add benchmark suite (#119 by @jamesdbrock)
25+
- Split the dev dependencies out into `spago-dev.dhall`.
26+
1127
Bugfixes:
1228

29+
- Unicode correctness (#119 by @jamesdbrock)
30+
1331
Other improvements:
1432

1533
## [v6.0.2](https://github.com/purescript-contrib/purescript-parsing/releases/tag/v6.0.2) - 2021-05-09
@@ -26,12 +44,12 @@ Other improvements:
2644
## [v6.0.0](https://github.com/purescript-contrib/purescript-parsing/releases/tag/v6.0.0) - 2021-02-26
2745

2846
Breaking changes:
29-
- Improved performance of `string` and update `StringLike` to have `stripPrefix` as a class member instead of `indexOf` (#93)
47+
- Improved performance of `string` and update `StringLike` to have `stripPrefix` as a class member instead of `indexOf` (#93)
3048
- Non-empty combinators now return `NonEmptyList` (#102)
3149
- Added support for PureScript 0.14 and dropped support for all previous versions (#101)
3250

3351
New features:
34-
- Derived `Generic` instance of Position (#87)
52+
- Derived `Generic` instance of Position (#87)
3553

3654
Bugfixes:
3755

CONTRIBUTING.md

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,3 +3,16 @@
33
Thanks for your interest in contributing to `parsing`! We welcome new contributions regardless of your level of experience or familiarity with PureScript.
44

55
Every library in the Contributors organization shares a simple handbook that helps new contributors get started. With that in mind, please [read the short contributing guide on purescript-contrib/governance](https://github.com/purescript-contrib/governance/blob/main/contributing.md) before contributing to this library.
6+
7+
# Development
8+
9+
This package includes a `spago-dev.dhall` which provides the dependencies
10+
for development and testing.
11+
12+
## Testing
13+
14+
To run the test suite:
15+
16+
```
17+
spago -x spago-dev.dhall test
18+
```

bench/Main.purs

Lines changed: 134 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
-- | # Benchmarking
2+
-- |
3+
-- | spago -x spago-dev.dhall run --main Bench.Main
4+
-- |
5+
-- | This benchmark suite is intended to guide changes to this package so that
6+
-- | we can compare the benchmarks of different commits.
7+
-- |
8+
-- | This benchmark suite also compares parsers to equivalent Regex. This
9+
-- | provides an answer to the common question “How much slower is this package
10+
-- | than Regex?” Answer: approximately 100×. The Regex benchmarks also give
11+
-- | us a rough way to calibrate benchmarks run on different platforms.
12+
-- |
13+
-- | # Profiling
14+
-- |
15+
-- | https://nodejs.org/en/docs/guides/simple-profiling/
16+
-- | https://nodesource.com/blog/diagnostics-in-NodeJS-2
17+
-- |
18+
-- | spago -x spago-dev.dhall build --source-maps
19+
-- | purs bundle output/**/*.js --source-maps --output ./index.bundle.js
20+
-- |
21+
-- |
22+
-- | spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
23+
-- | zephyr Bench.Main.main --codegen sourcemaps,js
24+
-- | purs bundle dce-output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.dce.bundle.js
25+
-- | node index.dce.bundle.js
26+
-- |
27+
-- | spago -x spago-dev.dhall build --source-maps --purs-args '--codegen corefn,sourcemaps'
28+
-- | purs bundle output/**/*.js --source-maps --module Bench.Main --main Bench.Main --output ./index.bundle.js
29+
-- | node index.bundle.js
30+
-- | node --prof --enable-source-maps ./index.bundle.js
31+
-- | node --prof-process --source-map ./index.bundle.js.map isolate--.log > prof.txt
32+
-- |
33+
-- | node --prof --enable-source-maps -e 'require("./output/Bench.Main/index.js").main()'
34+
-- | node --prof-process isolate--.log
35+
-- |
36+
-- | spago -x spago-dev.dhall build
37+
-- | node --prof -e 'require("./output/Bench.Main/index.js").main()'
38+
-- | node --prof-process isolate--.log > prof.txt
39+
40+
module Bench.Main where
41+
42+
import Prelude
43+
44+
import Data.Array (fold, replicate)
45+
import Data.Either (either)
46+
import Data.List (manyRec)
47+
import Data.List.Types (List)
48+
import Data.String.Regex (Regex, regex)
49+
import Data.String.Regex as Regex
50+
import Data.String.Regex.Flags (RegexFlags(..))
51+
import Effect (Effect)
52+
import Effect.Console (log)
53+
import Effect.Exception (throw)
54+
import Effect.Unsafe (unsafePerformEffect)
55+
import Performance.Minibench (benchWith)
56+
import Text.Parsing.Parser (Parser, runParser)
57+
import Text.Parsing.Parser.String (string)
58+
import Text.Parsing.Parser.Token (digit)
59+
import Text.Parsing.StringParser as StringParser
60+
import Text.Parsing.StringParser.CodePoints as StringParser.CodePoints
61+
import Text.Parsing.StringParser.CodeUnits as StringParser.CodeUnits
62+
63+
string23 :: String
64+
string23 = "23"
65+
string23_2 :: String
66+
string23_2 = fold $ replicate 2 string23
67+
string23_10000 :: String
68+
string23_10000 = fold $ replicate 10000 string23
69+
70+
stringSkidoo :: String
71+
stringSkidoo = "skidoo"
72+
stringSkidoo_2 :: String
73+
stringSkidoo_2 = fold $ replicate 2 stringSkidoo
74+
stringSkidoo_10000 :: String
75+
stringSkidoo_10000 = fold $ replicate 10000 stringSkidoo
76+
77+
parse23 :: Parser String (List Char)
78+
parse23 = manyRec digit
79+
80+
parse23Points :: StringParser.Parser (List Char)
81+
parse23Points = manyRec StringParser.CodePoints.anyDigit
82+
83+
parse23Units :: StringParser.Parser (List Char)
84+
parse23Units = manyRec StringParser.CodeUnits.anyDigit
85+
86+
pattern23 :: Regex
87+
pattern23 = either (unsafePerformEffect <<< throw) identity $
88+
regex "\\d" $ RegexFlags
89+
{ dotAll: true
90+
, global: true
91+
, ignoreCase: false
92+
, multiline: true
93+
, sticky: false
94+
, unicode: true
95+
}
96+
97+
parseSkidoo :: Parser String (List String)
98+
parseSkidoo = manyRec $ string "skidoo"
99+
100+
patternSkidoo :: Regex
101+
patternSkidoo = either (unsafePerformEffect <<< throw) identity $
102+
regex "skidoo" $ RegexFlags
103+
{ dotAll: true
104+
, global: true
105+
, ignoreCase: false
106+
, multiline: true
107+
, sticky: false
108+
, unicode: true
109+
}
110+
111+
main :: Effect Unit
112+
main = do
113+
-- log $ show $ runParser string23_2 parse23
114+
-- log $ show $ Regex.match pattern23 string23_2
115+
-- log $ show $ runParser stringSkidoo_2 parseSkidoo
116+
-- log $ show $ Regex.match patternSkidoo stringSkidoo_2
117+
log "runParser parse23"
118+
benchWith 200
119+
$ \_ -> runParser string23_10000 parse23
120+
log "StringParser.runParser parse23Points"
121+
benchWith 20
122+
$ \_ -> StringParser.runParser parse23Points string23_10000
123+
log "StringParser.runParser parse23Units"
124+
benchWith 200
125+
$ \_ -> StringParser.runParser parse23Units string23_10000
126+
log "Regex.match pattern23"
127+
benchWith 200
128+
$ \_ -> Regex.match pattern23 string23_10000
129+
log "runParser parseSkidoo"
130+
benchWith 200
131+
$ \_ -> runParser stringSkidoo_10000 parseSkidoo
132+
log "Regex.match patternSkidoo"
133+
benchWith 200
134+
$ \_ -> Regex.match patternSkidoo stringSkidoo_10000

spago-dev.dhall

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
-- Spago configuration for testing, benchmarking, development.
2+
--
3+
-- See:
4+
-- * ./CONTRIBUTING.md
5+
-- * https://github.com/purescript/spago#devdependencies-testdependencies-or-in-general-a-situation-with-many-configurations
6+
--
7+
8+
let conf = ./spago.dhall
9+
10+
in conf //
11+
{ sources = [ "src/**/*.purs", "test/**/*.purs", "bench/**/*.purs" ]
12+
, dependencies = conf.dependencies #
13+
[ "assert"
14+
, "console"
15+
, "effect"
16+
, "psci-support"
17+
, "minibench"
18+
, "exceptions"
19+
, "string-parsers"
20+
]
21+
, packages = ./packages.dhall
22+
}

spago.dhall

Lines changed: 2 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,7 @@
11
{ name = "parsing"
22
, dependencies =
33
[ "arrays"
4-
, "assert"
5-
, "console"
64
, "control"
7-
, "effect"
85
, "either"
96
, "foldable-traversable"
107
, "identity"
@@ -14,13 +11,13 @@
1411
, "maybe"
1512
, "newtype"
1613
, "prelude"
17-
, "psci-support"
1814
, "strings"
1915
, "tailrec"
2016
, "transformers"
2117
, "tuples"
2218
, "unicode"
19+
, "unsafe-coerce"
2320
]
2421
, packages = ./packages.dhall
25-
, sources = [ "src/**/*.purs", "test/**/*.purs" ]
22+
, sources = [ "src/**/*.purs" ]
2623
}

src/Text/Parsing/Parser/Language.purs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ import Prelude
1313
import Control.Alt ((<|>))
1414
import Text.Parsing.Parser (ParserT)
1515
import Text.Parsing.Parser.String (char, oneOf)
16-
import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser, alphaNum, letter)
16+
import Text.Parsing.Parser.Token (GenLanguageDef(..), LanguageDef, TokenParser, alphaNum, letter, makeTokenParser, unGenLanguageDef)
1717

1818
-----------------------------------------------------------
1919
-- Styles: haskellStyle, javaStyle

src/Text/Parsing/Parser/Pos.purs

Lines changed: 1 addition & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,8 @@
11
module Text.Parsing.Parser.Pos where
22

33
import Prelude
4+
45
import Data.Generic.Rep (class Generic)
5-
import Data.Foldable (foldl)
6-
import Data.Newtype (wrap)
7-
import Data.String (split)
86

97
-- | `Position` represents the position of the parser in the input.
108
-- |
@@ -27,13 +25,3 @@ derive instance ordPosition :: Ord Position
2725
-- | The `Position` before any input has been parsed.
2826
initialPos :: Position
2927
initialPos = Position { line: 1, column: 1 }
30-
31-
-- | Updates a `Position` by adding the columns and lines in `String`.
32-
updatePosString :: Position -> String -> Position
33-
updatePosString pos' str = foldl updatePosChar pos' (split (wrap "") str)
34-
where
35-
updatePosChar (Position pos) c = case c of
36-
"\n" -> Position { line: pos.line + 1, column: 1 }
37-
"\r" -> Position { line: pos.line + 1, column: 1 }
38-
"\t" -> Position { line: pos.line, column: pos.column + 8 - ((pos.column - 1) `mod` 8) }
39-
_ -> Position { line: pos.line, column: pos.column + 1 }

0 commit comments

Comments
 (0)