Skip to content

Commit 7795887

Browse files
committed
Unicode correctness
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: #62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in #76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
1 parent cf4578b commit 7795887

File tree

7 files changed

+199
-134
lines changed

7 files changed

+199
-134
lines changed

bench/Main.purs

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -26,8 +26,7 @@ import Effect.Exception (throw)
2626
import Effect.Unsafe (unsafePerformEffect)
2727
import Performance.Minibench (benchWith)
2828
import Text.Parsing.Parser (Parser, runParser)
29-
import Text.Parsing.Parser.Token (digit)
30-
import Text.Parsing.Parser.String (string)
29+
import Text.Parsing.Parser.String (digit, string)
3130

3231
string23 :: String
3332
string23 = "23"

spago.dhall

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
, "transformers"
1717
, "tuples"
1818
, "unicode"
19+
, "unsafe-coerce"
1920
]
2021
, packages = ./packages.dhall
2122
, sources = [ "src/**/*.purs" ]

src/Text/Parsing/Parser/Language.purs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,8 +12,8 @@ import Prelude
1212

1313
import Control.Alt ((<|>))
1414
import Text.Parsing.Parser (ParserT)
15-
import Text.Parsing.Parser.String (char, oneOf)
16-
import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser, alphaNum, letter)
15+
import Text.Parsing.Parser.String (char, oneOf, alphaNum, letter)
16+
import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser)
1717

1818
-----------------------------------------------------------
1919
-- Styles: haskellStyle, javaStyle

src/Text/Parsing/Parser/Pos.purs

Lines changed: 1 addition & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,8 @@
11
module Text.Parsing.Parser.Pos where
22

33
import Prelude
4+
45
import Data.Generic.Rep (class Generic)
5-
import Data.Foldable (foldl)
6-
import Data.Newtype (wrap)
7-
import Data.String (split)
86

97
-- | `Position` represents the position of the parser in the input.
108
-- |
@@ -27,13 +25,3 @@ derive instance ordPosition :: Ord Position
2725
-- | The `Position` before any input has been parsed.
2826
initialPos :: Position
2927
initialPos = Position { line: 1, column: 1 }
30-
31-
-- | Updates a `Position` by adding the columns and lines in `String`.
32-
updatePosString :: Position -> String -> Position
33-
updatePosString pos' str = foldl updatePosChar pos' (split (wrap "") str)
34-
where
35-
updatePosChar (Position pos) c = case c of
36-
"\n" -> Position { line: pos.line + 1, column: 1 }
37-
"\r" -> Position { line: pos.line + 1, column: 1 }
38-
"\t" -> Position { line: pos.line, column: pos.column + 8 - ((pos.column - 1) `mod` 8) }
39-
_ -> Position { line: pos.line, column: pos.column + 1 }

src/Text/Parsing/Parser/String.purs

Lines changed: 166 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -1,92 +1,200 @@
11
-- | Primitive parsers for working with an input stream of type `String`.
2-
3-
module Text.Parsing.Parser.String where
2+
-- |
3+
-- | The behavior of these primitive parsers is based on the behavior of the
4+
-- | `Data.String` module in the __strings__ package.
5+
-- | In most JavaScript runtime environments, the `String`
6+
-- | is little-endian [UTF-16](https://en.wikipedia.org/wiki/UTF-16).
7+
-- |
8+
-- | The primitive parsers which return `Char` will only succeed when the character
9+
-- | being parsed is a code point in the
10+
-- | [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane)
11+
-- | (the “BMP”). These parsers can be convenient because of the good support
12+
-- | that PureScript has for writing `Char` literals like `'あ'`, `'β'`, `'C'`.
13+
-- |
14+
-- | The other primitive parsers, which return `CodePoint` and `String` types,
15+
-- | can parse the full Unicode character set. All of the primitive parsers
16+
-- | in this module can be used together.
17+
module Text.Parsing.Parser.String
18+
( string
19+
, eof
20+
, anyChar
21+
, anyCodePoint
22+
, satisfy
23+
, satisfyCodePoint
24+
, char
25+
, whiteSpace
26+
, skipSpaces
27+
, oneOf
28+
, noneOf
29+
, match
30+
, digit
31+
, hexDigit
32+
, octDigit
33+
, upper
34+
, space
35+
, letter
36+
, alphaNum
37+
)
38+
where
439

540
import Prelude hiding (between)
641

7-
import Control.Monad.State (gets, modify_)
8-
import Data.Array (many)
9-
import Data.Foldable (elem, notElem)
42+
import Control.Monad.State (get, put)
43+
import Data.Array (notElem)
44+
import Data.Char (fromCharCode)
45+
import Data.CodePoint.Unicode (isAlpha, isAlphaNum, isDecDigit, isHexDigit, isOctDigit, isSpace, isUpper)
46+
import Data.Foldable (elem)
1047
import Data.Maybe (Maybe(..))
11-
import Data.Newtype (wrap)
12-
import Data.String (Pattern)
13-
import Data.String as S
48+
import Data.String (CodePoint, Pattern(..), codePointFromChar, null, stripPrefix, uncons)
1449
import Data.String.CodeUnits as SCU
50+
import Data.Tuple (Tuple(..), fst)
1551
import Text.Parsing.Parser (ParseState(..), ParserT, fail)
16-
import Text.Parsing.Parser.Combinators (tryRethrow, (<?>))
17-
import Text.Parsing.Parser.Pos (updatePosString)
18-
19-
-- | This class exists to abstract over streams which support the string-like
20-
-- | operations which this modules needs.
21-
class StringLike s where
22-
drop :: Int -> s -> s
23-
stripPrefix :: Pattern -> s -> Maybe s
24-
null :: s -> Boolean
25-
uncons :: s -> Maybe { head :: Char, tail :: s }
26-
27-
instance stringLikeString :: StringLike String where
28-
uncons = SCU.uncons
29-
drop = S.drop
30-
stripPrefix = S.stripPrefix
31-
null = S.null
52+
import Text.Parsing.Parser.Combinators (skipMany, tryRethrow, (<?>))
53+
import Text.Parsing.Parser.Pos (Position(..))
54+
import Unsafe.Coerce (unsafeCoerce)
3255

3356
-- | Match end-of-file.
34-
eof :: forall s m. StringLike s => Monad m => ParserT s m Unit
57+
eof :: forall m. Monad m => ParserT String m Unit
3558
eof = do
36-
input <- gets \(ParseState input _ _) -> input
59+
ParseState input _ _ <- get
3760
unless (null input) (fail "Expected EOF")
3861

3962
-- | Match the specified string.
40-
string :: forall s m. StringLike s => Monad m => String -> ParserT s m String
63+
string :: forall m. Monad m => String -> ParserT String m String
4164
string str = do
42-
input <- gets \(ParseState input _ _) -> input
43-
case stripPrefix (wrap str) input of
65+
ParseState input position _ <- get
66+
case stripPrefix (Pattern str) input of
4467
Just remainder -> do
45-
modify_ \(ParseState _ position _) ->
46-
ParseState remainder
47-
(updatePosString position str)
48-
true
68+
put $ ParseState remainder (updatePosString position str) true
4969
pure str
5070
_ -> fail ("Expected " <> show str)
5171

52-
-- | Match any character.
53-
anyChar :: forall s m. StringLike s => Monad m => ParserT s m Char
54-
anyChar = do
55-
input <- gets \(ParseState input _ _) -> input
72+
-- | Match any BMP `Char`.
73+
-- | Parser will fail if the character is not in the Basic Multilingual Plane.
74+
anyChar :: forall m. Monad m => ParserT String m Char
75+
anyChar = tryRethrow do
76+
cp :: Int <- deconstructCodePoint <$> anyCodePoint
77+
-- the `fromCharCode` function doesn't check if this is beyond the
78+
-- BMP, so we check that ourselves.
79+
-- https://github.com/purescript/purescript-strings/issues/153
80+
if cp > 65535 -- BMP
81+
then fail "Not a Char"
82+
else case fromCharCode cp of
83+
Nothing -> fail "Not a Char"
84+
Just c -> pure c
85+
86+
-- | Match any Unicode character.
87+
-- | Always succeeds.
88+
anyCodePoint :: forall m. Monad m => ParserT String m CodePoint
89+
anyCodePoint = do
90+
ParseState input position _ <- get
5691
case uncons input of
5792
Nothing -> fail "Unexpected EOF"
5893
Just { head, tail } -> do
59-
modify_ \(ParseState _ position _) ->
60-
ParseState tail
61-
(updatePosString position (SCU.singleton head))
62-
true
94+
put $ ParseState tail (updatePosSingle position head) true
6395
pure head
6496

65-
-- | Match a character satisfying the specified predicate.
66-
satisfy :: forall s m. StringLike s => Monad m => (Char -> Boolean) -> ParserT s m Char
97+
-- | Match a BMP `Char` satisfying the predicate.
98+
satisfy :: forall m. Monad m => (Char -> Boolean) -> ParserT String m Char
6799
satisfy f = tryRethrow do
68100
c <- anyChar
69-
if f c then pure c
70-
else fail $ "Character '" <> SCU.singleton c <> "' did not satisfy predicate"
101+
if f c
102+
then pure c
103+
else fail "Predicate unsatisfied"
71104

72-
-- | Match the specified character
73-
char :: forall s m. StringLike s => Monad m => Char -> ParserT s m Char
105+
-- | Match a Unicode character satisfying the predicate.
106+
satisfyCodePoint :: forall m. Monad m => (CodePoint -> Boolean) -> ParserT String m CodePoint
107+
satisfyCodePoint f = tryRethrow do
108+
c <- anyCodePoint
109+
if f c
110+
then pure c
111+
else fail "Predicate unsatisfied"
112+
113+
-- | Match the specified BMP `Char`.
114+
char :: forall m. Monad m => Char -> ParserT String m Char
74115
char c = satisfy (_ == c) <?> show c
75116

76-
-- | Match zero or more whitespace characters.
77-
whiteSpace :: forall s m. StringLike s => Monad m => ParserT s m String
78-
whiteSpace = do
79-
cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t'
80-
pure $ SCU.fromCharArray cs
117+
-- | Match zero or more whitespace characters satisfying
118+
-- | `Data.CodePoint.Unicode.isSpace`.
119+
whiteSpace :: forall m. Monad m => ParserT String m String
120+
whiteSpace = fst <$> match skipSpaces
81121

82122
-- | Skip whitespace characters.
83-
skipSpaces :: forall s m. StringLike s => Monad m => ParserT s m Unit
84-
skipSpaces = void whiteSpace
123+
skipSpaces :: forall m. Monad m => ParserT String m Unit
124+
skipSpaces = skipMany (satisfyCodePoint isSpace)
85125

86-
-- | Match one of the characters in the array.
87-
oneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char
126+
-- | Match one of the BMP `Char`s in the array.
127+
oneOf :: forall m. Monad m => Array Char -> ParserT String m Char
88128
oneOf ss = satisfy (flip elem ss) <?> ("one of " <> show ss)
89129

90-
-- | Match any character not in the array.
91-
noneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char
130+
-- | Match any BMP `Char` not in the array.
131+
noneOf :: forall m. Monad m => Array Char -> ParserT String m Char
92132
noneOf ss = satisfy (flip notElem ss) <?> ("none of " <> show ss)
133+
134+
-- | Updates a `Position` by adding the columns and lines in `String`.
135+
updatePosString :: Position -> String -> Position
136+
updatePosString pos str = case uncons str of
137+
Nothing -> pos
138+
Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
139+
140+
-- | Updates a `Position` by adding the columns and lines in a
141+
-- | single `CodePoint`.
142+
updatePosSingle :: Position -> CodePoint -> Position
143+
updatePosSingle (Position {line,column}) cp = case deconstructCodePoint cp of
144+
10 -> Position { line: line + 1, column: 1 } -- "\n"
145+
13 -> Position { line: line + 1, column: 1 } -- "\r"
146+
9 -> Position { line, column: column + 8 - ((column - 1) `mod` 8) } -- "\t" Who says that one tab is 8 columns?
147+
_ -> Position { line, column: column + 1 }
148+
149+
-- | Combinator which returns both the result of a parse and the portion of
150+
-- | the input that was consumed while it was being parsed.
151+
match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a)
152+
match p = do
153+
ParseState input1 _ _ <- get
154+
x <- p
155+
ParseState input2 _ _ <- get
156+
-- We use the `SCU.length`, which is in units of “code units”
157+
-- instead of `Data.String.length`. which is in units of “code points”.
158+
-- This is more efficient, and it will be correct as long as we can assume
159+
-- the invariant that the `ParseState input` always begins on a code point
160+
-- boundary.
161+
pure $ Tuple (SCU.take (SCU.length input1 - SCU.length input2) input1) x
162+
163+
-- Helper function.
164+
satisfyCP :: forall m . Monad m => (CodePoint -> Boolean) -> ParserT String m Char
165+
satisfyCP p = satisfy (p <<< codePointFromChar)
166+
167+
-- | Parse a digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isDecDigit`.
168+
digit :: forall m . Monad m => ParserT String m Char
169+
digit = satisfyCP isDecDigit <?> "digit"
170+
171+
-- | Parse a hex digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isHexDigit`.
172+
hexDigit :: forall m . Monad m => ParserT String m Char
173+
hexDigit = satisfyCP isHexDigit <?> "hex digit"
174+
175+
-- | Parse an octal digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isOctDigit`.
176+
octDigit :: forall m . Monad m => ParserT String m Char
177+
octDigit = satisfyCP isOctDigit <?> "oct digit"
178+
179+
-- | Parse an uppercase letter. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isUpper`.
180+
upper :: forall m . Monad m => ParserT String m Char
181+
upper = satisfyCP isUpper <?> "uppercase letter"
182+
183+
-- | Parse a space character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isSpace`.
184+
space :: forall m . Monad m => ParserT String m Char
185+
space = satisfyCP isSpace <?> "space"
186+
187+
-- | Parse an alphabetical character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlpha`.
188+
letter :: forall m . Monad m => ParserT String m Char
189+
letter = satisfyCP isAlpha <?> "letter"
190+
191+
-- | Parse an alphabetical or numerical character.
192+
-- | Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlphaNum`.
193+
alphaNum :: forall m . Monad m => ParserT String m Char
194+
alphaNum = satisfyCP isAlphaNum <?> "letter or digit"
195+
196+
-- The CodePoint newtype constructor is not exported, so here's a helper.
197+
-- This will break at runtime if the definition of CodePoint ever changes
198+
-- to something other than `newtype CodePoint = CodePoint Int`.
199+
deconstructCodePoint :: CodePoint -> Int
200+
deconstructCodePoint = unsafeCoerce

0 commit comments

Comments
 (0)