Unicode correctness

jamesdbrock · jamesdbrock · commit 7795887abad3 · 2021-09-28T09:38:45.000+09:00
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: #62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in #76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
diff --git a/bench/Main.purs b/bench/Main.purs
@@ -26,8 +26,7 @@ import Effect.Exception (throw)
 import Effect.Unsafe (unsafePerformEffect)
 import Performance.Minibench (benchWith)
 import Text.Parsing.Parser (Parser, runParser)
-import Text.Parsing.Parser.Token (digit)
-import Text.Parsing.Parser.String (string)
+import Text.Parsing.Parser.String (digit, string)
 
 string23 :: String
 string23 = "23"
diff --git a/spago.dhall b/spago.dhall
@@ -16,6 +16,7 @@
   , "transformers"
   , "tuples"
   , "unicode"
+  , "unsafe-coerce"
   ]
 , packages = ./packages.dhall
 , sources = [ "src/**/*.purs" ]
diff --git a/src/Text/Parsing/Parser/Language.purs b/src/Text/Parsing/Parser/Language.purs
@@ -12,8 +12,8 @@ import Prelude
 
 import Control.Alt ((<|>))
 import Text.Parsing.Parser (ParserT)
-import Text.Parsing.Parser.String (char, oneOf)
-import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser, alphaNum, letter)
+import Text.Parsing.Parser.String (char, oneOf, alphaNum, letter)
+import Text.Parsing.Parser.Token (LanguageDef, TokenParser, GenLanguageDef(..), unGenLanguageDef, makeTokenParser)
 
 -----------------------------------------------------------
 -- Styles: haskellStyle, javaStyle
diff --git a/src/Text/Parsing/Parser/Pos.purs b/src/Text/Parsing/Parser/Pos.purs
@@ -1,10 +1,8 @@
 module Text.Parsing.Parser.Pos where
 
 import Prelude
+
 import Data.Generic.Rep (class Generic)
-import Data.Foldable (foldl)
-import Data.Newtype (wrap)
-import Data.String (split)
 
 -- | `Position` represents the position of the parser in the input.
 -- |
@@ -27,13 +25,3 @@ derive instance ordPosition :: Ord Position
 -- | The `Position` before any input has been parsed.
 initialPos :: Position
 initialPos = Position { line: 1, column: 1 }
-
--- | Updates a `Position` by adding the columns and lines in `String`.
-updatePosString :: Position -> String -> Position
-updatePosString pos' str = foldl updatePosChar pos' (split (wrap "") str)
-  where
-  updatePosChar (Position pos) c = case c of
-    "\n" -> Position { line: pos.line + 1, column: 1 }
-    "\r" -> Position { line: pos.line + 1, column: 1 }
-    "\t" -> Position { line: pos.line,     column: pos.column + 8 - ((pos.column - 1) `mod` 8) }
-    _    -> Position { line: pos.line,     column: pos.column + 1 }
diff --git a/src/Text/Parsing/Parser/String.purs b/src/Text/Parsing/Parser/String.purs
@@ -1,92 +1,200 @@
 -- | Primitive parsers for working with an input stream of type `String`.
-
-module Text.Parsing.Parser.String where
+-- |
+-- | The behavior of these primitive parsers is based on the behavior of the
+-- | `Data.String` module in the __strings__ package.
+-- | In most JavaScript runtime environments, the `String`
+-- | is little-endian [UTF-16](https://en.wikipedia.org/wiki/UTF-16).
+-- |
+-- | The primitive parsers which return `Char` will only succeed when the character
+-- | being parsed is a code point in the
+-- | [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane)
+-- | (the “BMP”). These parsers can be convenient because of the good support
+-- | that PureScript has for writing `Char` literals like `'あ'`, `'β'`, `'C'`.
+-- |
+-- | The other primitive parsers, which return `CodePoint` and `String` types,
+-- | can parse the full Unicode character set. All of the primitive parsers
+-- | in this module can be used together.
+module Text.Parsing.Parser.String
+( string
+, eof
+, anyChar
+, anyCodePoint
+, satisfy
+, satisfyCodePoint
+, char
+, whiteSpace
+, skipSpaces
+, oneOf
+, noneOf
+, match
+, digit
+, hexDigit
+, octDigit
+, upper
+, space
+, letter
+, alphaNum
+)
+where
 
 import Prelude hiding (between)
 
-import Control.Monad.State (gets, modify_)
-import Data.Array (many)
-import Data.Foldable (elem, notElem)
+import Control.Monad.State (get, put)
+import Data.Array (notElem)
+import Data.Char (fromCharCode)
+import Data.CodePoint.Unicode (isAlpha, isAlphaNum, isDecDigit, isHexDigit, isOctDigit, isSpace, isUpper)
+import Data.Foldable (elem)
 import Data.Maybe (Maybe(..))
-import Data.Newtype (wrap)
-import Data.String (Pattern)
-import Data.String as S
+import Data.String (CodePoint, Pattern(..), codePointFromChar, null, stripPrefix, uncons)
 import Data.String.CodeUnits as SCU
+import Data.Tuple (Tuple(..), fst)
 import Text.Parsing.Parser (ParseState(..), ParserT, fail)
-import Text.Parsing.Parser.Combinators (tryRethrow, (<?>))
-import Text.Parsing.Parser.Pos (updatePosString)
-
--- | This class exists to abstract over streams which support the string-like
--- | operations which this modules needs.
-class StringLike s where
-  drop :: Int -> s -> s
-  stripPrefix :: Pattern -> s -> Maybe s
-  null :: s -> Boolean
-  uncons :: s -> Maybe { head :: Char, tail :: s }
-
-instance stringLikeString :: StringLike String where
-  uncons = SCU.uncons
-  drop = S.drop
-  stripPrefix = S.stripPrefix
-  null = S.null
+import Text.Parsing.Parser.Combinators (skipMany, tryRethrow, (<?>))
+import Text.Parsing.Parser.Pos (Position(..))
+import Unsafe.Coerce (unsafeCoerce)
 
 -- | Match end-of-file.
-eof :: forall s m. StringLike s => Monad m => ParserT s m Unit
+eof :: forall m. Monad m => ParserT String m Unit
 eof = do
-  input <- gets \(ParseState input _ _) -> input
+  ParseState input _ _ <- get
   unless (null input) (fail "Expected EOF")
 
 -- | Match the specified string.
-string :: forall s m. StringLike s => Monad m => String -> ParserT s m String
+string :: forall m. Monad m => String -> ParserT String m String
 string str = do
-  input <- gets \(ParseState input _ _) -> input
-  case stripPrefix (wrap str) input of
+  ParseState input position _ <- get
+  case stripPrefix (Pattern str) input of
     Just remainder -> do
-      modify_ \(ParseState _ position _) ->
-        ParseState remainder
-                   (updatePosString position str)
-                   true
+      put $ ParseState remainder (updatePosString position str) true
       pure str
     _ -> fail ("Expected " <> show str)
 
--- | Match any character.
-anyChar :: forall s m. StringLike s => Monad m => ParserT s m Char
-anyChar = do
-  input <- gets \(ParseState input _ _) -> input
+-- | Match any BMP `Char`.
+-- | Parser will fail if the character is not in the Basic Multilingual Plane.
+anyChar :: forall m. Monad m => ParserT String m Char
+anyChar = tryRethrow do
+  cp :: Int <- deconstructCodePoint <$> anyCodePoint
+  -- the `fromCharCode` function doesn't check if this is beyond the
+  -- BMP, so we check that ourselves.
+  -- https://github.com/purescript/purescript-strings/issues/153
+  if cp > 65535 -- BMP
+    then fail "Not a Char"
+  else case fromCharCode cp of
+    Nothing -> fail "Not a Char"
+    Just c -> pure c
+
+-- | Match any Unicode character.
+-- | Always succeeds.
+anyCodePoint :: forall m. Monad m => ParserT String m CodePoint
+anyCodePoint = do
+  ParseState input position _ <- get
   case uncons input of
     Nothing -> fail "Unexpected EOF"
     Just { head, tail } -> do
-      modify_ \(ParseState _ position _) ->
-        ParseState tail
-                   (updatePosString position (SCU.singleton head))
-                   true
+      put $ ParseState tail (updatePosSingle position head) true
       pure head
 
--- | Match a character satisfying the specified predicate.
-satisfy :: forall s m. StringLike s => Monad m => (Char -> Boolean) -> ParserT s m Char
+-- | Match a BMP `Char` satisfying the predicate.
+satisfy :: forall m. Monad m => (Char -> Boolean) -> ParserT String m Char
 satisfy f = tryRethrow do
   c <- anyChar
-  if f c then pure c
-         else fail $ "Character '" <> SCU.singleton c <> "' did not satisfy predicate"
+  if f c
+    then pure c
+    else fail "Predicate unsatisfied"
 
--- | Match the specified character
-char :: forall s m. StringLike s => Monad m => Char -> ParserT s m Char
+-- | Match a Unicode character satisfying the predicate.
+satisfyCodePoint :: forall m. Monad m => (CodePoint -> Boolean) -> ParserT String m CodePoint
+satisfyCodePoint f = tryRethrow do
+  c <- anyCodePoint
+  if f c
+    then pure c
+    else fail "Predicate unsatisfied"
+
+-- | Match the specified BMP `Char`.
+char :: forall m. Monad m => Char -> ParserT String m Char
 char c = satisfy (_ == c) <?> show c
 
--- | Match zero or more whitespace characters.
-whiteSpace :: forall s m. StringLike s => Monad m => ParserT s m String
-whiteSpace = do
-  cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t'
-  pure $ SCU.fromCharArray cs
+-- | Match zero or more whitespace characters satisfying
+-- | `Data.CodePoint.Unicode.isSpace`.
+whiteSpace :: forall m. Monad m => ParserT String m String
+whiteSpace = fst <$> match skipSpaces
 
 -- | Skip whitespace characters.
-skipSpaces :: forall s m. StringLike s => Monad m => ParserT s m Unit
-skipSpaces = void whiteSpace
+skipSpaces :: forall m. Monad m => ParserT String m Unit
+skipSpaces = skipMany (satisfyCodePoint isSpace)
 
--- | Match one of the characters in the array.
-oneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char
+-- | Match one of the BMP `Char`s in the array.
+oneOf :: forall m. Monad m => Array Char -> ParserT String m Char
 oneOf ss = satisfy (flip elem ss) <?> ("one of " <> show ss)
 
--- | Match any character not in the array.
-noneOf :: forall s m. StringLike s => Monad m => Array Char -> ParserT s m Char
+-- | Match any BMP `Char` not in the array.
+noneOf :: forall m. Monad m => Array Char -> ParserT String m Char
 noneOf ss = satisfy (flip notElem ss) <?> ("none of " <> show ss)
+
+-- | Updates a `Position` by adding the columns and lines in `String`.
+updatePosString :: Position -> String -> Position
+updatePosString pos str = case uncons str of
+  Nothing -> pos
+  Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
+
+-- | Updates a `Position` by adding the columns and lines in a
+-- | single `CodePoint`.
+updatePosSingle :: Position -> CodePoint -> Position
+updatePosSingle (Position {line,column}) cp = case deconstructCodePoint cp of
+  10 -> Position { line: line + 1, column: 1 } -- "\n"
+  13 -> Position { line: line + 1, column: 1 } -- "\r"
+  9  -> Position { line, column: column + 8 - ((column - 1) `mod` 8) } -- "\t" Who says that one tab is 8 columns?
+  _  -> Position { line, column: column + 1 }
+
+-- | Combinator which returns both the result of a parse and the portion of
+-- | the input that was consumed while it was being parsed.
+match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a)
+match p = do
+  ParseState input1 _ _ <- get
+  x <- p
+  ParseState input2 _ _ <- get
+  -- We use the `SCU.length`, which is in units of “code units”
+  -- instead of `Data.String.length`. which is in units of “code points”.
+  -- This is more efficient, and it will be correct as long as we can assume
+  -- the invariant that the `ParseState input` always begins on a code point
+  -- boundary.
+  pure $ Tuple (SCU.take (SCU.length input1 - SCU.length input2) input1) x
+
+-- Helper function.
+satisfyCP :: forall m . Monad m => (CodePoint -> Boolean) -> ParserT String m Char
+satisfyCP p = satisfy (p <<< codePointFromChar)
+
+-- | Parse a digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isDecDigit`.
+digit :: forall m . Monad m => ParserT String m Char
+digit = satisfyCP isDecDigit <?> "digit"
+
+-- | Parse a hex digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isHexDigit`.
+hexDigit :: forall m . Monad m => ParserT String m Char
+hexDigit = satisfyCP isHexDigit <?> "hex digit"
+
+-- | Parse an octal digit. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isOctDigit`.
+octDigit :: forall m . Monad m => ParserT String m Char
+octDigit = satisfyCP isOctDigit <?> "oct digit"
+
+-- | Parse an uppercase letter. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isUpper`.
+upper :: forall m . Monad m => ParserT String m Char
+upper = satisfyCP isUpper <?> "uppercase letter"
+
+-- | Parse a space character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isSpace`.
+space :: forall m . Monad m => ParserT String m Char
+space = satisfyCP isSpace <?> "space"
+
+-- | Parse an alphabetical character. Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlpha`.
+letter :: forall m . Monad m => ParserT String m Char
+letter = satisfyCP isAlpha <?> "letter"
+
+-- | Parse an alphabetical or numerical character.
+-- | Matches any BMP `Char` that satisfies `Data.CodePoint.Unicode.isAlphaNum`.
+alphaNum :: forall m . Monad m => ParserT String m Char
+alphaNum = satisfyCP isAlphaNum <?> "letter or digit"
+
+-- The CodePoint newtype constructor is not exported, so here's a helper.
+-- This will break at runtime if the definition of CodePoint ever changes
+-- to something other than `newtype CodePoint = CodePoint Int`.
+deconstructCodePoint :: CodePoint -> Int
+deconstructCodePoint = unsafeCoerce
diff --git a/src/Text/Parsing/Parser/Token.purs b/src/Text/Parsing/Parser/Token.purs
diff --git a/test/Main.purs b/test/Main.purs

Original file line number	Diff line number	Diff line change
`@@ -16,6 +16,7 @@`
`16`	`16`	`, "transformers"`
`17`	`17`	`, "tuples"`
`18`	`18`	`, "unicode"`
	`19`	`+ , "unsafe-coerce"`
`19`	`20`	`]`
`20`	`21`	`, packages = ./packages.dhall`
`21`	`22`	`, sources = [ "src/*/.purs" ]`