Add parseFloatThousandSep #15421

juancarlospaco · 2020-09-28T11:47:12Z

Convenience func around parseFloat, designed to parse floats with thousand separators as found in the wild, formatted for humans.
Includes documentation, runnableExamples with doAssert, since, changelog, etc.

doAssert parseFloatThousandSep("1,000,000.000") == 1000000.0
doAssert parseFloatThousandSep("1'000'000.000", thousandSep = '\'') == 1000000.0

echo parseFloat("1,000,000.000")
  Error: unhandled exception: invalid float: 1,000,000.000 [ValueError]
echo parseFloat("1'000'000.000")
  Error: unhandled exception: invalid float: 1'000'000.000 [ValueError]

:)

links

Formatting a float to currency - Nim forum


Vindaar · 2020-09-29T08:54:01Z

I think this might be a good addition.

However, if we are to have something like this, I would trade speed for correctness. Perform a first pass over the string and make explicit checks, namely:

no separator before a digit
first separator can be anywhere after first digit
there have to be exactly 3 digits between successive separators and between the last separator and the decimal dot
no separator after decimal dot
...? (maybe I forgot something else)

Have that pass over the string remove the separators and hand the result to normal parseFloat.

juancarlospaco · 2020-09-29T10:17:24Z

@Vindaar
Do you have any pseudocode, code, or GitHub suggestion showing how you imagine it? Your comment makes sense.

juancarlospaco · 2020-09-29T10:23:26Z

This is for strings as formatted for humans with quirky punctuation,
sometimes used for money, coords, big floats in the sciences, etc.

That's why I named it differently: it does different things than parseFloat.
It can be made stricter and have more options, yeah...

Vindaar · 2020-09-29T14:07:43Z

@juancarlospaco I'll write a couple of lines later if I don't forget (ping me otherwise). Not a parsing expert though. :P

Vindaar · 2020-09-29T15:39:41Z

Probably neither the smartest, nor most efficient way, but this is what I came up with:

edit: haha, I just noticed I forgot about both minus and exp notation. The latter isn't important (who in their right mind would combine the two?), but the former kinda is. If you think this is useful, I can add it. But I won't do it now. It doesn't change the idea / approach.

import parseutils, strutils

proc parseFloatThousandSep(s: string,
                           sep: static char = ',',
                           decimalDot: static char = '.'): float = # maybe should not have a default sep?
  ## version of `parseFloat` which allows for thousand separators.
  ## The following assumptions / requirements must be met by the string
  ## - no separator before a digit
  ## - first separator can be anywhere after first digit, but no more than 3 characters
  ## - there ``has`` to be 3 digits between successive separators
  ##   - and between the last separator and the decimal dot
  ## - no separator after decimal dot
  ## - no duplicate separators
  ## - floats without separator allowed
  var
    buf = s
    idx = 0
    successive = 0
    afterDot = false
    lastWasDot = false
    lastWasSep = false
    hasAnySep = false
  template bail(msg: untyped): untyped =
    raise newException(ValueError, "Invalid float containing thousand separator." &
      " Reason: " & $msg)

  while idx < buf.len:
    case buf[idx]
    of sep:
      if idx == 0:
        bail("String starts with thousand separator.")
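      elif lastWasSep:
        bail("Two separators in a row.")
      elif afterDot:
        bail("Separator found after decimal dot.")
      elif hasAnySep and successive != 3:
        # enforce the documented rule: exactly 3 digits between separators
        bail("Not exactly 3 digits between thousand separators.")
      elif not hasAnySep and successive > 3:
        # enforce the documented rule: at most 3 digits before the first separator
        bail("More than 3 digits before first thousand separator.")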
      buf.delete(idx, idx)
      lastWasSep = true
      hasAnySep = true
      successive = 0
    of '0' .. '9':
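      # only count group size before the decimal dot; the fractional part
      # may have any number of digits (e.g. "1,000.1234")
      if hasAnySep and not afterDot and successive > 2: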
        bail("More than 3 digits between thousand separators.")
      lastWasSep = false
      lastWasDot = false
      inc successive
      inc idx
    of decimalDot:
      if idx == 0:
        bail("String starts with decimal dot.")
      elif hasAnySep and successive != 3:
        bail("Not 3 successive digits before decimal dot, despite the number being larger than 1000!")
      successive = 0
      lastWasDot = true
      afterDot = true
      inc idx
    else:
      # NOTE: could also move separator logic here if we wanted runtime separator
      # selection. Case needs CT info
      bail("Invalid character in float: " & $buf[idx])
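  if lastWasSep:
    bail("String ends with thousand separator.")
  elif hasAnySep and not afterDot and successive != 3:
    bail("Not 3 digits after the last thousand separator.")
  result = buf.parseFloat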


doAssert parseFloatThousandSep("1.0") == 1.0
doAssert parseFloatThousandSep("1.000") == 1.0
doAssert parseFloatThousandSep("1,000") == 1000.0
doAssertRaises(ValueError):
  # invalid because , not the sep
  discard parseFloatThousandSep("1,000", sep = '\'')
# compile time error due to duplicate case label
# parseFloatThousandSep("1,000", sep = '.')
doAssert parseFloatThousandSep("10,000.000") == 10000.0
doAssertRaises(ValueError):
  # thousand sep after decimal dot
  discard parseFloatThousandSep("10.000,000")
doAssert parseFloatThousandSep("1,000,000.000") == 1000000.0
doAssert parseFloatThousandSep("10,000,000.000") == 10000000.0
doAssertRaises(ValueError):
  # starts with sep
  discard parseFloatThousandSep(",123.000")
doAssertRaises(ValueError):
  # starts with decimal dot
  discard parseFloatThousandSep(".000")
doAssertRaises(ValueError):
  # duplicate thousand sep
  discard parseFloatThousandSep("123,,100.0")
doAssertRaises(ValueError):
  # sep before dot
  discard parseFloatThousandSep("123,.0")

Up to someone else to decide if the separator should be compile time, but this made the code a lot nicer imo. Feel free to take it or leave it. :)


dom96 · 2020-10-03T11:42:58Z

Cool, can we also get the opposite of this? :)


timotheecour · 2020-12-02T23:53:48Z

tests/stdlib/tstrmisc.nim

+  doAssert parseFloatThousandSep("-Inf", {pfNanInf}) == -Inf
+  doAssert parseFloatThousandSep("+Inf", {pfNanInf}) == +Inf
+  doAssert parseFloatThousandSep("1000.000000E+90") == 1e93
+  doAssert parseFloatThousandSep("-10 000 000 000.0001", sep=' ') == -10000000000.0001


parseFloatThousandSep("1e1") raises but should be accepted even without {pfDotOptional} because:

a user may not want to set {pfDotOptional} (too loose, eg would allow integers to be parsed as float)

nim: strutils.parseFloat accepts it

D: accepts it too (rdmd --eval 'writeln("1e1".to!double);')

python3: accepts it (float("1e1"))
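For reference, a minimal check of the existing behaviour cited above, assuming a stock Nim toolchain with `std/strutils` available. It only exercises `strutils.parseFloat`, not the proc under review:

```nim
import std/strutils

# strutils.parseFloat accepts exponent notation without a decimal dot.
doAssert parseFloat("1e1") == 10.0
doAssert parseFloat("1E+2") == 100.0
```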

timotheecour · 2020-12-02T23:59:39Z

@juancarlospaco I think https://github.com/nim-lang/Nim/pull/15421/files#r534562246 is my last comment, after that LGTM finally... unless i've missed some other case

EDIT: just found another bug: https://github.com/nim-lang/Nim/pull/15421/files#r534567618

please run sanity checks (or rather increase test coverage) before the next PTAL

timotheecour · 2020-12-03T00:08:01Z

tests/stdlib/tstrmisc.nim

+  doAssert parseFloatThousandSep("1000.000000E+90") == 1e93
+  doAssert parseFloatThousandSep("-10 000 000 000.0001", sep=' ') == -10000000000.0001
+  doAssert parseFloatThousandSep("-10 000 000 000,0001", sep=' ', decimalDot = ',') == -10000000000.0001
+  doAssert classify(parseFloatThousandSep("NaN", {pfNanInf})) == fcNan


bug, this shouldn't be accepted:

nim> echo parseFloatThousandSep("inf.0", {pfNanInf})
0.0

timotheecour · 2020-12-03T00:19:02Z

tests/stdlib/tstrmisc.nim

@@ -0,0 +1,59 @@
+import strmisc, math


please move the content of tests/stdlib/tstrmiscs.nim into here (tstrmiscs.nim was badly named)

timotheecour · 2020-12-03T00:24:42Z

lib/pure/strmisc.nim

+
+    proc parseFloatThousandSepRaise(i: int; c: char; s: openArray[char]) {.noinline, noreturn.} =
+      raise newException(ValueError,
+        "Invalid float containing thousand separators, invalid char $1 at index $2 for input $3" %


raise newException(ValueError, "Invalid float containing thousand separators, invalid char $1 at index $2 for input '$3'" % [$c, $i, s.join])

so that it shows as

'1,0000'

instead of

['1', ',', '0', '0', '0', '0']
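A small sketch of the difference, assuming `std/strutils` is imported; `viaDollar` and `viaJoin` are hypothetical helper names used only for illustration. `$` on an `openArray[char]` renders the individual characters, while `join` concatenates them back into the original text:

```nim
import std/strutils

proc viaDollar(s: openArray[char]): string =
  ## Hypothetical helper: what the error message embeds when using `$s`.
  $s

proc viaJoin(s: openArray[char]): string =
  ## Hypothetical helper: what the error message embeds when using `s.join`.
  s.join

echo viaDollar("1,0000")  # element-by-element rendering of the chars
echo viaJoin("1,0000")    # the input as the user typed it
doAssert viaJoin("1,0000") == "1,0000"
```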

timotheecour · 2020-12-03T00:32:38Z

lib/pure/strmisc.nim

+      parseFloatThousandSepRaise(0, sep, str)                 # "1,1"
+
+    if (strLen == 3 or strLen == 4) and (
+      (str[0] in {'i', 'I'} and str[1] in {'n', 'N'} and str[2] in {'f', 'F'}) or


can you simplify this logic? it's buggy for 2 reasons:

https://github.com/nim-lang/Nim/pull/15421/files#r534567618 wrongly accepted

inconsistently defers error handling to parseFloat:

nim> echo parseFloatThousandSep("infx", {pfNanInf})
Error: unhandled exception: invalid float: infx [ValueError]
nim> echo parseFloatThousandSep("infxx", {pfNanInf})
Error: unhandled exception: Invalid float containing thousand separators, invalid char , at index 0 for input 'infxx' [ValueError]

timotheecour · 2020-12-14T03:27:30Z

lib/pure/strmisc.nim

+    pfNanInf         ## Allow "NaN", "Inf", "-Inf", etc.
+
+  func parseFloatThousandSep*(str: openArray[char]; options: set[ParseFloatOptions] = {};
+      sep = ','; decimalDot = '.'): float =


can be discussed in future work, but just curious if there's a reverse proc for this?
I just found out about insertSep; maybe they should cross-reference each other?
I'm not sure how robust insertSep is though; there's also formatFloat and formatEng, maybe they should be extended to support formatting with thousand separators
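For context, a short sketch of `strutils.insertSep` as it exists today, applied to integer digit strings only. The round trip at the end strips the separators by hand with `replace` and is an illustration, not part of the PR:

```nim
import std/strutils

# insertSep groups digits from the right; the default separator is '_'.
doAssert insertSep("1000000") == "1_000_000"
doAssert insertSep("1000000", ',') == "1,000,000"

# Round trip: format with separators, strip them again, then parse.
let formatted = insertSep("1000000", ',')
doAssert formatted.replace(",", "").parseFloat == 1000000.0
```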

Araq · 2020-12-14T09:50:27Z

Sorry, rejected. For multiple reasons:

By design it cannot handle money which shouldn't be stored as a "float".
The openArray[char] interface is alien and not good enough for general parsing. (Hint: It doesn't return how many characters it consumed, so you're still left with a preprocessing step. Even if the string comes from an input field, you generally want to accept leading or trailing whitespace. So that would be yet another option, pfAllowWhitespace.)
The code is harder to use than the more hacky s.multiReplace({".": "", ",": "."}).parseFloat (which would be quite appropriate for German btw).
The amount of options you can pass to the proc indicates that nobody really knows the use cases or that "every" use case should be covered. And yet, "money" is not among these.
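The one-liner mentioned in the list above can be tried as follows. This is a sketch; `parseGermanFloat` is a hypothetical wrapper name, and the second assertion illustrates that the approach performs no validation of digit grouping:

```nim
import std/strutils

proc parseGermanFloat(s: string): float =
  ## Hypothetical wrapper around the multiReplace one-liner:
  ## drop '.' (thousand separator), turn ',' into the decimal dot.
  s.multiReplace({".": "", ",": "."}).parseFloat

doAssert parseGermanFloat("1.234,5") == 1234.5
# Misplaced separators are silently dropped rather than rejected.
doAssert parseGermanFloat("1..2,5") == 12.5
```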

timotheecour · 2020-12-16T19:18:14Z

/cc @Araq

For multiple reasons:

these are good arguments; how about the following instead:

proc parseIntCustom(a: string, start: int, T: typedesc[SomeInt], options: ParseOpt = {}): tuple[val: T, num: int] =
  ## val: parsed value; num: number of characters parsed on success or 0 on error
  runnableExamples:
    assert parseIntCustom("score: -1,234,567 name: foo", start = 7, int) == (-1234567, 10)
    assert parseIntCustom("--1", start = 0, int) == (0, 0) # because --
    assert parseIntCustom("-1", start = 0, uint) == (0, 0) # because uint

By design it cannot handle money which shouldn't be stored as a "float".

=> an API for that can be built on top of parseIntCustom to parse the integral part of the amount

The openArray[char] interface is alien and not good enough for general parsing

proposed API fixes that

The code is harder to use than the more hacky

but the more hacky way doesn't help with input validation

The amount of options you can pass to the proc indicates

I think parseIntCustom hits a sweet spot between complexity and usefulness; by restricting it to just integers (signed/unsigned), you avoid complexity of handling FP numbers; note that parsing FP numbers with thousand sep can use parseIntCustom as a building block, so can parsing of money.
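To make the proposal more concrete, here is a rough sketch of such a building block. It is not the proposed `parseIntCustom`: the name `parseIntThousandSep` is hypothetical, the `typedesc` and `options` parameters are dropped, only `int` is supported, and overflow is not handled. It returns the parsed value and the number of characters consumed, or `(0, 0)` on error:

```nim
const DecDigits = {'0'..'9'}

proc parseIntThousandSep(s: string; start = 0; sep = ','): tuple[val: int, num: int] =
  ## Hypothetical sketch: parse an optionally signed integer with thousand
  ## separators beginning at `start`; stops at the first char that does not fit.
  var i = start
  var negative = false
  if i < s.len and s[i] in {'+', '-'}:
    negative = s[i] == '-'
    inc i
  var val = 0
  var leading = 0                      # digits before the first separator
  while i < s.len and s[i] in DecDigits:
    val = val * 10 + (ord(s[i]) - ord('0'))
    inc i
    inc leading
  if leading == 0:
    return (0, 0)                      # no digits at all, e.g. "--1"
  var grouped = false
  # Each further group must be a separator followed by exactly 3 digits.
  while i + 3 < s.len and s[i] == sep and s[i+1] in DecDigits and
      s[i+2] in DecDigits and s[i+3] in DecDigits:
    for k in 1..3:
      val = val * 10 + (ord(s[i+k]) - ord('0'))
    inc i, 4
    grouped = true
  # Reject "1234,567" (long first group) and "1,2345" (long last group).
  if grouped and (leading > 3 or (i < s.len and s[i] in DecDigits)):
    return (0, 0)
  if negative:
    val = -val
  result = (val, i - start)

let r = parseIntThousandSep("score: -1,234,567 name: foo", start = 7)
doAssert r.val == -1234567 and r.num == 10
doAssert parseIntThousandSep("--1").num == 0
```

Because the consumed length is returned, a caller can detect trailing garbage or continue parsing (for example a decimal part) from `start + num`.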

Add parseFloatThousandSep

1ecf6ed

Araq reviewed Sep 29, 2020

lib/pure/strmisc.nim

juancarlospaco marked this pull request as draft September 29, 2020 10:25

juancarlospaco added 2 commits September 29, 2020 19:01

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

817df69

…ousands-separators

https://github.com/nim-lang/Nim/pull/15421#issuecomment-700791178

0b79ec4

juancarlospaco marked this pull request as ready for review September 29, 2020 22:43

juancarlospaco requested a review from Araq September 29, 2020 22:43

Vindaar reviewed Sep 29, 2020

lib/pure/strmisc.nim

Vindaar reviewed Sep 29, 2020

lib/pure/strmisc.nim

juancarlospaco added 2 commits September 29, 2020 20:13

https://github.com/nim-lang/Nim/pull/15421#issuecomment-700791178

53c3e18

https://github.com/nim-lang/Nim/pull/15421#issuecomment-700791178

b3ada57

Araq reviewed Oct 5, 2020

lib/pure/strmisc.nim

Araq reviewed Oct 5, 2020

lib/pure/strmisc.nim

Araq reviewed Oct 5, 2020

lib/pure/strmisc.nim

Araq reviewed Oct 5, 2020

lib/pure/strmisc.nim

Araq reviewed Oct 5, 2020

lib/pure/strmisc.nim

juancarlospaco added 3 commits October 5, 2020 13:55

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

ec57030

…ousands-separators

feedback

d63112c

feedbacks

590702a

juancarlospaco requested a review from Araq October 6, 2020 03:27

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

2163a46

…ousands-separators

Clyybber reviewed Oct 15, 2020

lib/pure/strmisc.nim

juancarlospaco added 2 commits October 15, 2020 09:51

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

5b2b630

…ousands-separators

https://github.com/nim-lang/Nim/pull/15421#discussion_r505507037

af6821a

juancarlospaco added 8 commits November 24, 2020 15:42

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

cc24825

…ousands-separators

is always scientific

15d832b

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

4769a90

…ousands-separators

https://github.com/nim-lang/Nim/pull/15421#discussion_r529793041

12854c1

Clean out

d747b18

moar test

093b826

sep and dot must not be '+', 'e', 'i', 'n', 'f', 'a'

c61993e

sep and dot must not be '+', 'e', 'i', 'n', 'f', 'a'

8040f52

juancarlospaco marked this pull request as ready for review November 24, 2020 20:35

juancarlospaco requested review from timotheecour and Araq November 27, 2020 15:39

timotheecour reviewed Nov 29, 2020

tests/stdlib/tstrmisc.nim

timotheecour reviewed Nov 29, 2020

tests/stdlib/tstrmisc.nim

juancarlospaco added 2 commits November 29, 2020 11:01

Merge branch 'devel' of https://github.com/nim-lang/Nim into float-th…

92b2ae6

…ousands-separators

We need doAssert that takes varargs for tests

fcc5c35

juancarlospaco requested a review from timotheecour November 29, 2020 14:56

timotheecour reviewed Dec 2, 2020

timotheecour reviewed Dec 3, 2020

juancarlospaco marked this pull request as draft December 3, 2020 02:15

timotheecour reviewed Dec 14, 2020

Araq closed this Dec 14, 2020

timotheecour mentioned this pull request Dec 16, 2020

misc parsing timotheecour/Nim#461

Open

1 task

timotheecour mentioned this pull request Apr 3, 2021

Digit grouping #11734

Open

timotheecour mentioned this pull request Jun 28, 2021

add fmtFloat to stdlib timotheecour/Nim#766

Open
