Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When delim=groupmark=x, treat x as delim unless input is quoted #182

Merged
merged 6 commits into from
Nov 9, 2023

Conversation

nickrobinson251
Copy link
Collaborator

@nickrobinson251 nickrobinson251 commented Nov 6, 2023

co-authored with @Drvi

When parsing unquoted numbers, and delim and groupmark are set to the same character, what should happen?
e.g. "1,foo" or "1,\"foo\"" or "1,2.3,foo" or "1,2.3,\"foo\""?

On main there is no documented behaviour for this situation.

In practice current behaviour on main is:

julia> for T in (Int, Float64)
          println(T)
          for c in ("1,foo","1,\"foo\"","1,2.3,foo","1,2.3,\"foo\"")
              print(rpad(repr(c), 15), " => ")
              println(Parsers.xparse(T, c; delim=',', groupmark=','))
           end
       end
Int64
"1,foo"         => Parsers.Result{Int64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=5, val=1)
"1,\"foo\""     => Parsers.Result{Int64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=7, val=1)
"1,2.3,foo"     => Parsers.Result{Int64}(code=`INVALID: OK | DELIMITED | INVALID_DELIMITER `, tlen=6, val=12)
"1,2.3,\"foo\"" => Parsers.Result{Int64}(code=`INVALID: OK | DELIMITED | INVALID_DELIMITER `, tlen=6, val=12)
Float64
"1,foo"         => Parsers.Result{Float64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=5, val=0.0)
"1,\"foo\""     => Parsers.Result{Float64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=7, val=0.0)
"1,2.3,foo"     => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=6, val=12.3)
"1,2.3,\"foo\"" => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=6, val=12.3)

With this PR the proposed behaviour is:

Int64
"1,foo"         => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
"1,\"foo\""     => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
"1,2.3,foo"     => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
"1,2.3,\"foo\"" => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
Float64
"1,foo"         => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)
"1,\"foo\""     => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)
"1,2.3,foo"     => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)
"1,2.3,\"foo\"" => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)

That is, the current behaviour on main is to "try" to treat the character as a groupmark if possible, e.g. with Float64 the input "1,2.3,foo" parses to 12.3 with the first , being interpreted as a groupmark due to the following number, but "1,foo" parses to 0.0 (not 1.0). And withInt "1,foo" parses to 1 but is marked INVALID whereas with Float64 1,2.3,foo parses to 12.3 but is marked OK. These inconsistencies are due to the parsers having no principled approach to resolving the ambiguous , character.

This PR changes the parsers to instead treat the character as a delim unless inside a quoted field. i.e. when groupmark and delim are the same (e.g. both ,) the input 1000,foo and \"1,000\",foo" both parse to 1000, but 1,000,foo parses to 1.

This brings groupmark inline with how we handle special or ambiguous characters when parsing to String, i.e. they can only appear in quoted values

Is this a breaking change?

The old behaviour was not documented, inconsistent, and an artefact of implementation details. However, this PR does change the behaviour of some existing tests, since unfortunately our tests were not written with a careful eye for delim==groupmark cases.

The tests actually tested both that we could and that we couldn't parse numbers in these cases, e.g. we had both

@test Parsers.xparse(Float64, "100,000,000.99,100"; groupmark=',', openquotechar='"', closequotechar='"') == Parsers.Result{Float64}(Int16(9), 15, 1.0000000099e8)

i.e. testing we got code='SUCCESS: OK | DELIMITED ' and val=1.0000000099e8
and

res = Parsers.xparse(Int64, "100,000,000,aaa"; groupmark=',')
@test res.code == EOF | INVALID | INVALID_DELIMITER

i.e. testing we code code='INVALID: EOF | INVALID_DELIMITER ' and not testing the value

So whilst it's always possible for someone to come to rely on a bug, i don't think there's any consistent behaviour here that someone could reasonably have been relying upon

Secondly, all tests in CSV.jl, WeakRefStrings.jl, JSON.jl, InlineStrings.jl, PowerFlowData.jl, JSON3.jl pass with this change, which gives some further validation that the old behaviour wasn't relied upon (at least as far as these test suites reflect usage).

Since breaking changes are not entirely free (all the dependency packages need to update, even though they don't require any code changes), i'll mark this as a new feature (we now support groupmark==delim with consistent semantics).

Copy link

codecov bot commented Nov 8, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d1c6fc5) 88.36% compared to head (d7e15c6) 88.42%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #182      +/-   ##
==========================================
+ Coverage   88.36%   88.42%   +0.06%     
==========================================
  Files          10       10              
  Lines        1882     1892      +10     
==========================================
+ Hits         1663     1673      +10     
  Misses        219      219              
Files Coverage Δ
src/Parsers.jl 88.55% <100.00%> (+0.59%) ⬆️
src/floats.jl 88.93% <100.00%> (ø)
src/ints.jl 93.54% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

test/runtests.jl Show resolved Hide resolved
Copy link
Collaborator

@Drvi Drvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nickrobinson251 nickrobinson251 changed the title When delim=groupmark=x, treat x as delim When delim=groupmark=x, treat x as delim unless input is quoted Nov 9, 2023
@nickrobinson251 nickrobinson251 marked this pull request as ready for review November 9, 2023 19:15
@nickrobinson251 nickrobinson251 merged commit 0eb5f46 into main Nov 9, 2023
12 checks passed
@nickrobinson251 nickrobinson251 deleted the npr-delim-outweighs-groupmark branch November 9, 2023 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants