Skip to content

substr index calculation does not match JavaScript implementation #27

@panzi

Description

@panzi

Assuming the JavaScript implementation if JsonLogic is normative, then the substr implementation in this library is wrong. This is because in JavaScript strings are stored as UTF-16 in memory and indexing happens on UTF-16 code unis. Meaning unicode code points that need two UTF-16 code units can be individually addressed in JavaScript. A string with a single such character has the length of 2 in JavaScript.

To achieve compatibility you would need to convert the string to UTF-16 (Vec<u16>), perform the operations on that, and then convert it back. However, this now means the JsonLogic could create invalid unicode data when it only cuts out a single surrogate of a surrogate pair. Rust doesn't allow strings broken like that (String::from_utf16() returns a Result and String::from_utf16_lossy() may return a string different to what would happen in JavaScript).

I don't know what best to do, but in any case this needs to be documented so that users are aware of any difference to the JavaScript implementation.

Another even bigger problem (that should be fixed more easily): The substr implementation in this library used String::len(), which returns the number of bytes in the string! Strings in Rust are UTF-8 and as such this implementation would already fail on my last name ("Panzenböck"). I.e. it fails on any non-7-bit-ASCII characters. Instead of that string.chars().len() (or something more intelligent that doesn't iterate all chars twice) should be used. This has nothing to do with the UTF-16 problem from above, but without that this implementation isn't if consistent in itself.

Yet another problem: Type coercionrules don't match those of the JavaScript implementation.
There you can use any garbage as the indices and it will convert the garbage into 0. And
strings of numbers just work like the numbers they represent and arrays with a single element
just act like their single element etc. This will probably also be a problem in other operations, but I have only checked substr.

Example 1:

Diverging UTF-16 handling.

Rule:

{"substr": [{"var":"s"},1]}

Data:

{"s":"\uD80C\uDC00"}

JsonLogic playground output:

"\udc00"

json-logic-rs output:

""

Example 2:

Broken non-ASCII character handling.

Rule:

{"substr": [{"var":"s"},0,-2]}

Data:

{"s":"äöü"}

JsonLogic playground output:

"ä"

json-logic-rs output:

"äöü"

Example 3:

Diverging type coercion.

Rule:

{"substr": [{"var":"s"},[1]]}

Data:

{"s":"abc"}

JsonLogic playground output:

"bc"

json-logic-rs output:

Error: Could not execute logic

Caused by:
    Invalid argument for 'substr' - 'Array([Number(1)])', reason: Second argument to substr must be a number

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions