Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expand mangles unicode characters stored in JSON column #57

Closed
ViktorNacht opened this issue Mar 1, 2019 · 55 comments
Closed

expand mangles unicode characters stored in JSON column #57

ViktorNacht opened this issue Mar 1, 2019 · 55 comments
Assignees

Comments

@ViktorNacht
Copy link

I can create a working example, but just a quick heads-up that I'm getting mangled unicode when I use the expand function selecting data from a JSON column (technically I was using expand(1)).

I confirmed this by removing expand and then de-serializing it with j and the unicode characters remained in-tact.

FYI, my DSN is something like:

mysql://foo:bar@db.us-west-2.rds.amazonaws.com/table_name?mysql_enable_utf8mb4=1

And I don't know if it matters, but I'm inserting the data by serializing it with j first. I haven't figured out how to use json with SQL::Abstract::mysql

@Grinnz
Copy link
Contributor

Grinnz commented Mar 1, 2019

Serializing with j is incorrect. Since the query and parameters are being encoded to UTF-8 already, you want to use to_json so as not to double-encode. Alternatively, use j or encode_json and bind it as a SQL_BLOB as shown in https://metacpan.org/pod/Mojo::mysql::Database#query. But you might as well just use the json generator.

@ViktorNacht
Copy link
Author

Cool. Not to disagree with you, but to make sure I understand what's happening.

I'm using $db->insert() because passing a hash reference with the column names and values is better for my application. Because I didn't see a way to use the json generator without using $db->query, I encoded the json column with j, and then inserted with $db->insert().

Selecting the json column with the mysql CLI and via a graphical interface both show the ñ character correctly, so it appears it was not double-encoded.

Only when I use ->expand does it get mangled. Using $db->select() and j to deserialize returns the proper character.

What I am missing I this process? I'm just curious how using j in and j out works fine, but mixing j with ->expand doesn't.

Thank you for your time, I appreciate the learning process.

@Grinnz
Copy link
Contributor

Grinnz commented Mar 1, 2019

It is hard to debug encoding issues because the tools you use to debug also work in a certain encoding. When you select (with utf8mb4 enabled) and then use j to deserialize, both of those steps decoded, so it successfully double-decodes the double encoded data.

@Grinnz
Copy link
Contributor

Grinnz commented Mar 1, 2019

You are correct that you can't use the json generator with the SQL::Abstract functions. SQL::Abstract::Pg added this feature (https://metacpan.org/pod/SQL::Abstract::Pg#JSON) but it hasn't been added to SQL::Abstract or SQL::Abstract::mysql yet.

@jhthorsen
Copy link
Owner

@ViktorNacht: You might want to have a look at #51. I'll merge the changes once @Tekki is happy with the implementation. Any feedback would be very much appreciated, so please take the PR for a spin.

@Tekki
Copy link
Collaborator

Tekki commented Mar 2, 2019

I'm using $db->insert() because passing a hash reference with the column names and values is better for my application.

The idea with my changes is that you won't have to JSON encode and decode anymore, just work with Perl hash and array references.
There are already some issues with ->expand (see #52), so it's possible you discovered a new one. If you can give us an example, we can include it in the tests.

@ViktorNacht
Copy link
Author

I installed #51 and I loved the fact it automagically $db->insert[ed] my arrayref of hashrefs as JSON. That's so awesome.

Using the mysql CLI and graphical interface the unicode characters were correctly inserted.

Using $result->expand(1) gave me the same mangled results as before. I also tried removing ->expand to see if it would auto-decode and it did not. I still had to use j.

I couldn't tell from the docs if there's a syntax I'm supposed to use to select an entire json column as an array/hashref, rather than the nest data.

@Tekki
Copy link
Collaborator

Tekki commented Mar 4, 2019

I'll keep an eye on the encoding when I modify the expand function.

@Tekki
Copy link
Collaborator

Tekki commented Mar 4, 2019

Until now I can't find encoding issues with ->expand. Replaced 'supergirl' with 'süpergörl' in t/json.t, both tests 'name as string' and 'name as hash' pass.
To what Grinnz said earlier I could add: Don't trust what you see in the console, don't trust what some tools show you, they may be too clever and correct the encoding on the fly.
Write a script with an editor that gives you full control over the encoding, use Mojo::Base or utf8 at the beginning. And to be sure everything works, don't take a word with Latin-1 characters, as this is the fallback character set in most of the cases, but something more exotic, like the Hungarian 'szőlőlé' (grape juice).
Two examples for output to a UTF-8 console: the following

use Mojo::Base -strict;
say 'szőlőlé';

produces either broken characters or a 'wide character' warning. The next one

use Mojo::Base -strict;
use Mojo::Util 'encode';
say encode 'UTF-8', 'szőlőlé';

prints the correct word.

@Tekki Tekki self-assigned this Mar 4, 2019
@ViktorNacht
Copy link
Author

For sure, I included grape juice :) in my test data and get the same results.

So if I use #51 to insert the data (by letting the module convert to JSON), and then I use $result->expand(1) and still get mangled unicode, where would you suggest looking for issues? There's no j or double-encoding (that I'm aware of). Would running the tests in #51 possibly help?

Keep in mind, all other unicode looks perfect, so it's not an encoding issue with the HTML. It's definitely the roundtrip of the JSON to and from the database.

@Tekki
Copy link
Collaborator

Tekki commented Mar 5, 2019

The question is if you get this error with the published code. #51 is still an experiment. Can you show us how to reproduce this issue, please?

@jhthorsen
Copy link
Owner

@ViktorNacht: There's really no way we can reproduce what you are saying without a small example application or a small test. Saying "unicode looks perfect" sounds to us as you're using your eyes, which isn't proof of anything.

@ViktorNacht
Copy link
Author

ViktorNacht commented Mar 7, 2019

Wow, you're right about not trusting your eyes. This is a small app I made mostly for myself to test the different ways of inserting and selecting JSON, and my results. This issue is probably specific to my system, so I have to be pragmatic and go with what works. But if this is helpful, or you suggest further testing that could be helpful, that's cool.

#!/usr/bin/env perl
use Mojolicious::Lite -signatures;
use Mojo::mysql;
use Mojo::JSON 'j';

app->secrets(['Poop']);

helper mysql => sub {
  state $mysql = Mojo::mysql->new('mysql://foo:bar@aws.rds.us-west-2.rds.amazonaws.com/table_name?mysql_enable_utf8mb4=1');
};

get '/query' => sub ($c) {
  my $drink = 'Szőlőlé';

  my $id = $c->mysql->db->query('INSERT INTO unicode_json_test SET my_json = ?', {json => {drink => $drink}})->last_insert_id;

  my $unicode_json_test = $c->mysql->db->select(unicode_json_test => '' => { id => $id })->expand(1)->hash;

  say $unicode_json_test->{my_json}->{drink} eq $drink ? 'Matches' : 'Fail';

  say $unicode_json_test->{my_json}->{drink};

  $c->render(text => $unicode_json_test->{my_json}->{drink});
};

get '/j' => sub ($c) {
  my $drink = 'Szőlőlé';

  my $object = j { drink => $drink };

  my $id = $c->mysql->db->insert(unicode_json_test => { my_json => $object })->last_insert_id;

  my $unicode_json_test = $c->mysql->db->select(unicode_json_test => '' => { id => $id })->hash;

  my $my_json = j $unicode_json_test->{my_json};

  say $my_json->{drink} eq $drink ? 'Matches' : 'Fail';

  say $my_json->{drink};

  $c->render(text => $my_json->{drink});
};

app->start;

'/query' uses the internal engine of the module, while '/j' using j going in and out of the DB.

The results are

  1. Results of comparison using eq
  2. saying to STDOUT
  3. Rendering to STDOUT using 'text=>'

With /query:

# ./unicode.pl get /query
Fail
Szőlőlé
Sz�l�l

With /j:

# ./unicode.pl get /j
Matches
Wide character in say at ./unicode.pl line 39.
Szőlőlé
Szőlőlé

(I understand why the wide character warning is there, I just left it for completeness)

@Grinnz
Copy link
Contributor

Grinnz commented Mar 7, 2019

To diagnose your issue, what is the column type and charset of the my_json column?

@ViktorNacht
Copy link
Author

CREATE TABLE `unicode_json_test` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `my_json` json NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=25 DEFAULT CHARSET=utf8mb4;

(the column charset is the default charset)

@Grinnz
Copy link
Contributor

Grinnz commented Mar 7, 2019

I suspect this is the intersection of two DBD::mysql bugs. First the well-known encoding bug where it uses input strings incorrectly as parameters, you could verify this by running utf8::upgrade $object; before inserting it, I suspect this would result in the same output as the first test. utf8::upgrade does not modify the string, only the internal representation, and that this affects DBD::mysql is a bug which is one of the primary reasons for the DBD::MariaDB fork.

That would leave us with why mysql_enable_utf8mb4 is not decoding the JSON column output, which I suspect is another bug. I am curious if DBD::MariaDB exhibits the same behavior there.

@ViktorNacht
Copy link
Author

you could verify this by running utf8::upgrade $object; before inserting it, I suspect this would result in the same output as the first test.

And you'd be 100% correct:

# ./unicode.pl get /j
Fail
Szőlőlé
Sz�l�l

@Grinnz
Copy link
Contributor

Grinnz commented Mar 7, 2019

Since DBD::MariaDB support is now merged, could you try with Mojo::mysql 1.13, DBD::MariaDB installed, and mariadb:// as the protocol?

(You would also need to remove the mysql_enable_utf8mb4 option, this is default behavior in DBD::MariaDB)

@ViktorNacht
Copy link
Author

ViktorNacht commented Mar 7, 2019 via email

@Grinnz
Copy link
Contributor

Grinnz commented Mar 7, 2019

Using the same server. DBD::MariaDB is just an unfortunate naming choice for the fork.

@ViktorNacht
Copy link
Author

FYI, using 'mariadb://' with 'mysql_enable_utf8mb4' removed:

# ./unicode.pl get /query 
Fail
Szőlőlé
Sz�l�lé

# ./unicode.pl get /j 
Fail
Szőlőlé
Sz�l�l

@Grinnz
Copy link
Contributor

Grinnz commented Mar 8, 2019

I unfortunately don't have access to a new enough MySQL or MariaDB server to test this, but to try to isolate the response case further to a query that could demonstrate the problem for bug reports, what happens if you compare the response of something like this? SELECT JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4)) AS my_json

@ViktorNacht
Copy link
Author

get '/convert' => sub ($c) {
  my $drink = 'Szőlőlé';

  my $unicode_json_test = $c->mysql->db->query(q"SELECT JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4)) AS my_json")->expand(1)->hash;

  say $unicode_json_test->{my_json}->{drink} eq $drink ? 'Matches' : 'Fail';

  say $unicode_json_test->{my_json}->{drink};

  $c->render(text => $unicode_json_test->{my_json}->{drink});
};
Matches
Wide character in say at ./unicode.pl line 52.
Szőlőlé
Szőlőlé

@Grinnz
Copy link
Contributor

Grinnz commented Mar 8, 2019

Interesting, so it decodes that correctly but not from the JSON column... To make sure it's selecting from the column that's the problem, can you try one last thing: INSERT INTO unicode_json_test SET my_json = JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4)) then selecting the column as you did before?

@ViktorNacht
Copy link
Author

get '/insert' => sub ($c) {
  my $drink = 'Szőlőlé';

  my $id = $c->mysql->db->query(q"INSERT INTO unicode_json_test SET my_json = JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4))")->last_insert_id;

  my $unicode_json_test = $c->mysql->db->select(unicode_json_test => '' => { id => $id })->expand(1)->hash;

  say $unicode_json_test->{my_json}->{drink} eq $drink ? 'Matches' : 'Fail';

  say $unicode_json_test->{my_json}->{drink};

  $c->render(text => $unicode_json_test->{my_json}->{drink});
};
# ./unicode.pl get /insert
Fail
Szőlőlé
Sz�l�l

FYI:

> SELECT VERSION();
+------------+
| VERSION()  |
+------------+
| 5.7.22-log |
+------------+

@Tekki
Copy link
Collaborator

Tekki commented Mar 8, 2019

@ViktorNacht, @Grinnz thanks to both of your for the additional information. I still had some difficulties to reproduce it with t/json.t until I realized that the 'supergirl' tests are made on the name instead of the j column. Instead it should be tested on line 18 - 23:

$db->query('insert into mojo_json_test (id, name, j) values (?, ?, ?)', $$, $0, {json => {drink => 'szőlőlé'}});

is $db->query('select json_type(j) from mojo_json_test')->array->[0],             'OBJECT', 'json_type';
is $db->query('select json_extract(j, "$.drink") from mojo_json_test')->array->[0], '"szőlőlé"',     'json_extract';
is_deeply $db->query('select id, name, j from mojo_json_test where json_extract(j, "$.drink") = "szőlőlé"')->expand->hash,
  {id => $$, name => $0, j => {drink => 'szőlőlé'}}, 'expand json';

I see no difference if I use the 'enable...' flags or not, but I have set everything to 'utf8mb4' on the server. Change the query in the last test to select id, name, json_extract(j, "$") as j... and it works.

@pali
Copy link

pali commented Mar 8, 2019

Hi! If you think that DBD::MariaDB does not correctly return JSON columns as Unicode strings, can you please apply following patch for DBD::MariaDB and provide output for SELECT which just returns that JSON column?

diff --git a/dbdimp.c b/dbdimp.c
index 264a2f14..035c8fed 100644
--- a/dbdimp.c
+++ b/dbdimp.c
@@ -5782,6 +5782,7 @@ process:
           /* TEXT columns can be returned as MYSQL_TYPE_BLOB, so always check for charset */
           if (mysql_charsetnr_is_utf8(fields[i].charsetnr))
             sv_utf8_decode(sv);
+          printf("i=%u name=%s charset=%u type=%d value=%s\n", i, fields[i].name, fields[i].charsetnr, (int)fields[i].type, col);
           break;
         }
       }

It prints to stdout information about column, charset and type.

@pali
Copy link

pali commented Mar 8, 2019

See my comment in perl5-dbi/DBD-MariaDB#142 (comment) . I do not see anything wrong with current behavior of DBD::MariaDB. UTF-8 JSONs are correctly decoded.

@Tekki
Copy link
Collaborator

Tekki commented Mar 8, 2019

@pali and all the others: I've no time to look at this before Sunday.

@Tekki
Copy link
Collaborator

Tekki commented Mar 10, 2019

Here is a test, similar to those from Mojo::mysql, but using DBD::mysql and DBD::MariaDB directly:

#!/usr/bin/env perl
use strict;
use warnings;

use DBI;
use Test::More;
use utf8;

plan skip_all => 'TEST_ONLINE=mysql://root@/test' unless $ENV{TEST_ONLINE};

my ($driver, $username, $password, $host, $port, $dbname) = $ENV{TEST_ONLINE} =~ m|(.+)://(.+):(.*)@(.*):(.*)/(.+)|;

my $dsn = qq|DBI:$driver:$dbname;host=$host;port=$port|;
note $dsn;

my $dbh = DBI->connect($dsn, $username, $password);

eval {
  $dbh->do('CREATE TABLE IF NOT EXISTS unicode_json_test(my_json JSON)');
  $dbh->do('TRUNCATE TABLE unicode_json_test');
  $dbh->do(
    q|INSERT INTO unicode_json_test SET my_json = JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4))|
  );
} or do {
  plan skip_all => $@;
};

is $dbh->selectrow_array('SELECT my_json FROM unicode_json_test'), q|{"drink": "Szőlőlé"}|;

$dbh->do('drop table unicode_json_test') unless $ENV{TEST_KEEP_DB};

done_testing;

Message if failed:

#   Failed test at t/grapejuice.t line 29.
#          got: '{"drink": "Sz� l� lé"}'
#     expected: '{"drink": "Szőlőlé"}'
# Looks like you failed 1 test of 1.

Overview over the results:

db DBD::mysql 4.050 DBD::MariaDB 1.21
MySQL 5.7 FAIL FAIL
MySQL 8.0 FAIL FAIL
MariaDb 10.3 FAIL PASS
MariaDB 10.4 FAIL PASS

@Tekki
Copy link
Collaborator

Tekki commented Mar 10, 2019

@jhthorsen Seems we have to deal with the situation that the drivers return inconsistent encodings for JSON columns. We have to test for 'szőlőlé' wherever possible.
If I change the above test like this

my $drink = $dbh->selectrow_array('SELECT my_json FROM unicode_json_test');
utf8::decode $drink;
is $drink, q|{"drink": "Szőlőlé"}|;

it succeeds in all 8 situations.

@pali
Copy link

pali commented Mar 10, 2019

Note that DBD::mysql has a Unicode Bug which is never going to be fixed (perl5-dbi/DBD-mysql#117) due to legacy reasons and it is reason why DBD::MariaDB fork was created. So if you need correct Unicode support, just do not use DBD::mysql... It would never work correctly and just cause problems, like double encoding or damanged data.

@Tekki, can you apply patch from #57 (comment) to the DBD::MariaDB and provide output for your tests? From your testing it looks like that DBD::MariaDB is correctly working with MariaDB server, but incorrectly with MySQL server. Also it proves the observation (that DBD::MariaDB with MariaDB server is working fine) from perl5-dbi/DBD-MariaDB#142

@Tekki
Copy link
Collaborator

Tekki commented Mar 10, 2019

We will solve all problems in Mojo::mysql and nobody will use DBI modules directly anymore. 💪 😄
(Maybe, we will see...)

@Tekki
Copy link
Collaborator

Tekki commented Mar 10, 2019

@Tekki, can you apply patch from #57 (comment) to the DBD::MariaDB and provide output for your tests?

db output
MySQL 5.7 i=0 name=my_json charset=63 type=245 value={"drink": "Szőlőlé"}
MySQL 8.0 i=0 name=my_json charset=63 type=245 value={"drink": "Szőlőlé"}
MariaDB 10.3 i=0 name=my_json charset=45 type=252 value={"drink": "Szőlőlé"}
MariaDb 10.4 i=0 name=my_json charset=45 type=252 value={"drink": "Szőlőlé"}

@pali
Copy link

pali commented Mar 10, 2019

@Grinnz wrote:

... MySQL documents that when JSON is interacted with as a string it is treated as the utf8mb4 charset ...

And @Tekki wrote:

MySQL 8.0 charset=63

Charset 63 is binary, not utf8mb4. Charset 45 is utf8mb4. So either MySQL documentation or @Grinnz's quote is not truth. MySQL 8.0 returned binary data, not UTF-8 data.

@pali
Copy link

pali commented Mar 10, 2019

But MariaDB server returns correctly charsrt 45 (utf8mb4). Therefore this is clear bug in MySQL server as it does not return charset for UTF-8 column correctly. Could you report bug to MySQL server bugtracker.

@Tekki
Copy link
Collaborator

Tekki commented Mar 11, 2019

@pali can you give me link where they say JSON should return utf8mb4? I only find this that I understand in a way that binary could be correct.

@Grinnz
Copy link
Contributor

Grinnz commented Mar 11, 2019

"Strings produced by converting JSON values have a character set of utf8mb4 and a collation of utf8mb4_bin" This means that the returned JSON will be returned as a utf8mb4 string (and apparently, with the output of JSON_OBJECT() or similar it is, but not with a JSON column)

@Tekki
Copy link
Collaborator

Tekki commented Mar 11, 2019

It will be returned as utf8mb4 if we convert it to a string type, but it will remain what it is if we don't convert it. JSON values on MySQL are objects, on MariaDB they are strings.
Is it completely wrong what I say?

@pali
Copy link

pali commented Mar 11, 2019

MYSQL protocol support just strings.

@pali
Copy link

pali commented Mar 19, 2019

@Tekki The main problem is that MySQL say that returned string is not in UTF-8 (but in binary), see: #57 (comment)

So DBD::MariaDB driver does not treat returned string from MySQL as Unicode string.
This is clear bug in MySQL server, so I would suggest to report it to the MySQL bugtracker.

It does not matter how are values stored on server (or disk). The only thing which matter is how server send them to the client (DBD::MariaDB) and what is the charset which server send to client.

Client (DBD::MariaDB) does not see nor know how are values stored on server. It just see information which server send to the client. And if server send wrong information then client of course cannot process it correctly.

@Tekki
Copy link
Collaborator

Tekki commented Mar 22, 2019

This situation is a bit unfortunate. We have contact to the developers of DBD::mysql and DBD::MariaDB and can decide which version needs to be installed. But if it comes to errors in MySQL, it's getting difficult. First of all I'm not the one who can discuss technical details with the developers on bugs.mysql.com. I don't understand enough of these protocols to decide if there is really an error in MySQL or if the differences are due to the different implementation of JSON in MySQL and MariaDB. Then, even if there is an error and they correct it, we still have no influence on the version that runs on the machines of our users. Most of them will simply install what is shipped with their distribution.
My opinion is still that we here in Mojo::mysql should find a way to deal with the inconsistent encoding of the JSON fields.

@pali
Copy link

pali commented Mar 22, 2019

@Tekki Hacking Mojo::mysql is not a solution. Once upstream decide to fix it, you would again have broken Mojo::mysql and in very bad irreparable way -- new version starts returning double encoded/decoded strings... Hacking server errors on client side, without knowledge of when and how is going problem fixed on server is way to the hell. You need to know how upstream fix server and based on this details implement workaround for older server. But this cannot be done without communication with upstream server.

@jhthorsen
Copy link
Owner

I disagree @pali.

I don't get how we're not able to do fix it in the client. We could even have code that checks the server version if we have to.

@pali
Copy link

pali commented Mar 22, 2019

You add workaround in client; server gets fixed and all existing code which uses client would be broken after upgrading server. People would not be happy to use such client which is going to be periodically broken by such changes...

But main question: Why you are going to workaround server problems without even communicating with server developers and coordinate what is correct workaround for particular problem? This does not make much sense.

@jhthorsen
Copy link
Owner

@pali: Because our users might not be able to upgrade to the latest and greatest version of the server.

When that is said, I'm not going to fix this, but I'm backing @Tekki's comment above.

@Tekki: What do you think we should do? Should we close this as "Won't fix"? This thread is getting too long, and I think it's diverging. (And at least I'm having difficult sorting the information)

@Grinnz
Copy link
Contributor

Grinnz commented Mar 22, 2019

I do not think it is feasible for Mojo::mysql to completely handle this unfortunate situation. Aside from if this actually gets fixed in MySQL, MariaDB (server) does not appear to have this bug, so the workaround would be wrong to apply in that case. There is no way to tell that you are getting json column data back rather than some other binary data (or even JSON stored in a blob column). I also do not know if anyone has tested with DBD::mysql and MariaDB as the server.

@Grinnz
Copy link
Contributor

Grinnz commented Jun 7, 2019

FYI: I have opened a bug against MySQL here, but I still have no way to reproduce or test it myself. https://bugs.mysql.com/bug.php?id=95698

@Tekki
Copy link
Collaborator

Tekki commented Jun 8, 2019

Thanks @Grinnz, will be interesting to read their answer.

We should all be able to test these things, we'll have to find a solution for that.

@Tekki
Copy link
Collaborator

Tekki commented Jun 10, 2019

The MySQL bug is now set to 'verified'.

@pali
Copy link

pali commented Jun 10, 2019

What does verified mean? That it was accepted as a bug? Or as it was already fixed and verified that fix is correct?

@Grinnz
Copy link
Contributor

Grinnz commented Jun 10, 2019

Verified that the bug exists, I believe.

@jhthorsen
Copy link
Owner

I believe this conversation has gone stale. Please let open a fresh issue if changes should be made in Mojo::mysql.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants