
Fixes #302 #374


Merged
merged 35 commits into master from fix-char-encoding on Sep 6, 2016

Conversation

@sushantdhiman (Collaborator) commented Aug 15, 2016

Tasks

  • Test for encoding issues
  • Grab charset and install iconv-lite (see the API sketch after this list)
  • Find the reason for corrupt INSERT INTO statements and fix them
  • Use iconv for parsing strings to the proper MySQL format (cesu8)
  • Prepare a map between charsets and encodings
  • Use iconv for parsing strings to the proper MySQL format (actual mapped charset)
  • Use the charset from config (connection, server) for proper encoding
  • Integration tests for connection and server charset
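
A minimal sketch of the iconv-lite calls these tasks lean on; the 'cesu8' encoding name assumes the CESU-8 support proposed in ashtuchkin/iconv-lite#102 (linked later in this thread):

  var iconv = require('iconv-lite');

  // Decode raw bytes from the wire with a known encoding.
  var bytes = Buffer.from([0xf0, 0x9f, 0x92, 0xa9]); // UTF-8 bytes of 💩
  console.log(iconv.decode(bytes, 'utf8')); // '💩'

  // Encode a JS string as CESU-8, which MySQL's non-mb4 "utf8" expects
  // for non-BMP characters (six bytes instead of four).
  console.log(iconv.encode('💩', 'cesu8').toString('hex')); // 'eda0bdedb2a9'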

@sidorares (Owner)

@sushantdhiman see also mysqljs/mysql#1408 (comment)

( and 488e0e6 )

@sushantdhiman (Collaborator, Author) commented Aug 17, 2016

UTF8MB4_UNICODE_CI is working as expected. In the previous tests the table was created with latin1 and we tried to insert non-latin characters into it, thus resulting in ????

One problem remains though: field names (failing test) are not parsed properly. To solve this I need to access the charset and pass it to readLengthCodedString, where I can use iconv-lite to do the conversion if a charset is passed. The main issue is that packet data is accessed via a serial, offset-based protocol here; I need to access the charset a few steps back here

@sidorares (Owner) commented Aug 18, 2016

"some serial offset based protocol" - this actually helps here ( to overcome the fact that charset is further down packet and we need to calculate it's position before we start decoding schema/catalog/table/orgTable ) The lazy accessor is used to avoid unnecessary buffer->string conversion if schema/catalog/table/orgTable are not used

We still need to use characterSet to decode name, so I suggest we have something like this:

  function readString(buffer, start, end, charset) {
    // add iconv charset conversion and respect the charset parameter;
    // optimise for the common case where it's called with utf8
    return buffer.utf8Slice(start, end);
  }

  // name is always used, don't make it lazy
  // this.name = packet.readLengthCodedString();
  var _nameLength = packet.readLengthCodedNumber();
  var _nameStart = packet.offset;
  packet.offset += _nameLength;

  this._orgNameLength = packet.readLengthCodedNumber();
  this._orgNameStart = packet.offset;
  packet.offset += this._orgNameLength;

  packet.skip(1); //  length of the following fields (always 0x0c)
  this.characterSet = packet.readInt16();

  // now decode name once we have characterSet
  this.name = readString(this._buf, _nameStart, _nameStart + _nameLength, this.characterSet);

  // modify getters code to respect charset as well

  var addString = function (name) {
    Object.defineProperty(ColumnDefinition.prototype, name, { get: function () {
      var start = this['_' + name + 'Start'];
      var end = start + this['_' + name + 'Length'];
      return readString(this._buf, start, end, this.characterSet);
    }});
  };
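
For reference, a hedged sketch of what readString could look like once the conversion is added; encodingFor is a hypothetical charset-id -> iconv-encoding lookup (a map like it is sketched in the next comment), and the ids listed are the common UTF-8 collation ids from MySQL's collation table:

  var iconv = require('iconv-lite');

  // Collation ids that are already UTF-8 on the wire (33 = utf8_general_ci,
  // 45 = utf8mb4_general_ci, 224 = utf8mb4_unicode_ci) - no iconv needed.
  var UTF8_IDS = [33, 45, 224];

  function readString(buffer, start, end, charset) {
    // Fast path: most connections use a UTF-8 collation.
    if (charset === undefined || UTF8_IDS.indexOf(charset) !== -1) {
      return buffer.toString('utf8', start, end);
    }
    return iconv.decode(buffer.slice(start, end), encodingFor(charset));
  }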

@sushantdhiman (Collaborator, Author)

Now I just need to prepare a proper map between charset codes and iconv encodings. This will need a bit of research to see which charset belongs to which encoding.
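
A hedged starting point for such a map; the collation ids come from MySQL's SHOW COLLATION output, and the iconv-lite encoding names are assumptions that would need verifying charset by charset:

  // Partial sketch: MySQL collation id -> iconv-lite encoding name.
  var charsetToEncoding = {
    8: 'latin1',   // latin1_swedish_ci
    28: 'gbk',     // gbk_chinese_ci
    33: 'utf8',    // utf8_general_ci (CESU-8 on the wire for non-BMP chars)
    45: 'utf8',    // utf8mb4_general_ci (real UTF-8)
    63: 'binary',  // binary - no text conversion at all
    224: 'utf8'    // utf8mb4_unicode_ci
  };

  function encodingFor(charset) {
    return charsetToEncoding[charset] || 'utf8';
  }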

@sidorares (Owner) commented Aug 18, 2016

We need to add conversion to the column value itself (probably no need to convert numeric/date strings as they can only contain ASCII):

  return 'packet.readLengthCodedString()';

  return 'packet.readLengthCodedString();';

Also, maybe it's worth creating the decoder only once in the row parser? (might be tricky, as we potentially need one decoder per field - not sure the overhead is worth the complexity, and this is just for non-utf8 users)

https://github.com/ashtuchkin/iconv-lite/blob/69a25dc19c67de1b022ccfafbc5e41c2114f3486/lib/index.js#L36
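
For what it's worth, iconv-lite already caches codec objects internally (the line linked above), so a hedged sketch of the per-field decode can stay simple; encodingFor is the hypothetical map from the previous comment:

  var iconv = require('iconv-lite');

  // iconv-lite caches codecs by encoding name, so calling decode() per
  // field mostly costs a hash lookup rather than codec construction.
  function decodeField(buffer, charset) {
    if (charset === 63) return buffer; // 63 = binary: hand back raw bytes
    return iconv.decode(buffer, encodingFor(charset));
  }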

@sushantdhiman (Collaborator, Author)

After testing with Wireshark and studying the MySQL docs, here are a few points to consider.

MySQL doesn't support UTF8MB4 for table metadata (Their Docs).

All metadata is internally stored as UTF8, so this test will always fail on UTF8MB4, which I think tries to convert the multibyte data to some UTF8 representation and comes up with ????

Using CLI [MySQL 5.5, 'UTF8MB4']
[screenshot: mysql utf8mb4 multibyte select]

Using CLI [MySQL 5.5, 'UTF8']
[screenshot: utf8 cli]

Apart from field metadata, all selects will work fine; everything is handled by MySQL. But since this field metadata conversion is done at the MySQL level, we can't do anything about it.

Even if we connect with the UTF8 charset and switch to UTF8MB4 later on, this effect remains, converting any multibyte data to ???? in all cases (for field metadata with utf8mb4).

I have modified said test so we use a proper field name to access the data.

Need your thoughts on this @sidorares - unless MySQL starts handling this case properly, can we do anything about it?

@sidorares (Owner)

@dougwilson - it would be really great if you could add your opinion; I think you have much more experience on the mysql side.

@sushantdhiman I'll try to have a close look tomorrow and give some feedback.

@sidorares (Owner)

@sushantdhiman not sure if the metadata docs (Their Docs) are still valid. What I see is that when UTF8_GENERAL_CI is set, a SELECT "💩" query returns a [ TextRow { '💩': '💩' } ] row, and both the column name and the column content are transmitted as the 4-byte utf8 character 0xF0 0x9F 0x92 0xA9. When UTF8MB4_UNICODE_CI is used, the column data is sent as the same 4 bytes, but the metadata (column name) is replaced with the ? character.
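
In code form (grounded in the behaviour just described; the output comments mirror what the two charsets produce):

  var mysql = require('mysql2');

  var conn = mysql.createConnection({ charset: 'UTF8_GENERAL_CI' });
  conn.query('SELECT "💩"', function (err, rows) {
    console.log(rows); // [ TextRow { '💩': '💩' } ] - name and value intact
  });

  // With charset: 'UTF8MB4_UNICODE_CI' the same query logs
  // [ TextRow { '?': '💩' } ] - the value survives, the column name does not.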

@sushantdhiman (Collaborator, Author)

That's what bugs me too. Isn't UTF8MB4 supposed to be a superset of UTF8? Then why is the metadata returned (converted) as ???

@sidorares (Owner)

I would expect the opposite; this is how I read the documentation:

  • MySQL UTF8 is not actually UTF8; it's a subset of utf8 for code points encoded with up to 3 bytes
  • UTF8MB4 is "normal", standard utf8
  • Field names are still not allowed to contain chars outside of the BMP (3-byte encoding), whether it's UTF8 or UTF8MB4

The reality seems to be very different.

@sidorares (Owner)

then why is the metadata returned (converted) as ???

This is kind of expected as per https://dev.mysql.com/doc/refman/5.6/en/charset-metadata.html (it's converted to ? because the character is outside of the BMP).

What I don't understand is why the column name is not converted to ??? when the UTF8_GENERAL_CI charset is selected.

@sushantdhiman (Collaborator, Author)

What I don't understand is why the column name is not converted to ??? when the UTF8_GENERAL_CI charset is selected

Exactly - why is UTF8 able to convert it but UTF8MB4 isn't? It's the exact opposite, if we consider that UTF8MB4 is capable of handling 4 bytes where UTF8 isn't.

Maybe in the case of UTF8 everything is already in 3 bytes, so the data fits easily; in the case of UTF8MB4 they try to convert the 4-byte representation to 3 bytes, resulting in loss of data, thus giving us ???

@sidorares (Owner)

Maybe in the case of UTF8 everything is already in 3 bytes

No. I thought it might split it into two 2-byte surrogates, but later realised that this is for utf-16 only. The bytes actually sent over the wire are 0xF0 0x9F 0x92 0xA9 for one single unicode character.

@sidorares (Owner)

I'll try to test this against older versions of the server.

@dougwilson (Collaborator)

Very odd indeed, though UTF-8 in MySQL is a mess in some regards... I have never personally even tried to use non-ASCII identifiers, partly because it's easier, and partly because I'm just English-centric :) It does seem odd that UTF8MB4 would not work in column metadata, though UTF8 would. If it helps as well: in MySQL, what is called "UTF8" is actually the "CESU-8" encoding, and what is called "UTF8MB4" is actually the "UTF-8" encoding.

Both support characters outside the BMP just fine (not sure about metadata), though of course if the charset in MySQL is set to UTF8, the driver will need to send CESU-8 encoded bytes rather than UTF-8 encoded bytes, or non-BMP chars will get mangled. UTF8MB4 provides better support for non-BMP chars in that it understands them as one character instead of two different characters (JavaScript, C#, and some other langs also consider non-BMP chars to be two characters).

@sidorares (Owner)

Here is what I think is going on:

  1. With utf8_unicode_ci, when a character is outside of the BMP the server treats it like an unknown stream of 4 (or more) bytes. When received back on the node side, those 4 bytes are assembled back into a single codepoint. This applies to data and metadata, which is why field-name chars are preserved.
  2. With utf8mb4_unicode_ci the server is aware of characters outside of the BMP and treats them as individual codepoints. For metadata, characters outside of the BMP are replaced with ? (0x3f) because of some internal restrictions.

Example:

var connection = require('mysql2').createConnection({
  charset: 'utf8mb4_unicode_ci'
});
connection.query('SELECT CHAR_LENGTH("𝌆") ', (err, rows, fields) => console.log(rows, fields));

returns (note that CHAR_LENGTH("𝌆") is 1):

[ TextRow { 'CHAR_LENGTH("?")': 1 } ] [ { catalog: 'def',
    schema: '',
    name: 'CHAR_LENGTH("?")',
    orgName: '',
    table: '',
    orgTable: '',
    characterSet: 63,
    columnLength: 10,
    columnType: 8,
    flags: 129,
    decimals: 0 } ]

While the same script, but with the utf8_unicode_ci charset connection option, results in:

[ TextRow { 'CHAR_LENGTH("𝌆")': 4 } ] [ { catalog: 'def',
    schema: '',
    name: 'CHAR_LENGTH("𝌆")',
    orgName: '',
    table: '',
    orgTable: '',
    characterSet: 63,
    columnLength: 10,
    columnType: 8,
    flags: 129,
    decimals: 0 } ]

Note that the CHAR_LENGTH returned is 4, the same as LENGTH would return.

With this in mind, utf8mb4_unicode_ci does look like the safer default option.

@sidorares (Owner)

@dougwilson

Both support characters outside the BMP just fine (not sure about metadata), though of course if the charset in MySQL is set to UTF8, the driver will need to send CESU-8 encoded bytes rather than UTF-8 encoded bytes, or non-BMP chars will get mangled

I don't think this is true; it looks like when the non-mb4 utf8 charset is selected and a character is encoded using more than 3 octets, the server gets confused and just treats them as 4 or more characters.

@dougwilson (Collaborator)

That is because you are not encoding properly over the wire. You need to encode in CESU-8 when using the UTF8 charset.

@sidorares (Owner)

You mean when the data is sent to the server? Because when I receive it back, it looks like normal utf8.

@sidorares (Owner)

ashtuchkin/iconv-lite#102 :)

@dougwilson (Collaborator)

In CESU-8, a non-BMP character is encoded as six bytes, not four, because it is encoded as a pair of three-byte sequences. CESU-8 is a dumb encoding used in the late 90s/early 00s that made its way into various DBs like MySQL. They later went back and defined a charset that uses real UTF-8 and called it UTF8MB4. But otherwise, when using the UTF8 charset, you need to use CESU-8 encoding over the wire.
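
A worked example of those six bytes, computed by hand: the code point is split into its UTF-16 surrogate pair, and each surrogate is then encoded as a regular 3-byte UTF-8 sequence:

  // CESU-8 encoding of a single non-BMP code point (6 bytes total).
  function cesu8Encode(codePoint) {
    var v = codePoint - 0x10000;
    var hi = 0xd800 + (v >> 10);   // high surrogate, 0xD83D for U+1F4A9
    var lo = 0xdc00 + (v & 0x3ff); // low surrogate, 0xDCA9
    function utf8of(u) { // 3-byte UTF-8 encoding of a BMP code unit
      return Buffer.from([0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f)]);
    }
    return Buffer.concat([utf8of(hi), utf8of(lo)]);
  }

  console.log(cesu8Encode(0x1f4a9).toString('hex')); // 'eda0bdedb2a9'
  // Plain UTF-8 (MySQL's UTF8MB4) uses 4 bytes instead: f0 9f 92 a9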

@dougwilson (Collaborator)

lol, yea, that iconv-lite issue was for talking to an existing MySQL server that contained CESU-8 data :)

@sidorares (Owner) commented Sep 3, 2016

What I see now: cesu8 works OK (rows and fields) for non-BMP chars if read and written by the node client. The mysql command line client fails to read them (set names utf8 or set names utf8mb4).

utf8mb4 works for data (written by the node client, read by the mysql command line client) but not for metadata (converted to ? server side).

Looks like the only practical advice for new data: use UTF8MB4 for the connection encoding but never use non-BMP chars in field names. Old data: figure out what's there; if you have cesu-8 encoded chars connect with UTF8_GENERAL_CI, otherwise connect as UTF8MB4_UNICODE_CI.

@sidorares (Owner)

@sushantdhiman I think this is ready. Maybe you can have one final look just to be sure, and I'll merge it (or you merge).

@sushantdhiman (Collaborator, Author)

👏 👏 🎉

@sushantdhiman merged commit b53f6e2 into master on Sep 6, 2016
@sushantdhiman deleted the fix-char-encoding branch on September 6, 2016 05:44
@sushantdhiman (Collaborator, Author)

@sidorares rc-12 is not on npm yet, right?

@sidorares (Owner)

@sushantdhiman it is now; I wanted the tag to pass CI before publishing.

I think this is the last rc before 1.0.0 - this encoding stuff was the main reason I was holding the release. Hopefully after this we'll have more frequent patch/major versions.
