
Fixes #302 #374


Merged
merged 35 commits into master from fix-char-encoding on Sep 6, 2016

Conversation

@sushantdhiman (Collaborator) commented Aug 15, 2016

Tasks

  • Test for encoding issues
  • Grab charset and install iconv-lite (see the API sketch after this list)
  • Find the reason for corrupt INSERT INTO statements and fix them
  • Use iconv for parsing strings to the proper MySQL format (cesu8)
  • Prepare a map between charsets and encodings
  • Use iconv for parsing strings to the proper MySQL format (actual mapped charset)
  • Use the charset from config (connection, server) for proper encoding
  • Integration tests for connection and server charset
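
A minimal sketch of the iconv-lite calls these tasks lean on; the 'cesu8' encoding name assumes the CESU-8 support proposed in ashtuchkin/iconv-lite#102 (linked later in this thread):

  var iconv = require('iconv-lite');

  // Decode raw bytes from the wire with a known encoding.
  var bytes = Buffer.from([0xf0, 0x9f, 0x92, 0xa9]); // UTF-8 bytes of 💩
  console.log(iconv.decode(bytes, 'utf8')); // '💩'

  // Encode a JS string as CESU-8, which MySQL's non-mb4 "utf8" expects
  // for non-BMP characters (six bytes instead of four).
  console.log(iconv.encode('💩', 'cesu8').toString('hex')); // 'eda0bdedb2a9'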

@sidorares (Owner)

@sushantdhiman see also mysqljs/mysql#1408 (comment)

( and 488e0e6 )

@sushantdhiman (Collaborator, Author) commented Aug 17, 2016

UTF8MB4_UNICODE_CI is working as expected. In the previous tests the table was created with latin1 and we tried to insert non-latin characters into it, thus resulting in ????

One problem remains though: field names (failing test) are not parsed properly. To solve this I need to access the charset and pass it to readLengthCodedString, where I can use iconv-lite to do the conversion if a charset is passed. The main issue is that packet data is accessed via a serial, offset-based protocol here; I need to access the charset a few steps back here

@sidorares (Owner) commented Aug 18, 2016

"some serial offset based protocol" - this actually helps here ( to overcome the fact that charset is further down packet and we need to calculate it's position before we start decoding schema/catalog/table/orgTable ) The lazy accessor is used to avoid unnecessary buffer->string conversion if schema/catalog/table/orgTable are not used

We still need to use characterSet to decode name, so I suggest we have something like this:

  function readString(buffer, start, end, charset) {
    // add iconv charset conversion and respect the charset parameter;
    // optimise for the common case where it's called with utf8
    return buffer.utf8Slice(start, end);
  }

  // name is always used, don't make it lazy
  // this.name = packet.readLengthCodedString();
  var _nameLength = packet.readLengthCodedNumber();
  var _nameStart = packet.offset;
  packet.offset += _nameLength;

  this._orgNameLength = packet.readLengthCodedNumber();
  this._orgNameStart = packet.offset;
  packet.offset += this._orgNameLength;

  packet.skip(1); //  length of the following fields (always 0x0c)
  this.characterSet = packet.readInt16();

  // now decode name once we have characterSet
  this.name = readString(this._buf, _nameStart, _nameStart + _nameLength, this.characterSet);

  // modify getters code to respect charset as well

  var addString = function (name) {
    Object.defineProperty(ColumnDefinition.prototype, name, { get: function () {
      var start = this['_' + name + 'Start'];
      var end = start + this['_' + name + 'Length'];
      return readString(this._buf, start, end, this.characterSet);
    }});
  };
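
For reference, a hedged sketch of what readString could look like once the conversion is added; encodingFor is a hypothetical charset-id -> iconv-encoding lookup (a map like it is sketched in the next comment), and the ids listed are the common UTF-8 collation ids from MySQL's collation table:

  var iconv = require('iconv-lite');

  // Collation ids that are already UTF-8 on the wire (33 = utf8_general_ci,
  // 45 = utf8mb4_general_ci, 224 = utf8mb4_unicode_ci) - no iconv needed.
  var UTF8_IDS = [33, 45, 224];

  function readString(buffer, start, end, charset) {
    // Fast path: most connections use a UTF-8 collation.
    if (charset === undefined || UTF8_IDS.indexOf(charset) !== -1) {
      return buffer.toString('utf8', start, end);
    }
    return iconv.decode(buffer.slice(start, end), encodingFor(charset));
  }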

@sushantdhiman (Collaborator, Author)

Now I just need to prepare a proper map between charset codes and iconv encodings. This will need a bit of research to see which charset belongs to which encoding.
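
A hedged starting point for such a map; the collation ids come from MySQL's SHOW COLLATION output, and the iconv-lite encoding names are assumptions that would need verifying charset by charset:

  // Partial sketch: MySQL collation id -> iconv-lite encoding name.
  var charsetToEncoding = {
    8: 'latin1',   // latin1_swedish_ci
    28: 'gbk',     // gbk_chinese_ci
    33: 'utf8',    // utf8_general_ci (CESU-8 on the wire for non-BMP chars)
    45: 'utf8',    // utf8mb4_general_ci (real UTF-8)
    63: 'binary',  // binary - no text conversion at all
    224: 'utf8'    // utf8mb4_unicode_ci
  };

  function encodingFor(charset) {
    return charsetToEncoding[charset] || 'utf8';
  }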

@sidorares (Owner) commented Aug 18, 2016

We need to add conversion to the column value itself (probably no need to convert numeric/date strings as they can only contain ASCII):

  return 'packet.readLengthCodedString()';

  return 'packet.readLengthCodedString();';

Also, maybe it's worth creating the decoder only once in the row parser? (might be tricky, as we potentially need one decoder per field - not sure the overhead is worth the complexity, and this is just for non-utf8 users)

https://github.com/ashtuchkin/iconv-lite/blob/69a25dc19c67de1b022ccfafbc5e41c2114f3486/lib/index.js#L36
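
For what it's worth, iconv-lite already caches codec objects internally (the line linked above), so a hedged sketch of the per-field decode can stay simple; encodingFor is the hypothetical map from the previous comment:

  var iconv = require('iconv-lite');

  // iconv-lite caches codecs by encoding name, so calling decode() per
  // field mostly costs a hash lookup rather than codec construction.
  function decodeField(buffer, charset) {
    if (charset === 63) return buffer; // 63 = binary: hand back raw bytes
    return iconv.decode(buffer, encodingFor(charset));
  }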

@sushantdhiman (Collaborator, Author)

After testing with Wireshark and studying the MySQL docs, here are a few points to consider.

MySQL doesn't support UTF8MB4 for table metadata (Their Docs).

All metadata is internally stored as UTF8, so this test will always fail on UTF8MB4, which I think tries to convert the multibyte data to some UTF8 representation and comes up with ????

Using CLI [MySQL 5.5, 'UTF8MB4']
[screenshot: mysql utf8mb4 multibyte select]

Using CLI [MySQL 5.5, 'UTF8']
[screenshot: utf8 cli]

Apart from field metadata, all selects will work fine; everything is handled by MySQL. But since this field metadata conversion is done at the MySQL level, we can't do anything about it.

Even if we connect with the UTF8 charset and switch to UTF8MB4 later on, this effect remains, converting any multibyte data to ???? in all cases (for field metadata with utf8mb4).

I have modified said test so we use a proper field name to access the data.

Need your thoughts on this @sidorares - unless MySQL starts handling this case properly, can we do anything about it?

@sidorares (Owner)

@dougwilson - it would be really great if you could add your opinion; I think you have much more experience on the mysql side.

@sushantdhiman I'll try to have a close look tomorrow and give some feedback.

@sidorares (Owner)

@sushantdhiman not sure if the metadata docs (Their Docs) are still valid. What I see is that when UTF8_GENERAL_CI is set, a SELECT "💩" query returns a [ TextRow { '💩': '💩' } ] row, and both the column name and the column content are transmitted as the 4-byte utf8 character 0xF0 0x9F 0x92 0xA9. When UTF8MB4_UNICODE_CI is used, the column data is sent as the same 4 bytes, but the metadata (column name) is replaced with the ? character.
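
In code form (grounded in the behaviour just described; the output comments mirror what the two charsets produce):

  var mysql = require('mysql2');

  var conn = mysql.createConnection({ charset: 'UTF8_GENERAL_CI' });
  conn.query('SELECT "💩"', function (err, rows) {
    console.log(rows); // [ TextRow { '💩': '💩' } ] - name and value intact
  });

  // With charset: 'UTF8MB4_UNICODE_CI' the same query logs
  // [ TextRow { '?': '💩' } ] - the value survives, the column name does not.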

@sushantdhiman (Collaborator, Author)

That's what bugs me too. Isn't UTF8MB4 supposed to be a superset of UTF8? Then why is the metadata returned (converted) as ???

@sidorares (Owner)

I would expect the opposite; this is how I read the documentation:

  • MySQL UTF8 is not actually UTF8; it's a subset of utf8 for code points encoded with up to 3 bytes
  • UTF8MB4 is "normal", standard utf8
  • Field names are still not allowed to contain chars outside of the BMP (3-byte encoding), whether it's UTF8 or UTF8MB4

The reality seems to be very different.

@sidorares (Owner)

then why is the metadata returned (converted) as ???

This is kind of expected as per https://dev.mysql.com/doc/refman/5.6/en/charset-metadata.html (it's converted to ? because the character is outside of the BMP).

What I don't understand is why the column name is not converted to ??? when the UTF8_GENERAL_CI charset is selected.

@sushantdhiman (Collaborator, Author)

What I don't understand is why the column name is not converted to ??? when the UTF8_GENERAL_CI charset is selected

Exactly - why is UTF8 able to convert it but UTF8MB4 isn't? It's the exact opposite, if we consider that UTF8MB4 is capable of handling 4 bytes where UTF8 isn't.

Maybe in the case of UTF8 everything is already in 3 bytes, so the data fits easily; in the case of UTF8MB4 they try to convert the 4-byte representation to 3 bytes, resulting in loss of data, thus giving us ???

@sidorares (Owner)

Maybe in the case of UTF8 everything is already in 3 bytes

No. I thought it might split it into two 2-byte surrogates, but later realised that this is for utf-16 only. The bytes actually sent over the wire are 0xF0 0x9F 0x92 0xA9 for one single unicode character.

@sidorares (Owner)

I'll try to test this against older versions of the server.

@dougwilson (Collaborator)

Very odd indeed, though UTF-8 in MySQL is a mess in some regards... I have never personally even tried to use non-ASCII identifiers, partly because it's easier, and partly because I'm just English-centric :) It does seem odd that UTF8MB4 would not work in column metadata, though UTF8 would. If it helps as well: in MySQL, what is called "UTF8" is actually the "CESU-8" encoding, and what is called "UTF8MB4" is actually the "UTF-8" encoding.

Both support characters outside the BMP just fine (not sure about metadata), though of course if the charset in MySQL is set to UTF8, the driver will need to send CESU-8 encoded bytes rather than UTF-8 encoded bytes, or non-BMP chars will get mangled. UTF8MB4 provides better support for non-BMP chars in that it understands them as one character instead of two different characters (JavaScript, C#, and some other langs also consider non-BMP chars to be two characters).

@sidorares (Owner)

Here is what I think is going on:

  1. With utf8_unicode_ci, when a character is outside of the BMP the server treats it like an unknown stream of 4 (or more) bytes. When received back on the node side, those 4 bytes are assembled back into a single codepoint. This applies to data and metadata, which is why field-name chars are preserved.
  2. With utf8mb4_unicode_ci the server is aware of characters outside of the BMP and treats them as individual codepoints. For metadata, characters outside of the BMP are replaced with ? (0x3f) because of some internal restrictions.

Example:

var connection = require('mysql2').createConnection({
  charset: 'utf8mb4_unicode_ci'
});
connection.query('SELECT CHAR_LENGTH("𝌆") ', (err, rows, fields) => console.log(rows, fields));

returns (note that CHAR_LENGTH("𝌆") is 1):

[ TextRow { 'CHAR_LENGTH("?")': 1 } ] [ { catalog: 'def',
    schema: '',
    name: 'CHAR_LENGTH("?")',
    orgName: '',
    table: '',
    orgTable: '',
    characterSet: 63,
    columnLength: 10,
    columnType: 8,
    flags: 129,
    decimals: 0 } ]

While the same script, but with the utf8_unicode_ci charset connection option, results in:

[ TextRow { 'CHAR_LENGTH("𝌆")': 4 } ] [ { catalog: 'def',
    schema: '',
    name: 'CHAR_LENGTH("𝌆")',
    orgName: '',
    table: '',
    orgTable: '',
    characterSet: 63,
    columnLength: 10,
    columnType: 8,
    flags: 129,
    decimals: 0 } ]

Note that the CHAR_LENGTH returned is 4, the same as LENGTH would return.

With this in mind, utf8mb4_unicode_ci does look like the safer default option.

@sidorares (Owner)

@dougwilson

Both support characters outside the BMP just fine (not sure about metadata), though of course if the charset in MySQL is set to UTF8, the driver will need to send CESU-8 encoded bytes rather than UTF-8 encoded bytes, or non-BMP chars will get mangled

I don't think this is true; it looks like when the non-mb4 utf8 charset is selected and a character is encoded using more than 3 octets, the server gets confused and just treats them as 4 or more characters.

@dougwilson (Collaborator)

That is because you are not encoding properly over the wire. You need to encode in CESU-8 when using the UTF8 charset.

@sidorares (Owner)

You mean when the data is sent to the server? Because when I receive it back, it looks like normal utf8.

@sidorares (Owner)

ashtuchkin/iconv-lite#102 :)

@dougwilson (Collaborator)

In CESU-8, a non-BMP character is encoded as six bytes, not four, because it is encoded as a pair of three-byte sequences. CESU-8 is a dumb encoding used in the late 90s/early 00s that made its way into various DBs like MySQL. They later went back and defined a charset that uses real UTF-8 and called it UTF8MB4. But otherwise, when using the UTF8 charset, you need to use CESU-8 encoding over the wire.
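
A worked example of those six bytes, computed by hand: the code point is split into its UTF-16 surrogate pair, and each surrogate is then encoded as a regular 3-byte UTF-8 sequence:

  // CESU-8 encoding of a single non-BMP code point (6 bytes total).
  function cesu8Encode(codePoint) {
    var v = codePoint - 0x10000;
    var hi = 0xd800 + (v >> 10);   // high surrogate, 0xD83D for U+1F4A9
    var lo = 0xdc00 + (v & 0x3ff); // low surrogate, 0xDCA9
    function utf8of(u) { // 3-byte UTF-8 encoding of a BMP code unit
      return Buffer.from([0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f)]);
    }
    return Buffer.concat([utf8of(hi), utf8of(lo)]);
  }

  console.log(cesu8Encode(0x1f4a9).toString('hex')); // 'eda0bdedb2a9'
  // Plain UTF-8 (MySQL's UTF8MB4) uses 4 bytes instead: f0 9f 92 a9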

@dougwilson (Collaborator)

lol, yea, that iconv-lite issue was for talking to an existing MySQL server that contained CESU-8 data :)

@sidorares (Owner) commented Sep 3, 2016

What I see now: cesu8 works OK (rows and fields) for non-BMP chars if read and written by the node client. The mysql command line client fails to read them (set names utf8 or set names utf8mb4).

utf8mb4 works for data (written by the node client, read by the mysql command line client) but not for metadata (converted to ? server side).

Looks like the only practical advice for new data: use UTF8MB4 for the connection encoding but never use non-BMP chars in field names. Old data: figure out what's there; if you have cesu-8 encoded chars connect with UTF8_GENERAL_CI, otherwise connect as UTF8MB4_UNICODE_CI.

@sidorares (Owner)

@sushantdhiman I think this is ready. Maybe you can have one final look just to be sure, and I'll merge it (or you merge).

@sushantdhiman (Collaborator, Author)

👏 👏 🎉

@sushantdhiman merged commit b53f6e2 into master on Sep 6, 2016
@sushantdhiman deleted the fix-char-encoding branch on September 6, 2016 05:44
@sushantdhiman (Collaborator, Author)

@sidorares rc-12 is not on npm yet, right?

@sidorares (Owner)

@sushantdhiman it is now; I wanted the tag to pass CI before publishing.

I think this is the last rc before 1.0.0 - this encoding stuff was the main reason I was holding the release. Hopefully after this we'll have more frequent patch/major versions.
