Fixes #302 #374
Conversation
@sushantdhiman see also mysqljs/mysql#1408 (comment) (and 488e0e6)

Although one problem remains,

"some serial offset based protocol" - this actually helps here (to overcome the fact that the charset is further down the packet and we need to calculate its position before we start decoding schema/catalog/table/orgTable). The lazy accessor is used to avoid unnecessary buffer->string conversion if schema/catalog/table/orgTable are not used. We still need to use

function readString(buffer, start, end, charset) {
  // add iconv charset conversion and respect the charset parameter.
  // Optimise in a way that assumes it's called with utf8 most of the time.
  return buffer.utf8Slice(start, end);
}
// name is always used, don't make it lazy
// this.name = packet.readLengthCodedString();
var _nameLength = packet.readLengthCodedNumber();
var _nameStart = packet.offset;
packet.offset += _nameLength;
this._orgNameLength = packet.readLengthCodedNumber();
this._orgNameStart = packet.offset;
packet.offset += this._orgNameLength;
packet.skip(1); // length of the following fields (always 0x0c)
this.characterSet = packet.readInt16();
// now decode name once we have characterSet
this.name = readString(this._buf, _nameStart, _nameStart + _nameLength, this.characterSet);

// modify getters code to respect charset as well
var addString = function (name) {
  Object.defineProperty(ColumnDefinition.prototype, name, {
    get: function () {
      var start = this['_' + name + 'Start'];
      var end = start + this['_' + name + 'Length'];
      return readString(this._buf, start, end, this.characterSet);
    }
  });
};
Now I just need to prepare a proper map between charset codes and iconv encodings; this one will need a bit of research to see which charset belongs to which encoding.
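A minimal sketch of what such a map could look like. The collation ids used below (8 = latin1_swedish_ci, 33 = utf8_general_ci, 45 = utf8mb4_general_ci, 63 = binary) are from MySQL's collation table; the helper name, fallback behaviour, and the latin1 stand-in for real iconv-lite decoding are assumptions for illustration:

```javascript
// Sketch only: map a few MySQL collation ids to encoding names.
// 8 = latin1_swedish_ci, 33 = utf8_general_ci,
// 45 = utf8mb4_general_ci, 63 = binary.
const charsetToEncoding = {
  8: 'latin1',
  33: 'utf8',
  45: 'utf8',
  63: 'binary',
};

function readString(buffer, start, end, charset) {
  const encoding = charsetToEncoding[charset] || 'utf8';
  if (encoding === 'utf8') {
    // fast path: most connections are utf8, no iconv round-trip needed
    return buffer.toString('utf8', start, end);
  }
  // a real implementation would call
  // iconv.decode(buffer.slice(start, end), encoding) here
  return buffer.toString('latin1', start, end);
}

console.log(readString(Buffer.from('name'), 0, 4, 33)); // 'name'
```

The fast path matters because the overwhelming majority of connections use utf8, and `Buffer#toString('utf8')` avoids any extra allocation.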
we need to add conversion to the column value itself (probably no need to convert numeric/date strings as they can only contain ascii): see node-mysql2/lib/compile_text_parser.js line 149 and node-mysql2/lib/compile_binary_parser.js line 153 (at b2010f9).

Also maybe worth creating the decoder only once in the row parser? (might be tricky as we potentially need one decoder per field - not sure if the overhead is worth the complexity, and this is just for non-utf8 users)
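The "create the decoder once" idea could look roughly like this; a sketch only, where the names are illustrative and the latin1 branch is a stand-in for a real iconv-lite decoder:

```javascript
// Sketch: memoize one decode function per encoding so the row parser
// does not rebuild it for every field. Non-utf8 decoding is stubbed
// with latin1 here; a real version would delegate to iconv-lite.
const decoderCache = new Map();

function getDecoder(encoding) {
  let decode = decoderCache.get(encoding);
  if (!decode) {
    decode = encoding === 'utf8'
      ? (buf, start, end) => buf.toString('utf8', start, end)
      : (buf, start, end) => buf.toString('latin1', start, end);
    decoderCache.set(encoding, decode);
  }
  return decode;
}

// the same function object is reused across rows:
console.log(getDecoder('utf8') === getDecoder('utf8')); // true
```

Keyed by encoding rather than by field, this keeps the cache tiny while still covering the per-field-charset case the comment worries about.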
(force-pushed from c0a8535 to 98800c5)
After testing with Wireshark and studying the MySQL docs, here are a few points to consider. MySQL doesn't support
Every piece of meta data is internally stored in the form of
Using CLI [MySQL 5.5, 'UTF8MB4']
Using CLI [MySQL 5.5, 'UTF8']
Apart from field meta data, all selection will work fine; everything is handled by MySQL. But this field meta data conversion is done at the MySQL level, so we can't do anything about it. Even if we connect with
I have modified said test so we use a proper field name to access the data. Need your thoughts on this @sidorares. Unless MySQL starts handling this case properly, is there anything we can do about it?
@dougwilson - it would be really great if you could add your opinion, I think you have much more experience on the mysql side. @sushantdhiman I'll try to have a close look tomorrow and give some feedback.
@sushantdhiman not sure if the meta data docs (their docs) are still valid. What I see when UTF8_GENERAL_CI is set
That's what bugs me too, isn't
I would expect the opposite; this is how I read the documentation.
The reality seems to be very different.
This is kind of expected as per https://dev.mysql.com/doc/refman/5.6/en/charset-metadata.html (it's converted to ? because the character is outside of the BMP). What I don't understand is why the column name is not converted to ??? when the UTF8_GENERAL_CI charset is selected.
Exactly, why is UTF8 able to convert it but UTF8MB4 isn't? It's the exact opposite of what you'd expect, given that UTF8MB4 is capable of handling 4 bytes where UTF8 isn't. Maybe in the case of UTF8 everything is already in 3 bytes so the data fits easily; in the case of UTF8MB4 they try to convert the 4-byte representation to 3 bytes, resulting in loss of data, thus giving us
No, I thought it might split it into a pair of two-byte surrogates but later realised that this is for utf-16 only. And the bytes actually sent over the wire are 0xF0 0x9F 0x92 0xA9 for one single unicode character.
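The four bytes quoted above are easy to check in Node directly (this is just a verification snippet, not part of the driver):

```javascript
// U+1F4A9 (one character outside the BMP) really is four octets in UTF-8:
const bytes = Buffer.from('\u{1F4A9}', 'utf8');
console.log(bytes.toString('hex')); // 'f09f92a9'
console.log(bytes.length);          // 4
```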
I'll try to test this against older versions of the server.
iconv-lite encodings list: https://github.com/ashtuchkin/iconv-lite/wiki/Supported-Encodings
Very odd indeed, though UTF-8 in MySQL is a mess in some regards... I have never personally even tried to use non-ASCII identifiers, partly because it's easier, and partly because I'm just English-centric :) It does seem odd that UTF8MB4 would not work in column metadata, though UTF8 would.

If it helps as well: in MySQL, what is called "UTF8" is actually the "CESU-8" encoding and what is called "UTF8MB4" is actually the "UTF-8" encoding. Both support characters outside the BMP just fine (not sure on metadata), though of course if the charset in MySQL is set to UTF8 then the driver will need to send CESU-8 encoded bytes rather than UTF-8 encoded bytes or non-BMP chars will get mangled. UTF8MB4 provides better support for non-BMP chars in that it understands them as one character instead of two different characters (JavaScript, C#, and some other langs also consider non-BMP chars to be two characters).
here is what I think is going on:
Example:

var connection = require('mysql2').createConnection({
  charset: 'utf8mb4_unicode_ci'
});
connection.query('SELECT CHAR_LENGTH("𝌆")', (err, rows, fields) => console.log(rows, fields));

returns (note that CHAR_LENGTH("𝌆") is 1)
While the same script, but with
Note that the CHAR_LENGTH returned is 4, the same as LENGTH would return. With this in mind
I don't think this is true; it looks like when a non-mb4 utf8 charset is selected and a character is encoded using more than 3 octets, the server gets confused and just treats them as 4 or more characters.
That is because you are not encoding properly over the wire. You need to encode in CESU-8 when using the UTF8 charset.
You mean when the data is sent to the server? Because when I receive it back it looks like normal utf8.
In CESU-8, a non-BMP character is encoded as six bytes, not four. This is because it is encoded as a pair of three-byte sequences. CESU-8 is a dumb encoding used in the late 90s/early 00s and made its way into various dbs like MySQL. They went back and defined a charset that uses UTF-8 and called it UTF8MB4. But otherwise, when using the UTF8 charset, you need to use CESU-8 encoding over the wire.
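To make the six-byte point concrete, here is a minimal CESU-8 encoder sketch (the function name is illustrative, not mysql2's API): each UTF-16 code unit, surrogate halves included, is encoded as if it were a BMP code point.

```javascript
// Sketch: CESU-8 encodes each UTF-16 code unit independently, so a
// surrogate pair becomes two three-byte sequences (6 bytes total).
function cesu8Encode(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i); // surrogate halves come through as-is
    if (u < 0x80) {
      bytes.push(u);
    } else if (u < 0x800) {
      bytes.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f));
    } else {
      // 0xD800-0xDFFF (surrogate halves) land here too - the CESU-8 quirk
      bytes.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f));
    }
  }
  return Buffer.from(bytes);
}

// U+1F4A9: 4 bytes in real UTF-8 (f0 9f 92 a9), 6 bytes in CESU-8:
console.log(cesu8Encode('\u{1F4A9}').toString('hex')); // 'eda0bdedb2a9'
```

ASCII and BMP characters encode identically to UTF-8; only characters outside the BMP differ, which is exactly why mixing the two encodings goes unnoticed until a non-BMP character appears.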
lol, yeah, that iconv-lite issue was about talking to an existing MySQL server that contained CESU-8 data :)
What I see now: cesu8 works OK (rows and fields) for non-BMP chars if read and written by the node client; the mysql command line client fails to read them. utf8mb4 works for data (written by the node client, read by the mysql command line client) but not for metadata (converted to ? server side).

Looks like the only practical advice for new data: use UTF8MB4 for the connection encoding but never use non-BMP chars in field names. For old data: figure out what's there; if you have cesu-8 encoded chars, connect with UTF8_GENERAL_CI, otherwise connect as UTF8MB4_UNICODE_CI.
@sushantdhiman I think this is ready. Maybe you can have one final look just to be sure and I'll merge it (or you merge).
👏 👏 🎉 |
@sidorares |
@sushantdhiman it is now, wanted the tag to pass CI before publishing. I think this is the last rc before 1.0.0 - this encoding stuff was the main reason I was holding the release; hopefully after this we'll have more frequent patch/major versions.
Tasks
- charset and install iconv-lite
- INSERT INTO and fixing them
- cesu8