Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding tests for 4 byte unicode characters #17978

Merged
merged 15 commits into from
Feb 7, 2017

Conversation

MorrisJobke
Copy link
Contributor

  • success on SQLite and Postgres
  • failure on MySQL due to the limited charset that only supports up to 3 bytes

Use case:

2015-07-30_10-55-52

People will name their folders/files with emojis. This then will cause errors like this on mysql:

An exception occurred while executing 'UPDATE `oc_filecache` SET `storage` = ?, `path` = ?, `path_hash` = ?, `name` = ?, `parent` =? WHERE `fileid` = ?' with params ["1", "files\/\ud83d\udca9.txt", "743923d586454a40e08523f558ae2955", "\ud83d\udca9.txt", "2", 3]: SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xF0\x9F\x92\xA9.t...' for column 'path' at row 1

cc @Raydiation @PVince81 @DeepDiver1975

ref #4513

@DeepDiver1975 DeepDiver1975 added this to the backlog milestone Jul 30, 2015
@DeepDiver1975
Copy link
Member

failure on MySQL due to the limited charset that only supports up to 3 bytes

as discussed on IRC: this is by design - nothing we can do.

Solution for apps: encode the data
Solution for filenames: 4byte chars are not allowed

@MorrisJobke
Copy link
Contributor Author

Solution for filenames: 4byte chars are not allowed

I don't think that this is a solution. I also don't like the "solution" to simply not allow : in filenames just because Windows can't handle it. It's just stupid to get driven by the easiest way. 🙈 🙊 🙉

If there is technically no way to do this on MySQL/MariaDB we should add a big warning on installation/ in the docs to not use MySQL. Better sooner than later.

But first of all I try to play around with char sets within MySQL.

@LukasReschke
Copy link
Member

That this works at all is another bug: #17981

@DeepDiver1975
Copy link
Member

I also don't like the "solution" to simply not allow : in filenames just because Windows can't handle it. It's just stupid to get driven by the easiest way. 🙈 🙊 🙉

This is not longer the case since 8.1 - #14346

@scrutinizer-notifier
Copy link

A new inspection was created.

@DeepDiver1975
Copy link
Member

HAHA ... you failed! @MorrisJobke You did not listen to me on IRC!
👎 👎 👎 👎 👎 👎 👎

@MorrisJobke
Copy link
Contributor Author

HAHA ... you failed! @MorrisJobke You did not listen to me on IRC!

Please be a bit more precise... I tried to address all your concerns. 😫

@DeepDiver1975
Copy link
Member

HAHA ... you failed! @MorrisJobke You did not listen to me on IRC!

Okay - I'm taking this one back .....

@MorrisJobke
Copy link
Contributor Author

We discussed this a lot and that was the result:

Pro

  • we can handle properly 4 byte characters in the DB
  • we can drop the limitation to not accept 4 byte characters in file names
  • apps doesn't need to take care of 4 byte character on their own, because the core handles it
  • we have a plan to migrate at some point in time to a proper DB encoding/collation in MySQL by default
  • MySQL doesn't have any disadvantage compared to Sqlite/Postgres anymore (both support already 4 byte charavters)

Cons

  • with the config option this is nearly untested/unused code paths that tend to be really critical once they get activated
  • we don't want to have yet another experimental feature switch
  • even more complex testing (more resources are needed)
  • switching this on by default is not an option - some web hoster instances could be get to a "not supported anymore" state because they cannot tweak their MySQL this much
  • we don't support 4 bytes character in every column (news, music app already have bug reports for this)

Current situation

  • we need to document this somewhere very visible - I can imagine that the usage of 4 byte characters (especially emojis) will increase (in filenames, titles, ...) and then having an instance where this is simply not possible by design is really bad

@butonic
Copy link
Member

butonic commented Jul 30, 2015

Hmmm, what happens when my instance supports 4byte chars (because I may be using postgres) and I share a folder to someone who is running an oc instance that does not have this activated?

@BernhardPosselt
Copy link
Contributor

What does this mean? Will this fix handle the issue in news and music?

we don't support 4 bytes character in every column (news, music app already have bug reports for this)

@MorrisJobke
Copy link
Contributor Author

What does this mean? Will this fix handle the issue in news and music?

It would if the admin sets up mysql correctly (see sample.config.php) and enables mysql.utf8mb4 in the config.php.

@DeepDiver1975
Copy link
Member

rebased

@MorrisJobke I'm willing to accept this in case we can come up with an autotest.sh firing up a proper docker

@MorrisJobke MorrisJobke self-assigned this Nov 12, 2015
@DeepDiver1975 DeepDiver1975 mentioned this pull request Nov 16, 2015
28 tasks
@karlitschek
Copy link
Contributor

Do we have a gut feeling how most mysql servers out there are configured by default?

@DeepDiver1975 DeepDiver1975 modified the milestones: 9.0-current, backlog Nov 19, 2015
@DeepDiver1975
Copy link
Member

@MorrisJobke any time available on your side to take care about the docker image and autotest.sh ??

just let me know - THX

@ghost ghost mentioned this pull request Dec 30, 2015
@tschloeter
Copy link

I followed these multibyte UTF-8 issues for a while now and I am still interested in the solution. As I see, some people have worked on that. Some time ago, someone suggested a new config parameter that told OC that the MySQL server is configured to use utf8mb4 which I would appreciate.

Anyway, at the moment I see so many open and closed issues on the topic that I don't understand what the status is.

Could someone summarize whether the issue is fixed in the most recent version or if it is planned to fix in near future? I am still stuck on my dirty-hack patched 8.0 installation and would like to update to 8.2.x.

If there is no fix available: Is it easily possible to migrate from MySQL to Postgres? In the manual I only found how to migrate from SQLite.

Thanks in advance.

Thomas

@ghost
Copy link

ghost commented Jan 26, 2016

I would also be interested in a summary of this issue, I get the impression that this will be resolved in version 9?

Would moving to PostgreSQL be a good interim solution?

@PVince81
Copy link
Contributor

PVince81 commented Feb 3, 2017

I have added ROW_FORMAT and the command runs now:

Change collation for oc_addressbookchanges ...
Change collation for oc_addressbooks ...
Change collation for oc_appconfig ...
Change collation for oc_authtoken ...
Change collation for oc_calendarchanges ...
Change collation for oc_calendarobjects ...
Change collation for oc_calendars ...
Change collation for oc_calendarsubscriptions ...
Change collation for oc_cards ...
Change collation for oc_cards_properties ...
Change collation for oc_comments ...
Change collation for oc_comments_read_markers ...
Change collation for oc_credentials ...
Change collation for oc_dav_shares ...
Change collation for oc_external_applicable ...
Change collation for oc_external_config ...
Change collation for oc_external_mounts ...
Change collation for oc_external_options ...
Change collation for oc_file_locks ...
Change collation for oc_filecache ...
Change collation for oc_files_trash ...
Change collation for oc_group_admin ...
Change collation for oc_group_user ...
Change collation for oc_groups ...
Change collation for oc_jobs ...
Change collation for oc_migrations ...
Change collation for oc_mimetypes ...
Change collation for oc_mounts ...
Change collation for oc_preferences ...
Change collation for oc_privatedata ...
Change collation for oc_properties ...
Change collation for oc_schedulingobjects ...
Change collation for oc_share ...
Change collation for oc_share_external ...
Change collation for oc_storages ...
Change collation for oc_systemtag ...
Change collation for oc_systemtag_group ...
Change collation for oc_systemtag_object_mapping ...
Change collation for oc_trusted_servers ...
Change collation for oc_users ...
Change collation for oc_vcategory ...
Change collation for oc_vcategory_to_object ...

The result is that the schema is not 100% the same.

  • oc_federated_reshares still has ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;, seems the table was omitted from the list
  • oc_migrations still has collation ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

All other tables look fine

I suspect that the select * from information_schema is somehow omitting oc_federated_shares because that table has no string columns, only int. But for the sake of DB schema consistency, we should also convert that one.

@PVince81
Copy link
Contributor

PVince81 commented Feb 3, 2017

So the oc_federated_reshares column have no collation info since they are no strings:

MariaDB [owncloud]> select * from information_schema.columns where table_name='oc_federated_reshares' and table_schema='owncloud';
+---------------+--------------+-----------------------+-------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+---------------+--------------------+--------------------+----------------+-------------+------------+-------+---------------------------------+-------------------------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME            | COLUMN_NAME | ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE | DATA_TYPE | CHARACTER_MAXIMUM_LENGTH | CHARACTER_OCTET_LENGTH | NUMERIC_PRECISION | NUMERIC_SCALE | DATETIME_PRECISION | CHARACTER_SET_NAME | COLLATION_NAME | COLUMN_TYPE | COLUMN_KEY | EXTRA | PRIVILEGES                      | COLUMN_COMMENT                |
+---------------+--------------+-----------------------+-------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+---------------+--------------------+--------------------+----------------+-------------+------------+-------+---------------------------------+-------------------------------+
| def           | owncloud     | oc_federated_reshares | share_id    |                1 | NULL           | NO          | int       |                     NULL |                   NULL |                10 |             0 |               NULL | NULL               | NULL           | int(11)     | PRI        |       | select,insert,update,references |                               |
| def           | owncloud     | oc_federated_reshares | remote_id   |                2 | NULL           | NO          | int       |                     NULL |                   NULL |                10 |             0 |               NULL | NULL               | NULL           | int(11)     |            |       | select,insert,update,references | share ID at the remote server |
+---------------+--------------+-----------------------+-------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+---------------+--------------------+--------------------+----------------+-------------+------------+-------+---------------------------------+-------------------------------+

We should improve the query to also select from the information_schema.tables:

MariaDB [owncloud]> select * from information_schema.tables where table_name='oc_federated_reshares' and table_schema='owncloud';
+---------------+--------------+-----------------------+------------+--------+---------+------------+------------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME            | TABLE_TYPE | ENGINE | VERSION | ROW_FORMAT | TABLE_ROWS | AVG_ROW_LENGTH | DATA_LENGTH | MAX_DATA_LENGTH | INDEX_LENGTH | DATA_FREE | AUTO_INCREMENT | CREATE_TIME         | UPDATE_TIME | CHECK_TIME | TABLE_COLLATION | CHECKSUM | CREATE_OPTIONS | TABLE_COMMENT |
+---------------+--------------+-----------------------+------------+--------+---------+------------+------------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------------+
| def           | owncloud     | oc_federated_reshares | BASE TABLE | InnoDB |      10 | Compact    |          0 |              0 |       16384 |               0 |            0 |         0 |           NULL | 2017-02-03 18:44:34 | NULL        | NULL       | utf8_bin        |     NULL |                |               |
+---------------+--------------+-----------------------+------------+--------+---------+------------+------------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------------+

@PVince81
Copy link
Contributor

PVince81 commented Feb 3, 2017

Hmm, this doesn't do anything:

alter table oc_federated_reshares convert to character set utf8mb4 collate utf8mb4_bin;

The table keeps the wrong collation.
While it might be fine short term, if someone tries to add new string columns to it there is a slight risk that they will have the wrong collation.

@PVince81
Copy link
Contributor

PVince81 commented Feb 3, 2017

In general I'm fine merging this first, but we need to address the conversion issue directly afterwards.

@DeepDiver1975
Copy link
Member

The table keeps the wrong collation.
While it might be fine short term, if someone tries to add new string columns to it there is a slight risk that they will have the wrong collation.

my assumption is that as soon as a new column is added the collation will be adopted - but we better test this

@DeepDiver1975 DeepDiver1975 merged commit a7d1121 into master Feb 7, 2017
@DeepDiver1975 DeepDiver1975 deleted the testing-characters-in-db branch February 7, 2017 12:57
@PVince81
Copy link
Contributor

PVince81 commented Feb 7, 2017

my assumption is that as soon as a new column is added the collation will be adopted - but we better test this

@DeepDiver1975 did you test it and confirmed that it works like this ?

@DeepDiver1975
Copy link
Member

did you test it and confirmed that it works like this ?

not yet

@davitol
Copy link
Contributor

davitol commented Mar 7, 2017

1. Set a emoji as comment
2. Refresh the screen and look up the comment

Actual Behaviour:
The comment is not shown

oC Version: 10.0 Master Branch
BBDD: mariadb

Note: I knew later that it was needed to add some parameters to my.cnf file in order to let emojis

@lock
Copy link

lock bot commented Aug 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants