Skip to content

Metadata incompatibility between single-byte charsets and UTF8 charset #8614

Open
@arkinform

Description

@arkinform

In our application we have huge databases with thousands of tables in WIN1251 charset. Recently we decided to add full unicode support for our applications and we have encountered several "blocking" issues:

  1. Record size limit. We have a lot of tables with many varchar fields and in UTF8 they start to "reserve" 4x more space comparing with single-byte charset, and many tables can't be created in UTF8 database because of the record size limit. Hopefully it's going to be removed in 6.0 Remove/increase the record size limit #7332

  2. In UTF8 we can't use varchar with more than 8191 chars limit. We can argue a lot about how it can be replaced with blobs, refactored, etc. but the point is that in huge legacy systems it's very hard to implement such changes. Especially if you have hundreds on-premise installations with thousands custom reports, integrations, etc.

  3. Indexes on long varchars can't be created. For example, with single byte charset it's possible to create index on varchar(1024), but not with UTF8 charset. Of course, we are not going to search using full length, but we need to search "starting" and keep storing long strings, sometimes in healthcare systems it's really needed. Splitting into several fields or creating calculated indexes are not always feasible solutions. @dyemanov mentioned that there are some ideas how it can be "fixed". It would be useful to have a link to the task on GitHub.

So, let's focus on the main blocking issue which is still relevant in 6.0 - varchar limit 8091 in UTF8 charset. First obvious approach is to increase internal varchar byte limit for UTF8 charset keeping current 32k byte limit for single-byte charsets. It is the best solution which ensures metadata compatibility and equal char limits between different charsets.

If increasing internal varchar byte limit is not technically possible in the current state of Firebird project, then dynamic byte limit can be considered as a temporary measure until Firebird has such technical capabilities. For example, with dynamic limit you can store up to 32k latin characters in UTF8 varchar, but only 16k cyrillic characters in the same varchar, etc. In real practical cases this will increase effective varchar limit for most European languages with mixed latin characters usage from 8k to more than 16k chars. It's much more acceptable for Unicode migration in many cases and at least at the declaration level it makes database DDL completely interchangeable between single-byte charsets and UTF8 charset.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions