Skip to content

使用pgloader将mysql迁移至pgsql并使用mysql_fdw校验表数据 #48

Open
@n0099

Description

  1. 环境
$ uname -a
Linux azure 6.5.0-1018-azure #19~22.04.2-Ubuntu SMP Thu Mar 21 16:45:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ mysqld --version
/usr/sbin/mysqld  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))
$ /usr/lib/postgresql/16/bin/postgres --version
postgres (PostgreSQL) 16.2 (Ubuntu 16.2-1.pgdg22.04+1)

pgloader@3.6.10
mysql_fdw@2.9.1
库名假定为tbm

  1. 使用某11年历史的commonlisp程序 https://github.com/dimitri/pgloader 可以在多个数据源(其中有着mysql和pgsql)之间导入导出

1.1 由于 dimitri/pgloader#1211 ,从postgre社区apt source安装的pgloader 3.6.10-2.pgdg22.04+1无法连接mysqld

$ apt show -a pgloader
Package: pgloader
Version: 3.6.10-2.pgdg22.04+1
Priority: optional
Section: database
Maintainer: Dimitri Fontaine <dim@tapoueh.org>
Installed-Size: 33.9 MB
Depends: freetds-dev, libc6 (>= 2.34), zlib1g (>= 1:1.1.4), libsqlite3-0, libssl3
Homepage: https://github.com/dimitri/pgloader
Download-Size: 29.3 MB
APT-Sources: https://apt.postgresql.org/pub/repos/apt jammy-pgdg/main amd64 Packages
Description: extract, transform and load data into PostgreSQL
    pgloader imports data from different kind of sources and COPY it into
    PostgreSQL.
    .
    The command language is described in the manual page and allows one to
    describe where to find the data source, its format, and to describe data
    processing and transformation.
    .
    Supported source formats include CSV, fixed width flat files, dBase3 files
    (DBF), and SQLite and MySQL databases. In most of those formats, pgloader
    is able to auto-discover the schema and create the tables and the indexes
    in PostgreSQL. In the MySQL case it's possible to edit CASTing rules from
    the pgloader command directly.

Package: pgloader
Version: 3.6.3-1ubuntu1
Priority: extra
Section: universe/misc
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Dimitri Fontaine <dim@tapoueh.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 33.8 MB
Depends: freetds-dev, libc6 (>= 2.34), zlib1g (>= 1:1.1.4), libssl3
Homepage: https://github.com/dimitri/pgloader
Download-Size: 29.5 MB
APT-Sources: http://azure.archive.ubuntu.com/ubuntu jammy/universe amd64 Packages
Description: extract, transform and load data into PostgreSQL
    pgloader imports data from different kind of sources and COPY it into
    PostgreSQL.
    .
    The command language is described in the manual page and allows one to
    describe where to find the data source, its format, and to describe data
    processing and transformation.
    .
    Supported source formats include CSV, fixed width flat files, dBase3 files
    (DBF), and SQLite and MySQL databases. In most of those formats, pgloader
    is able to auto-discover the schema and create the tables and the indexes
    in PostgreSQL. In the MySQL case it's possible to edit CASTing rules from
    the pgloader command directly.

1.1.1. 建议从源码自行编译3.6.10

sudo apt install -y pgloader sbcl && # 通过安装apt中的pgloader来安装其运行时所需依赖和sbcl commonlisp编译器
git clone https://github.com/dimitri/pgloader &&
cd pgloader &&
git checkout v3.6.10 &&
make DYNSIZE=1024 && # 内存限制
build/bin/pgloader --version && # pgloader version "3.6.af8c3c1" compiled with SBCL 2.1.11.debian
sudo apt purge -y sbcl

1.1.2. 清理:

sudo apt purge -y pgloader &&
sudo apt autoremove --purge

1.2. 由于 dimitri/pgloader#782 ,pgloader只能通过仍在使用mysql 8.0.34中deprecated的mysql_native_password用户登录mysqld

1.2.1. 建议ad-hoc地临时创建一个最小权限(只GRANT库/表级权限)用户用于pgloader而非修改现有用户的default auth plugin

  1. 令文件schema.load
LOAD DATABASE
    FROM mysql://username:password@localhost/tbm
    INTO pgsql://username:password@localhost/tbm

WITH quote identifiers, schema only

INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/
;

2.1. 可以使用INCLUDING ONLY TABLE NAMES MATCHING ~/regexp/, 'exact'来filter要导入的表名

2.2. 令文件data.load

LOAD DATABASE
    FROM mysql://username:password@localhost/tbm
    INTO pgsql://username:password@localhost/tbm

 SET MySQL PARAMETERS -- https://github.com/dimitri/pgloader/issues/999
     net_read_timeout  = '300',
     net_write_timeout = '300'

WITH quote identifiers, data only, on error stop, workers = 2, concurrency = 1, batch rows = 1000, batch size = 8MB, prefetch rows = 10000 -- 酌情tunning https://pgloader.readthedocs.io/en/latest/command.html#batch-behaviour-options

INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/
;

2.3. 分开导入表结构表数据是因为pgloader默认行为是按照指定和mysql://源下默认的CASTING创建目的表结构(除了索引),再COPY导入表数据,最后并行{表索引数量}CREATE INDEX并且不等待其完成而是并行地开始下一个表的过程: https://pgloader.readthedocs.io/en/latest/batches.html

By default, as many CREATE INDEX threads as the maximum number of indexes per table are found in your source schema. It is possible to set the max parallel create index WITH option to another number in case there’s just too many of them to create.

由于未知原因在3.6.af8c3c1WITH max parallel create index 1并没有实际作用仍然会执行上述流程所以不得不分开在WITH schema only.load中先完成CREATE INDEX dimitri/pgloader#1576 以减少io争用

2.3.1. 出于类似的原因也建议先禁用autovacuum

ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
SELECT * FROM pg_settings WHERE name = 'autovacuum';

清理:待pgloader完成后再手动VACUMM FULL https://dba.stackexchange.com/questions/130496/is-it-worth-it-to-run-vacuum-on-a-table-that-only-receives-inserts/130514#130514

ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();
VACUMM FULL;

2.4. WITH quote identifiers是为了方便后续校验两端表数据是否一致时可以假定列名相同,您可以之后再重命名为postgre社区习惯的snake_case以减少满屏幕""(相当于mysql的``

2.5. 视奸正在导入什么表 https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql

SELECT * FROM pg_stat_activity WHERE application_name = 'pgloader';

可根据表大小猜测进度 https://stackoverflow.com/questions/21738408/postgresql-list-and-order-tables-by-size/21738505#21738505

SELECT table_schema, table_name, pg_relation_size('"'||table_schema||'"."'||table_name||'"')::float8 / 1024 / 1024 "MiB"
FROM information_schema.tables
WHERE table_schema = 'tbm'
ORDER BY 3 DESC;
  1. 设有表tbmc_f97650_thread结构
    mysql端:
CREATE TABLE `tbmc_f97650_thread` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `tid` bigint unsigned NOT NULL,
  `threadType` bigint unsigned NOT NULL,
  `stickyType` tinytext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `topicType` tinytext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `isGood` tinyint DEFAULT NULL,
  `title` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `authorUid` bigint NOT NULL,
  `postedAt` int unsigned NOT NULL,
  `latestReplyPostedAt` int unsigned NOT NULL,
  `latestReplierUid` bigint DEFAULT NULL,
  `replyCount` int unsigned DEFAULT NULL,
  `viewCount` int unsigned DEFAULT NULL,
  `shareCount` int unsigned DEFAULT NULL,
  `agreeCount` int DEFAULT NULL,
  `disagreeCount` int DEFAULT NULL,
  `zan` blob,
  `geolocation` blob,
  `authorPhoneType` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `createdAt` int unsigned NOT NULL,
  `updatedAt` int unsigned DEFAULT NULL,
  `lastSeenAt` int unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `tid` (`tid`) USING BTREE,
  KEY `authorUid` (`authorUid`),
  KEY `postTime` (`postedAt` DESC)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm_dev.tbmc_f97650_thread (
  id bigint primary key not null default nextval('tbmc_f97650_thread_id_seq'::regclass),
  tid bigint not null,
  "threadType" bigint not null,
  "stickyType" text,
  "topicType" text,
  "isGood" smallint,
  title text not null,
  "authorUid" bigint not null,
  "postedAt" bigint not null,
  "latestReplyPostedAt" bigint not null,
  "latestReplierUid" bigint,
  "replyCount" bigint,
  "viewCount" bigint,
  "shareCount" bigint,
  "agreeCount" integer,
  "disagreeCount" integer,
  zan bytea,
  geolocation bytea,
  "authorPhoneType" text,
  "createdAt" bigint not null,
  "updatedAt" bigint,
  "lastSeenAt" bigint
);
create index "idx_23173_authorUid" on tbmc_f97650_thread using btree ("authorUid");
create index "idx_23173_postTime" on tbmc_f97650_thread using btree ("postedAt");
create unique index idx_23173_tid on tbmc_f97650_thread using btree (tid);

3.1. 可以通过将表中每列值concat起来hash后再aggreate地对每行hash再hash来得出整个表的hash
mermaid.live

flowchart TD
    fields[tid\nthreadType\nstickyType\ntopicType\nisGood\ntitle\nauthorUid\npostedAt\nlatestReplyPostedAt\nlatestReplierUid\nreplyCount\nviewCount\nshareCount\nagreeCount\ndisagreeCount\nzan\ngeolocation\nauthorPhoneType\ncreatedAt\nupdatedAt\nlastSeenAt]
    fields --CONCAT_WS(' ', ...)--> row[row\nsingle long csv line string but using space and without any escape]
    row --sha2-256--> row_hash
    row_hash --GROUP_CONCAT(... ORDER BY tid SEPARATOR '&bsol;n')--> rows_hash[rows_hash\nnew line delimited rows]
    rows_hash --sha2-256--> table_hash
Loading

mysql

SET SESSION group_concat_max_len = 18446744073709551615; -- 默认只有1024超过会带WARNING截断 https://dba.stackexchange.com/questions/197746/why-set-group-concat-max-len-below-the-maximum
SELECT SHA2(GROUP_CONCAT(SHA2(CONCAT_WS(' ',
    tid, threadType, stickyType, topicType, isGood, title, authorUid, postedAt, latestReplyPostedAt, latestReplierUid, replyCount, viewCount, shareCount, agreeCount, disagreeCount, LOWER(HEX(zan)), LOWER(HEX(geolocation)), authorPhoneType, createdAt, updatedAt, lastSeenAt
), 512) ORDER BY tid SEPARATOR '\n'), 512) hash FROM tbmc_f97650_thread;

blob类型字段必须套LOWER(HEX())因为mysql默认的blob2text输出类似0xDEADBEAF形式而pgsql是deadbeaf
pgsql

SELECT encode(sha512(convert_to(array_to_string(array_agg(encode(sha512(convert_to(concat_ws(' ',
    tid, "threadType", "stickyType", "topicType", "isGood", title, "authorUid", "postedAt", "latestReplyPostedAt", "latestReplierUid", "replyCount", "viewCount", "shareCount", "agreeCount", "disagreeCount", encode(zan, 'hex'), encode(geolocation, 'hex'), "authorPhoneType", "createdAt", "updatedAt", "lastSeenAt"
), 'utf8')), 'hex') ORDER BY tid), E'\n'), 'utf8')), 'hex') hash FROM tbmc_f97650_thread;

3.2. 然而在大表上两者最终都会耗尽内存而失败,并且需要

  • 显式列出所有列名
  • 给mysql表的blob列套LOWER(HEX())
  • GROUP_CONCAT()时按照PKUK排序行否则在pgsql中默认迫真假随机
  • 导致难以针对100+个表一键生成

3.3. 理论上可以通过赋予纯SQL(无需臭名昭著的命令式存储过程或外部程序辅助)图灵完备性 https://wiki.postgresql.org/wiki/Mandelbrot_set 导致其踏入了同一条Turing Tarpit之河流的rCTELIMIT 10000地每次hash的行并对每10k行aggreate之hash再aggreate地hash从而减少每行hash所aggreate的字符串长度
mermaid.live

flowchart TD
    fields[tid\nthreadType\nstickyType\ntopicType\nisGood\ntitle\nauthorUid\npostedAt\nlatestReplyPostedAt\nlatestReplierUid\nreplyCount\nviewCount\nshareCount\nagreeCount\ndisagreeCount\nzan\ngeolocation\nauthorPhoneType\ncreatedAt\nupdatedAt\nlastSeenAt]
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT rest\nCONCAT_WS(' ', ...)--> row
    row[row\nsingle long csv line string but using space and without any escape]
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT rest\nsha2-256--> row_hash
    row_hash --GROUP_CONCAT(... ORDER BY tid SEPARATOR '&bsol;n')--> rows_hash[rows_hash\nnew line delimited rows]
    rows_hash --sha2-256--> table_hash
Loading
  1. https://stackoverflow.com/questions/59715303/is-there-any-way-join-tables-of-mysql-and-postgresql/59721228#59721228某so纯路人指出可以使用postgre社区牛逼哄哄的某知名企业EnterpriseDB所开发的某pgsql扩展 https://github.com/EnterpriseDB/mysql_fdw 通过foreign data wrapper来在pgsql中将外部数据当做迫真物化视图查询

4.1. 安装:

git clone https://github.com/EnterpriseDB/mysql_fdw &&
git checkout REL-2_9_1 &&
sudo apt install -y libmysqlclient-dev postgresql-server-dev-16 &&
export PATH=/usr/local/pgsql/bin/:$PATH &&
make USE_PGXS=1 &&
sudo make USE_PGXS=1 install &&
make USE_PGXS=1 installcheck &&
sudo apt purge -y postgresql-server-dev-16

清理:

sudo apt purge -y libmysqlclient-dev &&
sudo make USE_PGXS=1 uninstall &&
rm -rf mysql_fdw

4.2. 安装该pgsql扩展

ALTER USER tbm WITH SUPERUSER;
CREATE EXTENSION IF NOT EXISTS mysql_fdw;
ALTER USER tbm WITH NOSUPERUSER;
CREATE SERVER IF NOT EXISTS mysql
    FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '127.0.0.1', port '3306');
CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER -- 当前登录用户 也可以用PUBLIC使得用户名/密码对所有pgsql用户可见
    SERVER mysql
    OPTIONS (username 'username', password 'password');

清理:

DROP USER MAPPING IF EXISTS FOR CURRENT_USER SERVER mysql; -- FOR CURRENT_USER同上
DROP SCHEMA IF EXISTS mysql CASCADE;
DROP SERVER IF EXISTS mysql CASCADE;
DROP EXTENSION IF EXISTS mysql_fdw CASCADE;
DROP DATABASE contrib_regression; -- make USE_PGXS=1 installcheck所产生

4.3. 在pgsql中创建fdw表

CREATE SCHEMA IF NOT EXISTS mysql;
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmc_f97650_thread"()
    INHERITS (tbm."tbmc_f97650_thread")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmc_f97650_thread');

使用table inheritance是为了避免在2.已经pgloader schema.load后再写一遍fdw表结构,但这会导致

4.3.1. 被继承表tbmc_f97650_thread无法再直接UPDATE/DELETE返回[HV005] ERROR: system attribute "tableoid" can't be fetched from remote relation而必须加ONLY EnterpriseDB/mysql_fdw#300 但仍然可以INSERT

4.3.1.1. 这个bug可能会在未来版本的mysql_fdw中修复但在这里并不重要,因为这种迫真假锁全表正好避免了在校验期间UPDATE/DELETE(但防不了INSERT)(默认事务隔离级别READ COMMITTED而非mysql默认的REPEATABLE READ #32 (comment) 当然您也可以改)了pgsql表导致假阴性

4.3.2. 由于table inheritance的本质:

SELECT * FROM tbm."tbmc_f97650_thread"

也会出现

SELECT * FROM mysql."tbmc_f97650_thread"

中的所有结果:也就是被继承表并集了所有继承表(在这里通过查询mysql),相当于

SELECT * FROM ONLY tbm."tbmc_f97650_thread" UNION ALL SELECT * FROM mysql."tbmc_f97650_thread"
  1. 设有表tbmi_ocr_box_Latn结构:
    mysql端:
CREATE TABLE `tbmi_ocr_box_Latn` (
  `imageId` int unsigned NOT NULL,
  `frameIndex` int unsigned NOT NULL,
  `centerPointX` smallint unsigned NOT NULL,
  `centerPointY` smallint unsigned NOT NULL,
  `width` smallint unsigned NOT NULL,
  `height` smallint unsigned NOT NULL,
  `rotationDegrees` smallint unsigned NOT NULL,
  `recognizer` enum('TesseractHorizontal','TesseractVertical','PaddleOCRv4','PaddleOCRv3') NOT NULL,
  `confidence` tinyint unsigned NOT NULL,
  `text` text NOT NULL,
  PRIMARY KEY (`imageId`,`frameIndex`,`centerPointX`,`centerPointY`,`width`,`height`,`rotationDegrees`,`recognizer`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm."tbmi_ocr_box_Latn" (
  "imageId" bigint not null,
  "frameIndex" bigint not null,
  "centerPointX" integer not null,
  "centerPointY" integer not null,
  width integer not null,
  height integer not null,
  "rotationDegrees" integer not null,
  recognizer "tbmi_ocr_box_Latn_recognizer" not null,
  confidence smallint not null,
  text text not null,
  primary key ("imageId", "frameIndex", "centerPointX", "centerPointY", width, height, "rotationDegrees", recognizer)
);

5.1. 根据某so纯路人symmdifftable有4种写法:

5.1.1. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/15333054#15333054

-- symmetric difference FULL OUTER JOIN USING (...columns)
SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn" a FULL OUTER JOIN ONLY tbm."tbmi_ocr_box_Latn" b
  USING ("imageId", "frameIndex", "centerPointX", "centerPointY", "width", "height", "rotationDegrees", "recognizer", "confidence", "text")
WHERE (a."imageId" IS NULL AND a."frameIndex" IS NULL AND a."centerPointX" IS NULL AND a."centerPointY" IS NULL AND a."width" IS NULL AND a."height" IS NULL AND a."rotationDegrees" IS NULL AND a."recognizer" IS NULL AND a."confidence" IS NULL AND a."text" IS NULL)
   OR (b."imageId" IS NULL AND b."frameIndex" IS NULL AND b."centerPointX" IS NULL AND b."centerPointY" IS NULL AND b."width" IS NULL AND b."height" IS NULL AND b."rotationDegrees" IS NULL AND b."recognizer" IS NULL AND b."confidence" IS NULL AND b."text" IS NULL);
1 row retrieved starting from 1 in 5 s 745 ms (execution: 5 s 698 ms, fetching: 47 ms)

5.1.1.1. 一键生成

SELECT
    'SELECT COUNT(*) FROM mysql."' ||
    table_name ||
    '" a FULL OUTER JOIN ONLY tbm."' ||
    table_name ||
    E'" b \n  USING ("' ||
    string_agg(column_name, '", "' ORDER BY ordinal_position) ||
    E'")\nWHERE (a."' ||
    string_agg(column_name, '" IS NULL AND a."' ORDER BY ordinal_position) ||
    E'" IS NULL)\n   OR (b."' ||
    string_agg(column_name, '" IS NULL AND b."' ORDER BY ordinal_position) ||
    '" IS NULL);'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND table_name = 'tbmi_ocr_box_Latn'
GROUP BY table_name;

5.1.1.2. 也可以使用NATURAL FULL OUTER JOIN ONLY https://www.postgresql.org/docs/current/queries-table-expressions.html 省略USING (...)

Finally, NATURAL is a shorthand form of USING: it forms a USING list consisting of all column names that appear in both input tables. As with USING, these columns appear only once in the output table. If there are no common column names, NATURAL JOIN behaves like JOIN ... ON TRUE, producing a cross-product join.

5.1.2. 但5.1.1.5.1.1.2.并不能用于有列类型允许NULL之存在的表如7.中的tbm_bilibiliVote表有着

[...]
  `authorExpGrade` int DEFAULT NULL,
  `voteBy` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `voteFor` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `replyContent` json DEFAULT NULL,
[...]

因为根据sql特色之 https://en.wikipedia.org/wiki/Three-valued_logic NULL = NULLUNKNOWNUNKNOWNboolfalse于是不会被ON/WHERE等clause视作相同 https://stackoverflow.com/questions/14366004/sql-server-join-missing-null-values

5.1.2.1. 因而需要将5.1.1.中的USING (...columns)展开成对所有列=operator以及额外的OR (IS NULL AND IS NULL)

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON
    (a."pid" = b."pid") AND
    (a."authorUid" = b."authorUid") AND
    (a."authorExpGrade" = b."authorExpGrade" OR (a."authorExpGrade" IS NULL AND b."authorExpGrade" IS NULL)) AND
    (a."isValid" = b."isValid") AND
    (a."voteBy" = b."voteBy" OR (a."voteBy" IS NULL AND b."voteBy" IS NULL)) AND
    (a."voteFor" = b."voteFor" OR (a."voteFor" IS NULL AND b."voteFor" IS NULL)) AND
    (a."replyContent" = b."replyContent") AND
    (a."postTime" = b."postTime")
WHERE (a."pid" IS NULL AND a."authorUid" IS NULL AND a."authorExpGrade" IS NULL AND a."isValid" IS NULL AND a."voteBy" IS NULL AND a."voteFor" IS NULL AND a."replyContent" IS NULL AND a."postTime" IS NULL)
   OR (b."pid" IS NULL AND b."authorUid" IS NULL AND b."authorExpGrade" IS NULL AND b."isValid" IS NULL AND b."voteBy" IS NULL AND b."voteFor" IS NULL AND b."replyContent" IS NULL AND b."postTime" IS NULL);

5.1.2.1.1. 不对NOT NULL的列做额外判断是为了避免7.1.中的

FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

5.1.2.2. 一键生成

SELECT
    'SELECT COUNT(*) FROM mysql."' ||
    table_name ||
    '" a FULL OUTER JOIN ONLY tbm."' ||
    table_name ||
    E'" b ON\n' ||
    string_agg(
        '    (a."' ||
        column_name ||
        '" = b."' ||
        column_name ||
        CASE WHEN is_nullable::bool = true THEN
            '" OR (a."' ||
            column_name ||
            '" IS NULL AND b."' ||
            column_name ||
            '" IS NULL))'
        ELSE '")'
        END,
        E' AND\n' ORDER BY ordinal_position
    ) ||
    E'\nWHERE (a."' ||
    string_agg(column_name, '" IS NULL AND a."' ORDER BY ordinal_position) ||
    E'" IS NULL)\n   OR (b."' ||
    string_agg(column_name, '" IS NULL AND b."' ORDER BY ordinal_position) ||
    '" IS NULL);'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND table_name = 'tbm_bilibiliVote'
GROUP BY table_name;

5.1.3. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/49381589#49381589

-- symmetric difference FULL OUTER JOIN row(table)
SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn" a FULL OUTER JOIN ONLY tbm."tbmi_ocr_box_Latn" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
1 row retrieved starting from 1 in 6 s 464 ms (execution: 6 s 345 ms, fetching: 119 ms)

5.1.3.1. 其中row(t)https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS

5.1.3.2. 这也避免了5.1.2.

5.1.4. https://stackoverflow.com/questions/6337871/how-can-i-speed-up-a-diff-between-tables

-- non symmetric difference EXCEPT mysql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM mysql."tbmi_ocr_box_Latn" b) t;
1 row retrieved starting from 1 in 5 s 137 ms (execution: 4 s 903 ms, fetching: 234 ms)

由于EXCEPT不是symmdiffFULL OUTER JOIN而是SELECT a.* FROM a LEFT OUTER JOIN b所以只能执行两遍2x耗时正如同没有FULL OUTER JOIN可用的mysql人 https://stackoverflow.com/questions/4796872/how-can-i-do-a-full-outer-join-in-mysql

-- non symmetric difference EXCEPT tbm
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b) t;
1 row retrieved starting from 1 in 5 s 793 ms (execution: 5 s 527 ms, fetching: 266 ms)

5.1.4.1. 由于未知bug

-- symmetric difference EXCEPT
SELECT COUNT(*) FROM (
    SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM mysql."tbmi_ocr_box_Latn" b
    UNION ALL
    SELECT * FROM mysql."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b
) t;

UNION ALL后始终为0行即便单独查询有行

1 row retrieved starting from 1 in 8 s 678 ms (execution: 8 s 595 ms, fetching: 83 ms)

5.1.4.2. 这也避免了5.1.2.

5.1.5. 典型的误用NOT EXISTS

-- non symmetric difference WHERE NOT EXISTS mysql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM mysql."tbmi_ocr_box_Latn" b)) t;
1 row retrieved starting from 1 in 764 ms (execution: 250 ms, fetching: 514 ms)

因为SELECT * FROM mysql."tbmi_ocr_box_Latn"在此显然返回非0行(因而只需要读一行mysql."tbmi_ocr_box_Latn"几百ms后就会立即返回)从而使得该predicate在非空表上恒真

-- non symmetric difference WHERE NOT EXISTS tbm
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b)) t;
1 row retrieved starting from 1 in 516 ms (execution: 125 ms, fetching: 391 ms)

5.1.5.1. 由于未知bug

-- symmetric difference WHERE NOT EXISTS
SELECT COUNT(*) FROM (
    SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM mysql."tbmi_ocr_box_Latn" b)
    UNION ALL
    SELECT * FROM mysql."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b)
) t;

UNION ALL后始终为0行即便单独查询有非0行

1 row retrieved starting from 1 in 776 ms (execution: 203 ms, fetching: 573 ms)

5.1.6. 除5.1.5.外的3种执行期间均需pgsql消耗1x表大小的内存(总比3.2.好)和3x表大小临时表存储
image
image
image
image

5.1.6.1. 可以修改临时表空间在fs上的默认路径/var/lib/postgresql/16/main/base/pgsql_tmp https://dba.stackexchange.com/questions/170661/in-postgres-how-do-i-adjust-the-pgsql-tmp-setting/170665#170665

tmp={目的路径} &&
mkdir -p $tmp /PG_16_202307071 &&
chown -R postgres: $tmp &&
chmod -R 700 $tmp
create tablespace temp_tablespace owner postgres location '{目的路径}';
grant create on tablespace temp_tablespace to public;
alter system set temp_tablespaces = temp_tablespace;
select pg_reload_conf();
create temp table a(b text);

5.2. 总行数对比

SELECT a, b, b - a diff FROM (SELECT (SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn") a, (SELECT COUNT(*) FROM ONLY tbm."tbmi_ocr_box_Latn") b) t;
   a    |   b    | diff
--------+--------+--------
 472655 | 472655 | 0
(1 row)
  1. 令文件compare.sql有在mysql中codegen出基于5.1.3.4.3.5.2.校验用的一键生成pgsql
SELECT CONCAT(
    'CREATE FOREIGN TABLE IF NOT EXISTS mysql."',
    TABLE_NAME,
    '"() INHERITS (tbm."',
    TABLE_NAME,
    '") SERVER mysql OPTIONS (dbname \'tbm\', table_name \'',
    TABLE_NAME,
    '\');\n',
    'SELECT a, b, b - a diff FROM (SELECT (SELECT COUNT(*) FROM mysql."',
    TABLE_NAME,
    '") a, (SELECT COUNT(*) FROM ONLY tbm."',
    TABLE_NAME,
    '") b) t;\n',
    'SELECT COUNT(*) FROM mysql."',
    TABLE_NAME,
    '" a FULL OUTER JOIN ONLY tbm."',
    TABLE_NAME,
    '" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;'
)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'tbm'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

6.1. 一键bash

cd pgloader && # 1.1.1.中git clone的pgloader目录路径
(
    /usr/bin/time -v build/bin/pgloader schema.load &&
    /usr/bin/time -v build/bin/pgloader data.load &&
    PGPASSWORD=password /usr/bin/time -v psql -aUusername tbm < compare.sql
) 2>&1 | tee -a stdout+err
  1. 设有表tbm_bilibiliVote结构 https://github.com/n0099/bilibiliVote
    mysql端:
CREATE TABLE `tbm_bilibiliVote` (
  `pid` bigint NOT NULL,
  `authorUid` bigint NOT NULL,
  `authorExpGrade` int DEFAULT NULL,
  `isValid` tinyint(1) NOT NULL,
  `voteBy` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `voteFor` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `replyContent` json DEFAULT NULL,
  `postTime` datetime NOT NULL,
  PRIMARY KEY (`pid`),
  KEY `authorUid` (`authorUid`),
  KEY `isVaild` (`isValid`) USING BTREE,
  KEY `voteFor` (`voteFor`(100),`isValid`) USING BTREE,
  KEY `postTime` (`postTime`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm."tbm_bilibiliVote" (
  pid bigint primary key not null,
  "authorUid" bigint not null,
  "authorExpGrade" integer,
  "isValid" boolean not null,
  "voteBy" text,
  "voteFor" text,
  "replyContent" json,
  "postTime" timestamp with time zone not null
);
create index "idx_163841_authorUid" on "tbm_bilibiliVote" using btree ("authorUid");
create index "idx_163841_isVaild" on "tbm_bilibiliVote" using btree ("isValid");
create index "idx_163841_postTime" on "tbm_bilibiliVote" using btree ("postTime");
create index "idx_163841_voteFor" on "tbm_bilibiliVote" using btree ("voteFor", "isValid");

7.1. 使用5.1.3.中的symmetric difference FULL OUTER JOIN row(table)

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;

可得 https://stackoverflow.com/questions/47405732/why-does-postgresql-throw-full-join-is-only-supported-with-merge-joinable-or-ha

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

7.1.1. 使用5.1.5.中的non symmetric difference EXCEPT

SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbm_bilibiliVote" a EXCEPT SELECT * FROM mysql."tbm_bilibiliVote" b) t;
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbm_bilibiliVote" a EXCEPT SELECT * FROM ONLY tbm."tbm_bilibiliVote" b) t;

可得 https://stackoverflow.com/questions/48420438/could-not-identify-an-equality-operator-for-type-json-when-using-distinct

ERROR: could not identify an equality operator for type json

7.1.2. 执行5.1.2.1.

SELECT * FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON
    a."authorUid" = b."authorUid" AND
    a."isValid" = b."isValid" AND
    a."pid" = b."pid" AND
    a."postTime" = b."postTime" AND
    a."replyContent" = b."replyContent" AND
    a."voteBy" = b."voteBy" AND
    a."voteFor" = b."voteFor" AND
    a."authorExpGrade" = b."authorExpGrade"
WHERE (a."pid" IS NULL AND a."authorUid" IS NULL AND a."authorExpGrade" IS NULL AND a."isValid" IS NULL AND a."voteBy" IS NULL AND a."voteFor" IS NULL AND a."replyContent" IS NULL AND a."postTime" IS NULL)
   OR (b."pid" IS NULL AND b."authorUid" IS NULL AND b."authorExpGrade" IS NULL AND b."isValid" IS NULL AND b."voteBy" IS NULL AND b."voteFor" IS NULL AND b."replyContent" IS NULL AND b."postTime" IS NULL);

会有 https://stackoverflow.com/questions/32843213/operator-does-not-exist-json-json

ERROR: operator does not exist: json = json
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.

7.1.3. 7.中可见列replyContentjson类型

-    a."replyContent" = b."replyContent" AND
+    a."replyContent"::jsonb = b."replyContent"::jsonb AND
count
29510

这恰好等于5.2.

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" UNION ALL SELECT COUNT(*) FROM ONLY tbm."tbm_bilibiliVote";
count
14755
14755

之和

7.1.4. 但即便是json::text

-    a."replyContent" = b."replyContent" AND
+    a."replyContent"::text = b."replyContent"::text AND

也仍然相同

7.1.5. 可以在mysql端将该json列转text

ALTER TABLE tbm_bilibiliVote MODIFY replyContent TEXT NOT NULL;

后再重新按照2.2.仅导入该表(INCLUDING ONLY TABLE NAMES MATCHING 'tbm_bilibiliVote'

7.2. 但即便两端的列replyContent重新导入后均为text类型也仍然有所有行不同
实际上是由于列postTimemysql无时区类型datetime2.3.中的pgloadermysql://默认的CASTING

type datetime to timestamptz

转为pgsql有时区类型timestamptz时由pgloader根据系统时区 UTC+8而非pgsql时区 UTC+0转换为UTC

SELECT a.pid, a."postTime", b."postTime" FROM mysql."tbm_bilibiliVote" a JOIN ONLY tbm."tbm_bilibiliVote" b ON a.pid = b.pid ORDER BY pid;
pid postTime postTime
124498114574 2019-03-10 12:38:17.000000 +00:00 2019-03-10 04:38:17.000000 +00:00
124498200265 2019-03-10 12:44:12.000000 +00:00 2019-03-10 04:44:12.000000 +00:00
...

7.2.1. 由于timestamptz实际上并没有额外存储时区UTC offset https://stackoverflow.com/questions/5876218/difference-between-timestamps-with-without-time-zone-in-postgresql
因而其并不能解决 https://z.n0099.net/#narrow/near/86397 https://old.reddit.com/r/PostgreSQL/comments/xpygbh/when_would_i_ever_use_timestamp_over_timestamptz/ https://news.ycombinator.com/item?id=20212671 中争论的带时区引用未来指定datetime问题
https://www.w3.org/International/wiki/WorkingWithTimeZones#Past_and_Future_Events

The SQL data types 'date', 'time', and 'timestamp' are field based time values which are intended to be zone offset independent: they are actually, technically, floating time values! The data type 'timestamp with time zone' is the zone offset-dependent equivalent of 'timestamp' in SQL.

https://www.w3.org/International/wiki/WorkingWithTimeZones#Floating_Time

  • Floating Time
    Some observed time values are not related to a specific moment in incremental time. Instead, they need to be combined with local information to determine a range of acceptable incremental time values. We refer to these sorts of time values as "floating times". Floating times are not attached and should never be attached to a particular time zone.
    Examples of floating time events include a user’s birth date, an employee’s hire or termination date, or a list of company holidays.
    For example, suppose that January 1st is considered a holiday in your application. The day "January 1" has a floating-time status as a "holiday". That "day" can begin as early as midnight GMT-14:00 and end as late as midnight of January 2 GMT+12:00, depending on local time. This covers an incremental time range of fifty hours.

再叠加mysql_fdw不论当前pgsql时区 https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES 是什么都不会在将从mysql读到的datetime转换为timestamptz时根据时区重新计算时间而是只修改offset(也就是使用跟mysql类型datetime语义相同的pgsql类型timestamp的语义)

SET SESSION TIME ZONE 'Asia/Shanghai';
SELECT a.pid, a."postTime", b."postTime" FROM mysql."tbm_bilibiliVote" a JOIN ONLY tbm."tbm_bilibiliVote" b ON a.pid = b.pid ORDER BY pid;
pid postTime postTime
124498114574 2019-03-10 12:38:17.000000 +08:00 2019-03-10 12:38:17.000000 +08:00
124498200265 2019-03-10 12:44:12.000000 +08:00 2019-03-10 12:44:12.000000 +08:00

因而只需要修改pgsql时区

SET SESSION TIME ZONE 'Asia/Shanghai';
SELECT COUNT(*) FROM ONLY mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;

便校验一致

7.2.2. 如果执行了7.1.5.需再在pgsql端将该列类型改回jsonjsonb

DROP FOREIGN TABLE mysql."tbm_bilibiliVote"; -- 避免`ALTER TABLE`由于`4.3.2.`而返回假阳性之`ERROR: "tbm_bilibiliVote" is not a table`
ALTER TABLE tbm."tbm_bilibiliVote" ALTER "replyContent" TYPE jsonb USING "replyContent"::jsonb;
-- 修改`6.`中的`TABLE_NAME REGEXP '...'`predicate为`TABLE_NAME = 'tbm_bilibiliVote'`并eval其输出以重新执行校验
VACUUM FULL tbm."tbm_bilibiliVote";
  1. 设有表tbmi_metadata_gif结构:
CREATE TABLE `tbmi_metadata_gif` (
  `imageId` int unsigned NOT NULL,
  `repeatCount` smallint unsigned NOT NULL,
  `colorTableMode` enum('Global','Local') NOT NULL,
  `globalColorTableLength` int NOT NULL,
  `comments` json DEFAULT NULL,
  PRIMARY KEY (`imageId`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm.tbmi_metadata_gif (
  "imageId" bigint primary key not null,
  "repeatCount" integer not null,
  "colorTableMode" "tbmi_metadata_gif_colorTableMode" not null,
  "globalColorTableLength" integer not null,
  comments json
);

8.1. 按照7.1.3.

SELECT COUNT(*) FROM mysql."tbmi_metadata_gif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_gif" b ON
    (a."imageId" = b."imageId") AND
    (a."repeatCount" = b."repeatCount") AND
    (a."colorTableMode" = b."colorTableMode") AND
    (a."globalColorTableLength" = b."globalColorTableLength") AND
    (a."comments"::jsonb = b."comments"::jsonb OR (a."comments" IS NULL AND b."comments" IS NULL))
WHERE (a."imageId" IS NULL AND a."repeatCount" IS NULL AND a."colorTableMode" IS NULL AND a."globalColorTableLength" IS NULL AND a."comments" IS NULL)
   OR (b."imageId" IS NULL AND b."repeatCount" IS NULL AND b."colorTableMode" IS NULL AND b."globalColorTableLength" IS NULL AND b."comments" IS NULL);

ERROR: unsupported Unicode escape sequence
Detail: \u0000 cannot be converted to text.
Where: JSON data, line 1: ...                   ", "Created by TechSmith\u0000...

8.1.1. 但按照7.1.4.就无差异

-    (a."comments"::jsonb = b."comments"::jsonb OR (a."comments" IS NULL AND b."comments" IS NULL))
+    (a."comments"::text = b."comments"::text OR (a."comments" IS NULL AND b."comments" IS NULL))

8.1.2. 这是由于jsonb类型不同于本质textjson,会将json解析后存储为pgsql内置类型之集,再由于text类型受c人最爱的 https://en.wikipedia.org/wiki/Null-terminated_string 影响不支持存储256个字节中的唯一一个0x00导致jsonb也无法存储\u0000转义后的0x00进其内部的text turbot/steampipe-postgres-fdw#118
https://www.postgresql.org/docs/current/datatype-json.html

The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type)

Table 8.23. JSON Primitive Types and Corresponding PostgreSQL Types

JSON primitive type PostgreSQL type Notes
string text \u0000 is disallowed, as are Unicode escapes representing characters not available in the database > encoding
number numeric NaN and infinity values are disallowed
boolean boolean Only lowercase true and false spellings are accepted
null (none) SQL NULL is a different concept

8.2. 该表该列中共有如下\u0000 https://codepoints.net/U+0000

SELECT COUNT(*) count, comments::text c
FROM ONLY tbm."tbmi_metadata_gif"
WHERE comments::text LIKE '%\u0000%'
GROUP BY c
ORDER BY count DESC;
count c
31 [" ", "Created by TechSmith\u0000"]
2 ["tvc\u0000"]
1 ["Author:\t\tGod\_job\_dave\u0000\u0000\u0000???\u0001\u0000\u0000\u0000\u0000P??\u0001\u0000"]
1 ["Optimized by Ulead SmartSaver!\u0000"]
1 ["CREATOR: gd-jpeg v1.0 \(using IJG JPEG v62\), quality = 75\n\u0000"]

8.2.1. 使用如下3层如同6.codegen迫真元编程自我复制的quine病毒可在mysql端检查所有含有0x00text类型及其各个长度变种 https://stackoverflow.com/questions/13932750/tinytext-text-mediumtext-and-longtext-maximum-storage-sizes 列并生成ALTER列类型为BLOB(pgloader在pgsql端按照2.3.mysql://源下默认的CASTING会转换为可存储所有256个字节bytea)和SELECT查阅的一键生成mysql:

SET SESSION group_concat_max_len = 18446744073709551615; -- 同`3.1.`
SELECT CONCAT(
'SELECT t.*, CONCAT(
    \'ALTER TABLE tbm.`\',
    t,
    \'` MODIFY `\',
    c,
    \'` \',
    REPLACE(COLUMN_TYPE, \'text\', \'blob\'),
    CASE WHEN COLUMNS.IS_NULLABLE = \'true\' THEN \' NULL;\'
    ELSE \' NOT NULL;\'
    END
) `alter`, CONCAT(
    \'SELECT * FROM tbm.`\',
    t,
    \'` WHERE CAST(`\',
    c,
    \'` AS BINARY) LIKE CONCAT(\'\'%\'\', 0x00, \'\'%\'\');\'
) `select`
FROM (\n',
    GROUP_CONCAT(
        '    SELECT COUNT(*) count, \'',
        TABLE_NAME,
        '\' t, \'',
        COLUMN_NAME,
        '\' c FROM `',
        TABLE_NAME,
        '` WHERE CAST(`',
        COLUMN_NAME,
        '` AS BINARY) LIKE CONCAT(\'%\', 0x00, \'%\')'
        SEPARATOR ' UNION ALL\n'
    ),
    '\n) t
JOIN information_schema.COLUMNS
  ON count != 0 AND TABLE_SCHEMA = \'tbm\'
 AND TABLE_NAME = t AND COLUMN_NAME = c;'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND COLUMN_TYPE LIKE '%text%'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

count t c alter select
4321 tbmi_metadata_embedded_exif userComment ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `userComment` BLOB NOT NULL; SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`userComment` AS BINARY) LIKE CONCAT('%', 0x00, '%');
2612 tbmi_metadata_embedded_exif xpAuthor ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `xpAuthor` BLOB NOT NULL; SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`xpAuthor` AS BINARY) LIKE CONCAT('%', 0x00, '%');
0.00s user 0.02s system 0% cpu 42:49.75 total

8.3. 值得注意的是2.2.导入 dimitri/pgloader#1573 不会有截断0x00后字节的WARNING6.校验 EnterpriseDB/mysql_fdw#299 过程中也无法发现以0x00结尾text中的最后一个0x00字节消失

  1. 设有表tbmi_hash结构:
CREATE TABLE `tbmi_hash` (
  `imageId` int unsigned NOT NULL,
  `frameIndex` int unsigned NOT NULL,
  `pHash` binary(8) NOT NULL,
  `averageHash` bigint unsigned NOT NULL,
  `blockMeanHash` binary(32) NOT NULL,
  `marrHildrethHash` binary(72) NOT NULL,
  `thumbHash` tinyblob NOT NULL,
  PRIMARY KEY (`imageId`,`frameIndex`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm.tbmi_hash (
  "imageId" bigint not null,
  "frameIndex" bigint not null,
  "pHash" bytea not null,
  "averageHash" bigint not null,
  "blockMeanHash" bytea not null,
  "marrHildrethHash" bytea not null,
  "thumbHash" bytea not null,
  primary key ("imageId", "frameIndex")
);

9.1. 由于pgsql没有mysql特有(isosql不要求)带unsigned修饰的的int类型 https://stackoverflow.com/questions/20810134/why-unsigned-integer-is-not-available-in-postgresql/59802732#59802732 正如同m$ft精神mvp最爱的中小实体企业c#工控上位机人上壬 https://z.n0099.net/#narrow/near/94726 除非使用扩展 https://github.com/petere/pguint 因而

-  `averageHash` bigint unsigned NOT NULL,
+  "averageHash" bigint not null,

导致该列上限值从 $2^{64} = 18446744073709551615$ 降到 $2^{63} = 9223372036854775807$

9.2. 实际上所谓的opencv_imghash lumina37/aiotieba#63 (comment)averageHash

{AverageHash.Create(), (image, bytes) => image.AverageHash = BitConverter.ToUInt64(bytes)},
本质上更类似于 https://en.wikipedia.org/wiki/Nominal_number 因而不应该使用具有可运算的正整数语义之类型存储而应直接视作斑点二进制 https://en.wikipedia.org/wiki/Binary_large_object

ALTER TABLE tbmi_hash ADD COLUMN averageHash_2 BINARY(8) NOT NULL AFTER averageHash;
UPDATE tbmi_hash SET averageHash_2 = UNHEX(LPAD(HEX(averageHash), 16, '00')); -- https://stackoverflow.com/questions/58611436/how-do-you-convert-an-integer-into-its-condensed-binary-equivalent-e-g-binary3/58657797#comment138021380_58657797
SELECT COUNT(*) FROM tbmi_hash WHERE averageHash_2 != UNHEX(LPAD(HEX(averageHash), 16, '00'));
-- expecting 0 rows returned, then:
-- ALTER TABLE tbmi_hash DROP COLUMN averageHash, RENAME COLUMN averageHash_2 TO averageHash;

9.3. 一键生成找出所有存在值 $&gt; 2^{63}$bigint unsigned列的mysql

SET SESSION group_concat_max_len = 18446744073709551615; -- 同`3.1.`
SELECT CONCAT(
    'SELECT * FROM (\n',
    GROUP_CONCAT(
        '    SELECT COUNT(*) c, \'',
        TABLE_NAME,
        '\' t FROM `',
        TABLE_NAME,
        '` WHERE `',
        COLUMN_NAME,
        '` > 9223372036854775807'
        SEPARATOR ' UNION ALL\n'
    ),
    '\n) t WHERE c != 0;'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND COLUMN_TYPE = 'bigint unsigned'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

9.4.

DROP TABLE tbm."tbmi_hash" CASCADE;

类比7.1.5.重新导入该表

CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_hash"()
    INHERITS (tbm."tbmi_hash")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmi_hash');

Comment is too long (maximum is 65536 characters)

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions