使用pgloader将mysql迁移至pgsql并使用mysql_fdw校验表数据

0. 环境
```bash
$ uname -a
Linux azure 6.5.0-1018-azure #19~22.04.2-Ubuntu SMP Thu Mar 21 16:45:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ mysqld --version
/usr/sbin/mysqld  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))
$ /usr/lib/postgresql/16/bin/postgres --version
postgres (PostgreSQL) 16.2 (Ubuntu 16.2-1.pgdg22.04+1)
```
`pgloader`@`3.6.10`
`mysql_fdw`@`2.9.1`
库名假定为`tbm`

1. 使用某[11年历史](https://github.com/dimitri/pgloader/commit/1dfa170a49cbb3b42580e545467903dd643cc934)的commonlisp程序 https://github.com/dimitri/pgloader 可以在多个数据源（其中有着mysql和pgsql）之间导入导出

1.1 由于 https://github.com/dimitri/pgloader/issues/1211 ，从[postgre社区apt source](https://apt.postgresql.org)安装的`pgloader 3.6.10-2.pgdg22.04+1`无法连接mysqld
```bash
$ apt show -a pgloader
Package: pgloader
Version: 3.6.10-2.pgdg22.04+1
Priority: optional
Section: database
Maintainer: Dimitri Fontaine <dim@tapoueh.org>
Installed-Size: 33.9 MB
Depends: freetds-dev, libc6 (>= 2.34), zlib1g (>= 1:1.1.4), libsqlite3-0, libssl3
Homepage: https://github.com/dimitri/pgloader
Download-Size: 29.3 MB
APT-Sources: https://apt.postgresql.org/pub/repos/apt jammy-pgdg/main amd64 Packages
Description: extract, transform and load data into PostgreSQL
    pgloader imports data from different kind of sources and COPY it into
    PostgreSQL.
    .
    The command language is described in the manual page and allows one to
    describe where to find the data source, its format, and to describe data
    processing and transformation.
    .
    Supported source formats include CSV, fixed width flat files, dBase3 files
    (DBF), and SQLite and MySQL databases. In most of those formats, pgloader
    is able to auto-discover the schema and create the tables and the indexes
    in PostgreSQL. In the MySQL case it's possible to edit CASTing rules from
    the pgloader command directly.

Package: pgloader
Version: 3.6.3-1ubuntu1
Priority: extra
Section: universe/misc
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Dimitri Fontaine <dim@tapoueh.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 33.8 MB
Depends: freetds-dev, libc6 (>= 2.34), zlib1g (>= 1:1.1.4), libssl3
Homepage: https://github.com/dimitri/pgloader
Download-Size: 29.5 MB
APT-Sources: http://azure.archive.ubuntu.com/ubuntu jammy/universe amd64 Packages
Description: extract, transform and load data into PostgreSQL
    pgloader imports data from different kind of sources and COPY it into
    PostgreSQL.
    .
    The command language is described in the manual page and allows one to
    describe where to find the data source, its format, and to describe data
    processing and transformation.
    .
    Supported source formats include CSV, fixed width flat files, dBase3 files
    (DBF), and SQLite and MySQL databases. In most of those formats, pgloader
    is able to auto-discover the schema and create the tables and the indexes
    in PostgreSQL. In the MySQL case it's possible to edit CASTing rules from
    the pgloader command directly.
```

1.1.1. 建议从[源码自行编译](https://github.com/dimitri/pgloader/blob/v3.6.10/INSTALL.md)`3.6.10`
```bash
sudo apt install -y pgloader sbcl && # 通过安装apt中的pgloader来安装其运行时所需依赖和sbcl commonlisp编译器
git clone https://github.com/dimitri/pgloader &&
cd pgloader &&
git checkout v3.6.10 &&
make DYNSIZE=1024 && # 内存限制
build/bin/pgloader --version && # pgloader version "3.6.af8c3c1" compiled with SBCL 2.1.11.debian
sudo apt purge -y sbcl
```

1.1.2. 清理：
```bash
sudo apt purge -y pgloader &&
sudo apt autoremove --purge
```

1.2. 由于 https://github.com/dimitri/pgloader/issues/782 ，pgloader只能通过仍在使用`mysql 8.0.34`中deprecated的[`mysql_native_password`](https://dev.mysql.com/doc/refman/8.0/en/native-pluggable-authentication.html)用户登录mysqld

1.2.1. 建议ad-hoc地临时创建一个[最小权限](https://en.wikipedia.org/wiki/Principle_of_least_privilege)（只[`GRANT`](https://dev.mysql.com/doc/refman/8.0/en/grant.html)库/表级权限）用户用于pgloader而非修改现有用户的[`default auth plugin`](https://dev.mysql.com/doc/refman/8.0/en/pluggable-authentication.html#pluggable-authentication-default-plugin)

2. 令文件`schema.load`有
```sql
LOAD DATABASE
    FROM mysql://username:password@localhost/tbm
    INTO pgsql://username:password@localhost/tbm

WITH quote identifiers, schema only

INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/
;
```

2.1. 可以使用[`INCLUDING ONLY TABLE NAMES MATCHING ~/regexp/, 'exact'`](https://pgloader.readthedocs.io/en/latest/ref/mysql.html#including-only-table-names-matching)来filter要导入的表名

2.2. 令文件`data.load`有
```sql
LOAD DATABASE
    FROM mysql://username:password@localhost/tbm
    INTO pgsql://username:password@localhost/tbm

 SET MySQL PARAMETERS -- https://github.com/dimitri/pgloader/issues/999
     net_read_timeout  = '300',
     net_write_timeout = '300'

WITH quote identifiers, data only, on error stop, workers = 2, concurrency = 1, batch rows = 1000, batch size = 8MB, prefetch rows = 10000 -- 酌情tunning https://pgloader.readthedocs.io/en/latest/command.html#batch-behaviour-options

INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/
;
```

2.3. 分开导入`表结构`和`表数据`是因为pgloader默认行为是按照指定和[`mysql://`源下默认的`CASTING`](https://pgloader.readthedocs.io/en/latest/ref/mysql.html#default-mysql-casting-rules)创建目的`表结构`（除了索引），再[`COPY`](https://www.postgresql.org/docs/current/sql-copy.html)导入`表数据`，最后并行`{表索引数量}`个[`CREATE INDEX`](https://www.postgresql.org/docs/current/sql-createindex.html)并且不等待其完成而是并行地开始下一个表的过程： https://pgloader.readthedocs.io/en/latest/batches.html
> By default, as many CREATE INDEX threads as the maximum number of indexes per table are found in your source schema. It is possible to set the max parallel create index WITH option to another number in case there’s just too many of them to create.

由于未知原因在`3.6.af8c3c1`中`WITH max parallel create index 1`并没有实际作用仍然会执行上述流程所以不得不分开在`WITH schema only`的`.load`中先完成[`CREATE INDEX`](https://www.postgresql.org/docs/current/sql-createindex.html) https://github.com/dimitri/pgloader/issues/1576 以减少io争用

2.3.1. 出于类似的原因也建议先禁用[`autovacuum`](https://www.postgresql.org/docs/current/runtime-config-autovacuum.html)
```pgsql
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
SELECT * FROM pg_settings WHERE name = 'autovacuum';
```
清理：待pgloader完成后再手动[`VACUMM FULL`](https://www.postgresql.org/docs/current/sql-vacuum.html) https://dba.stackexchange.com/questions/130496/is-it-worth-it-to-run-vacuum-on-a-table-that-only-receives-inserts/130514#130514
```pgsql
ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();
VACUMM FULL;
```

2.4. `WITH quote identifiers`是为了方便后续校验两端`表数据`是否一致时可以假定列名相同，您可以之后再重命名为postgre社区习惯的`snake_case`以减少满屏幕`""`（相当于mysql的` `` `）

2.5. 视奸正在导入什么表 https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql
```pgsql
SELECT * FROM pg_stat_activity WHERE application_name = 'pgloader';
```
可根据`表大小`猜测进度 https://stackoverflow.com/questions/21738408/postgresql-list-and-order-tables-by-size/21738505#21738505
```pgsql
SELECT table_schema, table_name, pg_relation_size('"'||table_schema||'"."'||table_name||'"')::float8 / 1024 / 1024 "MiB"
FROM information_schema.tables
WHERE table_schema = 'tbm'
ORDER BY 3 DESC;
```

3. 设有表`tbmc_f97650_thread`结构
mysql端：
```mysql
CREATE TABLE `tbmc_f97650_thread` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `tid` bigint unsigned NOT NULL,
  `threadType` bigint unsigned NOT NULL,
  `stickyType` tinytext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `topicType` tinytext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `isGood` tinyint DEFAULT NULL,
  `title` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `authorUid` bigint NOT NULL,
  `postedAt` int unsigned NOT NULL,
  `latestReplyPostedAt` int unsigned NOT NULL,
  `latestReplierUid` bigint DEFAULT NULL,
  `replyCount` int unsigned DEFAULT NULL,
  `viewCount` int unsigned DEFAULT NULL,
  `shareCount` int unsigned DEFAULT NULL,
  `agreeCount` int DEFAULT NULL,
  `disagreeCount` int DEFAULT NULL,
  `zan` blob,
  `geolocation` blob,
  `authorPhoneType` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `createdAt` int unsigned NOT NULL,
  `updatedAt` int unsigned DEFAULT NULL,
  `lastSeenAt` int unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `tid` (`tid`) USING BTREE,
  KEY `authorUid` (`authorUid`),
  KEY `postTime` (`postedAt` DESC)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```
由pgloader在pgsql端按照`2.3.`中`mysql://`源下默认的`CASTING`所创建：
```pgsql
create table tbm_dev.tbmc_f97650_thread (
  id bigint primary key not null default nextval('tbmc_f97650_thread_id_seq'::regclass),
  tid bigint not null,
  "threadType" bigint not null,
  "stickyType" text,
  "topicType" text,
  "isGood" smallint,
  title text not null,
  "authorUid" bigint not null,
  "postedAt" bigint not null,
  "latestReplyPostedAt" bigint not null,
  "latestReplierUid" bigint,
  "replyCount" bigint,
  "viewCount" bigint,
  "shareCount" bigint,
  "agreeCount" integer,
  "disagreeCount" integer,
  zan bytea,
  geolocation bytea,
  "authorPhoneType" text,
  "createdAt" bigint not null,
  "updatedAt" bigint,
  "lastSeenAt" bigint
);
create index "idx_23173_authorUid" on tbmc_f97650_thread using btree ("authorUid");
create index "idx_23173_postTime" on tbmc_f97650_thread using btree ("postedAt");
create unique index idx_23173_tid on tbmc_f97650_thread using btree (tid);
```

3.1. 可以通过将表中`每列值`concat起来hash后再aggreate地对`每行hash`再hash来得出`整个表的hash`
[mermaid.live](https://mermaid.live/edit#pako:eNptkmGPmkAQhv_KZj_UayJEQKXSpInVy32rBr00rTRmZUfZdNkl7HKWM_73DgjeNSkJZN7Zd2ee3eFCU82BRvQo9TnNWGnJdpkogs9RgORmZwVPEmWzEhjf1gWgMFakv-tOWF2ItIuFedK6tQsrmwSrbKbL57ZEoY0FPrcYSmbB2BgKWa__lxXQ7Skby0JXqll_EXDuY4Oo0At2KuEuuDD_6Fem8HsCLXXKrNDqjrXOtIKOPMXj9RxVwdkbk7EbADW3v97fCnGcxerbYr7df988DMhgSFzX_eg4X0ipzzt8G0ShThKI1OpEUvNCpFBAjC0xTQ6VJVVjIKZgKRCmODkLhMI8UzUBk7ICupZYDvvhiX3Hn0y7JvuMmey-3io0PcWr5_X-hvaASGQVLx9j8vUHwTGSzeN6Hs-3q5gMPhyMlp_VoGc2bYXdPUJ-BecbMwcpcoE30hrfoEzf9T2aZQcJ7QId0hzKnAmOv9el2ZVQm0EOCY0w5HBklbQJTdQVrTgSvalVSiNbVjCktyEsBc6S5TQ6MmkwWzD1U-u8N6Gk0YX-oZE3Dl1_NB77My8IfW8084e0ppHjfwpcLxjPxhN_4oXB9Dqkr20Bzw0mo9ALp8HUm_n-bBJe_wL_Vgxu)
```mermaid
flowchart TD
    fields[tid\nthreadType\nstickyType\ntopicType\nisGood\ntitle\nauthorUid\npostedAt\nlatestReplyPostedAt\nlatestReplierUid\nreplyCount\nviewCount\nshareCount\nagreeCount\ndisagreeCount\nzan\ngeolocation\nauthorPhoneType\ncreatedAt\nupdatedAt\nlastSeenAt]
    fields --CONCAT_WS(' ', ...)--> row[row\nsingle long csv line string but using space and without any escape]
    row --sha2-256--> row_hash
    row_hash --GROUP_CONCAT(... ORDER BY tid SEPARATOR '&bsol;n')--> rows_hash[rows_hash\nnew line delimited rows]
    rows_hash --sha2-256--> table_hash
```
mysql
```mysql
SET SESSION group_concat_max_len = 18446744073709551615; -- 默认只有1024超过会带WARNING截断 https://dba.stackexchange.com/questions/197746/why-set-group-concat-max-len-below-the-maximum
SELECT SHA2(GROUP_CONCAT(SHA2(CONCAT_WS(' ',
    tid, threadType, stickyType, topicType, isGood, title, authorUid, postedAt, latestReplyPostedAt, latestReplierUid, replyCount, viewCount, shareCount, agreeCount, disagreeCount, LOWER(HEX(zan)), LOWER(HEX(geolocation)), authorPhoneType, createdAt, updatedAt, lastSeenAt
), 512) ORDER BY tid SEPARATOR '\n'), 512) hash FROM tbmc_f97650_thread;
```
blob类型字段必须套`LOWER(HEX())`因为mysql默认的`blob2text`输出类似[`0xDEADBEAF`](https://en.wikipedia.org/wiki/Hexspeak)形式而pgsql是`deadbeaf`
pgsql
```pgsql
SELECT encode(sha512(convert_to(array_to_string(array_agg(encode(sha512(convert_to(concat_ws(' ',
    tid, "threadType", "stickyType", "topicType", "isGood", title, "authorUid", "postedAt", "latestReplyPostedAt", "latestReplierUid", "replyCount", "viewCount", "shareCount", "agreeCount", "disagreeCount", encode(zan, 'hex'), encode(geolocation, 'hex'), "authorPhoneType", "createdAt", "updatedAt", "lastSeenAt"
), 'utf8')), 'hex') ORDER BY tid), E'\n'), 'utf8')), 'hex') hash FROM tbmc_f97650_thread;
```

3.2. 然而在大表上两者最终都会耗尽内存而失败，并且需要
- 显式列出所有列名
- 给mysql表的blob列套`LOWER(HEX())`
- 在`GROUP_CONCAT()`时按照[PK](https://en.wikipedia.org/wiki/Primary_key)或[UK](https://en.wikipedia.org/wiki/Unique_key)排序行否则在pgsql中默认[`迫真假随机`序](https://stackoverflow.com/questions/60582206/what-is-the-default-select-order-in-postgresql-or-mysql)
- 导致难以针对100+个表`一键生成`

3.3. 理论上可以通过赋予纯SQL（无需臭名昭著的命令式[`存储过程`](https://en.wikipedia.org/wiki/Stored_procedure)或外部程序辅助）[图灵完备性](https://news.ycombinator.com/item?id=22964441) https://wiki.postgresql.org/wiki/Mandelbrot_set 导致其踏入了同一条[`Turing Tarpit`](https://news.ycombinator.com/item?id=10566855)之河流的[`rCTE`](https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL)来`LIMIT 10000`地每次hash的行并对`每10k行aggreate之hash`再aggreate地hash从而减少每行hash所aggreate的字符串长度
[mermaid.live](https://mermaid.live/edit#pako:eNrVVG2L2kAQ_ivLfqhX0JAX32KhYPU4DtpTokdpzSFrMpql627Ibs7mxP_eSUw8S0vpx3Y_hHlmn5l5MpPJkUYqBjqiW6EOUcIyQ5bTUBI8Ww4i1ivD4zCUJsmAxcsiBQTa8OhbUQOjUh7VNtd3SlV0bkTpYLlJVPZYpUiVNhCPDZqCGdAmgFQU8995OdQxWUmZqFyW988cDo2tUSo0gO0yuICY65_wC5P43IESKmKGK3mRNU-UhFp5hK_X6MjTmL1q0mYBIMfm6borpNP5eP_pfkkcGw_yJrOHyXi5_ry4aZFWm1iW9bbTeU8ydfh_wzKcxl9EobEqAQ6Fy50AIpTckUg_E8ElEG0ydJNNbkheEohOWQSEyZgcOI4B_UwWBHTEUni6pPxFPE7c7bi9fl17nTCd_Mvsunt_JFcII-6C2eN8fW70DTaYzILpbUA-fCG4fWRxOx8H4-UsIK03G63EO9lqJqCrDKuLhQUlHM6Nj0HwPccPuSK-dlY3Va-lGbYRUF3QNt1Dtmc8xr_CsYwKqUlgDyEdoRnDluXChDSUJ6TiJqlFISM6MlkObXrenSnHFWT7xpky-VWpa0hHR_qdjnzb8jzf8X3X73cHw6HTpgUdddxu13L6ttdze54z7Pld99SmL1UGx-o6ntfzhrY9GHh91-6ffgBfg4Xb)
```mermaid
flowchart TD
    fields[tid\nthreadType\nstickyType\ntopicType\nisGood\ntitle\nauthorUid\npostedAt\nlatestReplyPostedAt\nlatestReplierUid\nreplyCount\nviewCount\nshareCount\nagreeCount\ndisagreeCount\nzan\ngeolocation\nauthorPhoneType\ncreatedAt\nupdatedAt\nlastSeenAt]
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT rest\nCONCAT_WS(' ', ...)--> row
    row[row\nsingle long csv line string but using space and without any escape]
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT rest\nsha2-256--> row_hash
    row_hash --GROUP_CONCAT(... ORDER BY tid SEPARATOR '&bsol;n')--> rows_hash[rows_hash\nnew line delimited rows]
    rows_hash --sha2-256--> table_hash
```

4. https://stackoverflow.com/questions/59715303/is-there-any-way-join-tables-of-mysql-and-postgresql/59721228#59721228 中`某so纯路人`指出可以使用postgre社区[`牛逼哄哄`](https://z.n0099.net/#narrow/near/103088)的某知名企业[`EnterpriseDB`](https://en.wikipedia.org/wiki/EnterpriseDB)所开发的某pgsql扩展 https://github.com/EnterpriseDB/mysql_fdw 通过[foreign data wrapper](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)来在pgsql中将外部数据当做迫真[`物化视图`](https://en.wikipedia.org/wiki/Materialized_view)查询

4.1. 安装：
```bash
git clone https://github.com/EnterpriseDB/mysql_fdw &&
git checkout REL-2_9_1 &&
sudo apt install -y libmysqlclient-dev postgresql-server-dev-16 &&
export PATH=/usr/local/pgsql/bin/:$PATH &&
make USE_PGXS=1 &&
sudo make USE_PGXS=1 install &&
make USE_PGXS=1 installcheck &&
sudo apt purge -y postgresql-server-dev-16
```
清理：
```bash
sudo apt purge -y libmysqlclient-dev &&
sudo make USE_PGXS=1 uninstall &&
rm -rf mysql_fdw
```

4.2. 安装该pgsql扩展
```pgsql
ALTER USER tbm WITH SUPERUSER;
CREATE EXTENSION IF NOT EXISTS mysql_fdw;
ALTER USER tbm WITH NOSUPERUSER;
CREATE SERVER IF NOT EXISTS mysql
    FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '127.0.0.1', port '3306');
CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER -- 当前登录用户 也可以用PUBLIC使得用户名/密码对所有pgsql用户可见
    SERVER mysql
    OPTIONS (username 'username', password 'password');
```
清理：
```pgsql
DROP USER MAPPING IF EXISTS FOR CURRENT_USER SERVER mysql; -- FOR CURRENT_USER同上
DROP SCHEMA IF EXISTS mysql CASCADE;
DROP SERVER IF EXISTS mysql CASCADE;
DROP EXTENSION IF EXISTS mysql_fdw CASCADE;
DROP DATABASE contrib_regression; -- make USE_PGXS=1 installcheck所产生
```

4.3. 在pgsql中创建fdw表
```pgsql
CREATE SCHEMA IF NOT EXISTS mysql;
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmc_f97650_thread"()
    INHERITS (tbm."tbmc_f97650_thread")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmc_f97650_thread');
```
使用[`table inheritance`](https://www.postgresql.org/docs/current/ddl-inherit.html)是为了避免在`2.`已经`pgloader schema.load`后再写一遍fdw`表结构`，但这会导致

4.3.1. 被继承表`tbmc_f97650_thread`无法再直接`UPDATE`/`DELETE`返回`[HV005] ERROR: system attribute "tableoid" can't be fetched from remote relation`而必须加`ONLY` https://github.com/EnterpriseDB/mysql_fdw/issues/300 但仍然可以`INSERT`

4.3.1.1. 这个bug可能会在未来版本的`mysql_fdw`中修复但在这里并不重要，因为这种`迫真假锁全表`正好避免了在校验期间`UPDATE`/`DELETE`（但防不了`INSERT`）（默认[事务隔离级别](https://www.postgresql.org/docs/current/transaction-iso.html)是`READ COMMITTED`而非mysql默认的`REPEATABLE READ` https://github.com/n0099/open-tbm/issues/32#issuecomment-1404661123 当然您也可以改）了pgsql表导致`假阴性`

4.3.2. 由于[`table inheritance`](https://www.postgresql.org/docs/current/ddl-inherit.html)的本质：
```pgsql
SELECT * FROM tbm."tbmc_f97650_thread"
```
也会出现
```pgsql
SELECT * FROM mysql."tbmc_f97650_thread"
```
中的所有结果：也就是被继承表并集了所有继承表（在这里通过查询mysql），相当于
```pgsql
SELECT * FROM ONLY tbm."tbmc_f97650_thread" UNION ALL SELECT * FROM mysql."tbmc_f97650_thread"
```

5. 设有表`tbmi_ocr_box_Latn`结构：
mysql端：
```mysql
CREATE TABLE `tbmi_ocr_box_Latn` (
  `imageId` int unsigned NOT NULL,
  `frameIndex` int unsigned NOT NULL,
  `centerPointX` smallint unsigned NOT NULL,
  `centerPointY` smallint unsigned NOT NULL,
  `width` smallint unsigned NOT NULL,
  `height` smallint unsigned NOT NULL,
  `rotationDegrees` smallint unsigned NOT NULL,
  `recognizer` enum('TesseractHorizontal','TesseractVertical','PaddleOCRv4','PaddleOCRv3') NOT NULL,
  `confidence` tinyint unsigned NOT NULL,
  `text` text NOT NULL,
  PRIMARY KEY (`imageId`,`frameIndex`,`centerPointX`,`centerPointY`,`width`,`height`,`rotationDegrees`,`recognizer`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```
由pgloader在pgsql端按照`2.3.`中`mysql://`源下默认的`CASTING`所创建：
```pgsql
create table tbm."tbmi_ocr_box_Latn" (
  "imageId" bigint not null,
  "frameIndex" bigint not null,
  "centerPointX" integer not null,
  "centerPointY" integer not null,
  width integer not null,
  height integer not null,
  "rotationDegrees" integer not null,
  recognizer "tbmi_ocr_box_Latn_recognizer" not null,
  confidence smallint not null,
  text text not null,
  primary key ("imageId", "frameIndex", "centerPointX", "centerPointY", width, height, "rotationDegrees", recognizer)
);
```

5.1. 根据`某so纯路人`[symmdiff](https://github.com/n0099/tbclient.protobuf/issues/2#issuecomment-1417424277)table有4种写法：

5.1.1. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/15333054#15333054
```pgsql
-- symmetric difference FULL OUTER JOIN USING (...columns)
SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn" a FULL OUTER JOIN ONLY tbm."tbmi_ocr_box_Latn" b
  USING ("imageId", "frameIndex", "centerPointX", "centerPointY", "width", "height", "rotationDegrees", "recognizer", "confidence", "text")
WHERE (a."imageId" IS NULL AND a."frameIndex" IS NULL AND a."centerPointX" IS NULL AND a."centerPointY" IS NULL AND a."width" IS NULL AND a."height" IS NULL AND a."rotationDegrees" IS NULL AND a."recognizer" IS NULL AND a."confidence" IS NULL AND a."text" IS NULL)
   OR (b."imageId" IS NULL AND b."frameIndex" IS NULL AND b."centerPointX" IS NULL AND b."centerPointY" IS NULL AND b."width" IS NULL AND b."height" IS NULL AND b."rotationDegrees" IS NULL AND b."recognizer" IS NULL AND b."confidence" IS NULL AND b."text" IS NULL);
```
```c
1 row retrieved starting from 1 in 5 s 745 ms (execution: 5 s 698 ms, fetching: 47 ms)
```

5.1.1.1. `一键生成`
```pgsql
SELECT
    'SELECT COUNT(*) FROM mysql."' ||
    table_name ||
    '" a FULL OUTER JOIN ONLY tbm."' ||
    table_name ||
    E'" b \n  USING ("' ||
    string_agg(column_name, '", "' ORDER BY ordinal_position) ||
    E'")\nWHERE (a."' ||
    string_agg(column_name, '" IS NULL AND a."' ORDER BY ordinal_position) ||
    E'" IS NULL)\n   OR (b."' ||
    string_agg(column_name, '" IS NULL AND b."' ORDER BY ordinal_position) ||
    '" IS NULL);'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND table_name = 'tbmi_ocr_box_Latn'
GROUP BY table_name;
```

5.1.1.2. 也可以使用`NATURAL FULL OUTER JOIN ONLY` https://www.postgresql.org/docs/current/queries-table-expressions.html 省略`USING (...)`
> Finally, NATURAL is a shorthand form of USING: it forms a USING list consisting of all column names that appear in both input tables. As with USING, these columns appear only once in the output table. If there are no common column names, NATURAL JOIN behaves like JOIN ... ON TRUE, producing a cross-product join.

5.1.2. 但`5.1.1.`和`5.1.1.2.`并不能用于有列类型允许`NULL`之存在的表如`7.`中的`tbm_bilibiliVote`表有着
```mysql
[...]
  `authorExpGrade` int DEFAULT NULL,
  `voteBy` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `voteFor` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `replyContent` json DEFAULT NULL,
[...]
```
因为根据sql特色之 https://en.wikipedia.org/wiki/Three-valued_logic `NULL = NULL`是`UNKNOWN`而`UNKNOWN`转`bool`是`false`于是不会被`ON`/`WHERE`等clause视作相同 https://stackoverflow.com/questions/14366004/sql-server-join-missing-null-values

5.1.2.1. 因而需要将`5.1.1.`中的`USING (...columns)`展开成对所有列`=`operator以及额外的`OR (IS NULL AND IS NULL)`
```pgsql
SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON
    (a."pid" = b."pid") AND
    (a."authorUid" = b."authorUid") AND
    (a."authorExpGrade" = b."authorExpGrade" OR (a."authorExpGrade" IS NULL AND b."authorExpGrade" IS NULL)) AND
    (a."isValid" = b."isValid") AND
    (a."voteBy" = b."voteBy" OR (a."voteBy" IS NULL AND b."voteBy" IS NULL)) AND
    (a."voteFor" = b."voteFor" OR (a."voteFor" IS NULL AND b."voteFor" IS NULL)) AND
    (a."replyContent" = b."replyContent") AND
    (a."postTime" = b."postTime")
WHERE (a."pid" IS NULL AND a."authorUid" IS NULL AND a."authorExpGrade" IS NULL AND a."isValid" IS NULL AND a."voteBy" IS NULL AND a."voteFor" IS NULL AND a."replyContent" IS NULL AND a."postTime" IS NULL)
   OR (b."pid" IS NULL AND b."authorUid" IS NULL AND b."authorExpGrade" IS NULL AND b."isValid" IS NULL AND b."voteBy" IS NULL AND b."voteFor" IS NULL AND b."replyContent" IS NULL AND b."postTime" IS NULL);
```

5.1.2.1.1. 不对`NOT NULL`的列做额外判断是为了避免`7.1.`中的
> FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

5.1.2.2. `一键生成`
```pgsql
SELECT
    'SELECT COUNT(*) FROM mysql."' ||
    table_name ||
    '" a FULL OUTER JOIN ONLY tbm."' ||
    table_name ||
    E'" b ON\n' ||
    string_agg(
        '    (a."' ||
        column_name ||
        '" = b."' ||
        column_name ||
        CASE WHEN is_nullable::bool = true THEN
            '" OR (a."' ||
            column_name ||
            '" IS NULL AND b."' ||
            column_name ||
            '" IS NULL))'
        ELSE '")'
        END,
        E' AND\n' ORDER BY ordinal_position
    ) ||
    E'\nWHERE (a."' ||
    string_agg(column_name, '" IS NULL AND a."' ORDER BY ordinal_position) ||
    E'" IS NULL)\n   OR (b."' ||
    string_agg(column_name, '" IS NULL AND b."' ORDER BY ordinal_position) ||
    '" IS NULL);'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND table_name = 'tbm_bilibiliVote'
GROUP BY table_name;
```

5.1.3. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/49381589#49381589
```pgsql
-- symmetric difference FULL OUTER JOIN row(table)
SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn" a FULL OUTER JOIN ONLY tbm."tbmi_ocr_box_Latn" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
```
```c
1 row retrieved starting from 1 in 6 s 464 ms (execution: 6 s 345 ms, fetching: 119 ms)
```

5.1.3.1. 其中`row(t)`是 https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS

5.1.3.2. 这也避免了`5.1.2.`

5.1.4. https://stackoverflow.com/questions/6337871/how-can-i-speed-up-a-diff-between-tables
```pgsql
-- non symmetric difference EXCEPT mysql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM mysql."tbmi_ocr_box_Latn" b) t;
```
```c
1 row retrieved starting from 1 in 5 s 137 ms (execution: 4 s 903 ms, fetching: 234 ms)
```
由于`EXCEPT`不是[symmdiff](https://en.wikipedia.org/wiki/Symmetric_difference)之[`FULL OUTER JOIN`](https://stackoverflow.com/questions/3308122/how-do-i-decide-when-to-use-right-joins-left-joins-or-inner-joins-or-how-to-dete/3308153#3308153)而是`SELECT a.* FROM a LEFT OUTER JOIN b`所以只能执行两遍2x耗时正如同没有`FULL OUTER JOIN`可用的mysql人 https://stackoverflow.com/questions/4796872/how-can-i-do-a-full-outer-join-in-mysql
```pgsql
-- non symmetric difference EXCEPT tbm
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b) t;
```
```c
1 row retrieved starting from 1 in 5 s 793 ms (execution: 5 s 527 ms, fetching: 266 ms)
```

5.1.4.1. 由于未知bug
```pgsql
-- symmetric difference EXCEPT
SELECT COUNT(*) FROM (
    SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM mysql."tbmi_ocr_box_Latn" b
    UNION ALL
    SELECT * FROM mysql."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b
) t;
```
`UNION ALL`后始终为0行即便单独查询有行
```c
1 row retrieved starting from 1 in 8 s 678 ms (execution: 8 s 595 ms, fetching: 83 ms)
```

5.1.4.2. 这也避免了`5.1.2.`

5.1.5. 典型的误用`NOT EXISTS`
```pgsql
-- non symmetric difference WHERE NOT EXISTS mysql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM mysql."tbmi_ocr_box_Latn" b)) t;
```
```c
1 row retrieved starting from 1 in 764 ms (execution: 250 ms, fetching: 514 ms)
```
因为`SELECT * FROM mysql."tbmi_ocr_box_Latn"`在此显然返回`非0行`（因而只需要读一行`mysql."tbmi_ocr_box_Latn"`几百ms后就会立即返回）从而使得该predicate在非空表上恒真
```pgsql
-- non symmetric difference WHERE NOT EXISTS tbm
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b)) t;
```
```c
1 row retrieved starting from 1 in 516 ms (execution: 125 ms, fetching: 391 ms)
```

5.1.5.1. 由于未知bug
```pgsql
-- symmetric difference WHERE NOT EXISTS
SELECT COUNT(*) FROM (
    SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM mysql."tbmi_ocr_box_Latn" b)
    UNION ALL
    SELECT * FROM mysql."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b)
) t;
```
`UNION ALL`后始终为`0行`即便单独查询有`非0行`
```c
1 row retrieved starting from 1 in 776 ms (execution: 203 ms, fetching: 573 ms)
```

5.1.6. 除`5.1.5.`外的3种执行期间均需pgsql消耗`1x表大小`的内存（总比`3.2.`好）和`3x表大小`的`临时表`存储
![image](https://github.com/n0099/open-tbm/assets/13030387/684000e3-2723-4355-a858-197fb8fde659)
![image](https://github.com/n0099/open-tbm/assets/13030387/58664dde-dbbc-4f99-a7e4-3920ed9d8479)
![image](https://github.com/n0099/open-tbm/assets/13030387/b957a417-eb71-4134-89e2-97573965f414)
![image](https://github.com/n0099/open-tbm/assets/13030387/63384df1-51b6-4bb5-8fa6-8b5a1a35113f)

5.1.6.1. 可以修改`临时表空间`在fs上的默认路径`/var/lib/postgresql/16/main/base/pgsql_tmp` https://dba.stackexchange.com/questions/170661/in-postgres-how-do-i-adjust-the-pgsql-tmp-setting/170665#170665
```bash
tmp={目的路径} &&
mkdir -p $tmp /PG_16_202307071 &&
chown -R postgres: $tmp &&
chmod -R 700 $tmp
```
```pgsql
create tablespace temp_tablespace owner postgres location '{目的路径}';
grant create on tablespace temp_tablespace to public;
alter system set temp_tablespaces = temp_tablespace;
select pg_reload_conf();
create temp table a(b text);
```

5.2. 总行数对比
```pgsql
SELECT a, b, b - a diff FROM (SELECT (SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn") a, (SELECT COUNT(*) FROM ONLY tbm."tbmi_ocr_box_Latn") b) t;
   a    |   b    | diff
--------+--------+--------
 472655 | 472655 | 0
(1 row)
```

6. 令文件`compare.sql`有在mysql中[codegen](https://en.wikipedia.org/wiki/Code_generation)出基于`5.1.3.`、`4.3.`和`5.2.`校验用的`一键生成`pgsql
```mysql
SELECT CONCAT(
    'CREATE FOREIGN TABLE IF NOT EXISTS mysql."',
    TABLE_NAME,
    '"() INHERITS (tbm."',
    TABLE_NAME,
    '") SERVER mysql OPTIONS (dbname \'tbm\', table_name \'',
    TABLE_NAME,
    '\');\n',
    'SELECT a, b, b - a diff FROM (SELECT (SELECT COUNT(*) FROM mysql."',
    TABLE_NAME,
    '") a, (SELECT COUNT(*) FROM ONLY tbm."',
    TABLE_NAME,
    '") b) t;\n',
    'SELECT COUNT(*) FROM mysql."',
    TABLE_NAME,
    '" a FULL OUTER JOIN ONLY tbm."',
    TABLE_NAME,
    '" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;'
)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'tbm'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;
```
REGEXP部分同`2.1.`

6.1. 一键bash
```bash
cd pgloader && # 1.1.1.中git clone的pgloader目录路径
(
    /usr/bin/time -v build/bin/pgloader schema.load &&
    /usr/bin/time -v build/bin/pgloader data.load &&
    PGPASSWORD=password /usr/bin/time -v psql -aUusername tbm < compare.sql
) 2>&1 | tee -a stdout+err
```

7. 设有表`tbm_bilibiliVote`结构 https://github.com/n0099/bilibiliVote
mysql端：
```mysql
CREATE TABLE `tbm_bilibiliVote` (
  `pid` bigint NOT NULL,
  `authorUid` bigint NOT NULL,
  `authorExpGrade` int DEFAULT NULL,
  `isValid` tinyint(1) NOT NULL,
  `voteBy` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `voteFor` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `replyContent` json DEFAULT NULL,
  `postTime` datetime NOT NULL,
  PRIMARY KEY (`pid`),
  KEY `authorUid` (`authorUid`),
  KEY `isVaild` (`isValid`) USING BTREE,
  KEY `voteFor` (`voteFor`(100),`isValid`) USING BTREE,
  KEY `postTime` (`postTime`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```
由pgloader在pgsql端按照`2.3.`中`mysql://`源下默认的`CASTING`所创建：
```pgsql
create table tbm."tbm_bilibiliVote" (
  pid bigint primary key not null,
  "authorUid" bigint not null,
  "authorExpGrade" integer,
  "isValid" boolean not null,
  "voteBy" text,
  "voteFor" text,
  "replyContent" json,
  "postTime" timestamp with time zone not null
);
create index "idx_163841_authorUid" on "tbm_bilibiliVote" using btree ("authorUid");
create index "idx_163841_isVaild" on "tbm_bilibiliVote" using btree ("isValid");
create index "idx_163841_postTime" on "tbm_bilibiliVote" using btree ("postTime");
create index "idx_163841_voteFor" on "tbm_bilibiliVote" using btree ("voteFor", "isValid");
```

7.1. 使用`5.1.3.`中的`symmetric difference FULL OUTER JOIN row(table)`
```pgsql
SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
```
可得 https://stackoverflow.com/questions/47405732/why-does-postgresql-throw-full-join-is-only-supported-with-merge-joinable-or-ha
```c
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
```

7.1.1. 使用`5.1.5.`中的`non symmetric difference EXCEPT`
```pgsql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbm_bilibiliVote" a EXCEPT SELECT * FROM mysql."tbm_bilibiliVote" b) t;
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbm_bilibiliVote" a EXCEPT SELECT * FROM ONLY tbm."tbm_bilibiliVote" b) t;
```
可得 https://stackoverflow.com/questions/48420438/could-not-identify-an-equality-operator-for-type-json-when-using-distinct
```c
ERROR: could not identify an equality operator for type json
```

7.1.2. 执行`5.1.2.1.`
```pgsql
SELECT * FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON
    a."authorUid" = b."authorUid" AND
    a."isValid" = b."isValid" AND
    a."pid" = b."pid" AND
    a."postTime" = b."postTime" AND
    a."replyContent" = b."replyContent" AND
    a."voteBy" = b."voteBy" AND
    a."voteFor" = b."voteFor" AND
    a."authorExpGrade" = b."authorExpGrade"
WHERE (a."pid" IS NULL AND a."authorUid" IS NULL AND a."authorExpGrade" IS NULL AND a."isValid" IS NULL AND a."voteBy" IS NULL AND a."voteFor" IS NULL AND a."replyContent" IS NULL AND a."postTime" IS NULL)
   OR (b."pid" IS NULL AND b."authorUid" IS NULL AND b."authorExpGrade" IS NULL AND b."isValid" IS NULL AND b."voteBy" IS NULL AND b."voteFor" IS NULL AND b."replyContent" IS NULL AND b."postTime" IS NULL);
```
会有 https://stackoverflow.com/questions/32843213/operator-does-not-exist-json-json
```c
ERROR: operator does not exist: json = json
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
```

7.1.3. `7.`中可见列`replyContent`是`json`类型
```diff
-    a."replyContent" = b."replyContent" AND
+    a."replyContent"::jsonb = b."replyContent"::jsonb AND
```
| count |
| :---- |
| 29510 |

这恰好等于`5.2.`
```pgsql
SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" UNION ALL SELECT COUNT(*) FROM ONLY tbm."tbm_bilibiliVote";
```
| count |
| :---- |
| 14755 |
| 14755 |

之和

7.1.4. 但即便是`json::text`
```diff
-    a."replyContent" = b."replyContent" AND
+    a."replyContent"::text = b."replyContent"::text AND
```
也仍然相同

7.1.5. 可以在mysql端将该`json`列转`text`
```mysql
ALTER TABLE tbm_bilibiliVote MODIFY replyContent TEXT NOT NULL;
```
后再重新按照`2.2.`仅导入该表（`INCLUDING ONLY TABLE NAMES MATCHING 'tbm_bilibiliVote'`）

7.2. 但即便两端的列`replyContent`重新导入后均为`text`类型也仍然有`所有行`不同
实际上是由于列`postTime`mysql`无时区类型datetime`由`2.3.`中的pgloader`mysql://`默认的`CASTING`
```
type datetime to timestamptz
```
转为pgsql`有时区类型timestamptz`时由pgloader根据`系统时区` `UTC+8`而非`pgsql时区` `UTC+0`转换为`UTC`
```pgsql
SELECT a.pid, a."postTime", b."postTime" FROM mysql."tbm_bilibiliVote" a JOIN ONLY tbm."tbm_bilibiliVote" b ON a.pid = b.pid ORDER BY pid;
```
| pid | postTime | postTime |
| :--- | :--- | :--- |
| 124498114574 | 2019-03-10 12:38:17.000000 +00:00 | 2019-03-10 04:38:17.000000 +00:00 |
| 124498200265 | 2019-03-10 12:44:12.000000 +00:00 | 2019-03-10 04:44:12.000000 +00:00 |
| ... |

7.2.1. 由于`timestamptz`实际上并没有额外存储`时区UTC offset` https://stackoverflow.com/questions/5876218/difference-between-timestamps-with-without-time-zone-in-postgresql
因而其并不能解决 https://z.n0099.net/#narrow/near/86397 https://old.reddit.com/r/PostgreSQL/comments/xpygbh/when_would_i_ever_use_timestamp_over_timestamptz/ https://news.ycombinator.com/item?id=20212671 中争论的`带时区引用未来指定datetime`问题
https://www.w3.org/International/wiki/WorkingWithTimeZones#Past_and_Future_Events
> The SQL data types 'date', 'time', and 'timestamp' are field based time values which are intended to be zone offset independent: they are actually, technically, floating time values! The data type 'timestamp with time zone' is the zone offset-dependent equivalent of 'timestamp' in SQL.

https://www.w3.org/International/wiki/WorkingWithTimeZones#Floating_Time
> - Floating Time
>   Some observed time values are not related to a specific moment in incremental time. Instead, they need to be combined with local information to determine a range of acceptable incremental time values. We refer to these sorts of time values as "floating times". Floating times are not attached and should never be attached to a particular time zone.
>   Examples of floating time events include a user’s birth date, an employee’s hire or termination date, or a list of company holidays.
>   For example, suppose that January 1st is considered a holiday in your application. The day "January 1" has a floating-time status as a "holiday". That "day" can begin as early as midnight GMT-14:00 and end as late as midnight of January 2 GMT+12:00, depending on local time. This covers an incremental time range of fifty hours.

再叠加mysql_fdw不论当前`pgsql时区` https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES 是什么都不会在将从mysql读到的`datetime`转换为`timestamptz`时根据时区重新计算时间而是只修改offset（也就是使用跟mysql类型`datetime`语义相同的pgsql类型`timestamp`的语义）
```pgsql
SET SESSION TIME ZONE 'Asia/Shanghai';
SELECT a.pid, a."postTime", b."postTime" FROM mysql."tbm_bilibiliVote" a JOIN ONLY tbm."tbm_bilibiliVote" b ON a.pid = b.pid ORDER BY pid;
```
| pid | postTime | postTime |
| :--- | :--- | :--- |
| 124498114574 | 2019-03-10 12:38:17.000000 +08:00 | 2019-03-10 12:38:17.000000 +08:00 |
| 124498200265 | 2019-03-10 12:44:12.000000 +08:00 | 2019-03-10 12:44:12.000000 +08:00 |

因而只需要修改`pgsql时区`
```pgsql
SET SESSION TIME ZONE 'Asia/Shanghai';
SELECT COUNT(*) FROM ONLY mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
```
便校验一致

7.2.2. 如果执行了`7.1.5.`需再在pgsql端将该列类型改回`json`或`jsonb`
```pgsql
DROP FOREIGN TABLE mysql."tbm_bilibiliVote"; -- 避免`ALTER TABLE`由于`4.3.2.`而返回假阳性之`ERROR: "tbm_bilibiliVote" is not a table`
ALTER TABLE tbm."tbm_bilibiliVote" ALTER "replyContent" TYPE jsonb USING "replyContent"::jsonb;
-- 修改`6.`中的`TABLE_NAME REGEXP '...'`predicate为`TABLE_NAME = 'tbm_bilibiliVote'`并eval其输出以重新执行校验
VACUUM FULL tbm."tbm_bilibiliVote";
```

8. 设有表`tbmi_metadata_gif`结构：
```mysql
CREATE TABLE `tbmi_metadata_gif` (
  `imageId` int unsigned NOT NULL,
  `repeatCount` smallint unsigned NOT NULL,
  `colorTableMode` enum('Global','Local') NOT NULL,
  `globalColorTableLength` int NOT NULL,
  `comments` json DEFAULT NULL,
  PRIMARY KEY (`imageId`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```
由pgloader在pgsql端按照`2.3.`中`mysql://`源下默认的`CASTING`所创建：
```pgsql
create table tbm.tbmi_metadata_gif (
  "imageId" bigint primary key not null,
  "repeatCount" integer not null,
  "colorTableMode" "tbmi_metadata_gif_colorTableMode" not null,
  "globalColorTableLength" integer not null,
  comments json
);
```

8.1. 按照`7.1.3.`
```pgsql
SELECT COUNT(*) FROM mysql."tbmi_metadata_gif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_gif" b ON
    (a."imageId" = b."imageId") AND
    (a."repeatCount" = b."repeatCount") AND
    (a."colorTableMode" = b."colorTableMode") AND
    (a."globalColorTableLength" = b."globalColorTableLength") AND
    (a."comments"::jsonb = b."comments"::jsonb OR (a."comments" IS NULL AND b."comments" IS NULL))
WHERE (a."imageId" IS NULL AND a."repeatCount" IS NULL AND a."colorTableMode" IS NULL AND a."globalColorTableLength" IS NULL AND a."comments" IS NULL)
   OR (b."imageId" IS NULL AND b."repeatCount" IS NULL AND b."colorTableMode" IS NULL AND b."globalColorTableLength" IS NULL AND b."comments" IS NULL);
```
有
```c
ERROR: unsupported Unicode escape sequence
Detail: \u0000 cannot be converted to text.
Where: JSON data, line 1: ...                   ", "Created by TechSmith\u0000...
```

8.1.1. 但按照`7.1.4.`就无差异
```diff
-    (a."comments"::jsonb = b."comments"::jsonb OR (a."comments" IS NULL AND b."comments" IS NULL))
+    (a."comments"::text = b."comments"::text OR (a."comments" IS NULL AND b."comments" IS NULL))
```

8.1.2. 这是由于`jsonb`类型不同于本质`text`的`json`，会将json解析后存储为pgsql内置类型之集，再由于`text`类型受`c人最爱的` https://en.wikipedia.org/wiki/Null-terminated_string 影响不支持存储`256个字节`中的唯一一个`0x00`导致`jsonb`也无法存储`\u0000`转义后的`0x00`进其内部的`text` https://github.com/turbot/steampipe-postgres-fdw/issues/118
https://www.postgresql.org/docs/current/datatype-json.html
> The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type)

> Table 8.23. JSON Primitive Types and Corresponding PostgreSQL Types
> | JSON primitive type | PostgreSQL type | Notes |
> |-|-|-|
> | string | text | \u0000 is disallowed, as are Unicode escapes representing characters not available in the database > encoding |
> | number | numeric | NaN and infinity values are disallowed |
> | boolean | boolean | Only lowercase true and false spellings are accepted |
> | null | (none) | SQL NULL is a different concept |

8.2. 该表该列中共有如下`\u0000` https://codepoints.net/U+0000
```pgsql
SELECT COUNT(*) count, comments::text c
FROM ONLY tbm."tbmi_metadata_gif"
WHERE comments::text LIKE '%\u0000%'
GROUP BY c
ORDER BY count DESC;
```
| count | c |
| :--- | :--- |
| 31 | `["                             ", "Created by TechSmith\u0000"]` |
| 2 | `["tvc\u0000"]` |
| 1 | `["Author:\t\tGod\_job\_dave\u0000\u0000\u0000???\u0001\u0000\u0000\u0000\u0000P??\u0001\u0000"]` |
| 1 | `["Optimized by Ulead SmartSaver!\u0000"]` |
| 1 | `["CREATOR: gd-jpeg v1.0 $using IJG JPEG v62$, quality = 75\n\u0000"]` |

8.2.1. 使用如下3层如同`6.`中[codegen](https://en.wikipedia.org/wiki/Code_generation)迫真[`元编程`](https://en.wikipedia.org/wiki/Metaprogramming)自我复制的[`quine病毒`](https://en.wikipedia.org/wiki/Quine_(computing))可在mysql端检查所有含有`0x00`的`text`类型及其各个长度变种 https://stackoverflow.com/questions/13932750/tinytext-text-mediumtext-and-longtext-maximum-storage-sizes 列并生成`ALTER`列类型为`BLOB`（pgloader在pgsql端按照`2.3.`中`mysql://`源下默认的`CASTING`会转换为可存储所有`256个字节`的`bytea`）和`SELECT`查阅的`一键生成`mysql：
```mysql
SET SESSION group_concat_max_len = 18446744073709551615; -- 同`3.1.`
SELECT CONCAT(
'SELECT t.*, CONCAT(
    \'ALTER TABLE tbm.`\',
    t,
    \'` MODIFY `\',
    c,
    \'` \',
    REPLACE(COLUMN_TYPE, \'text\', \'blob\'),
    CASE WHEN COLUMNS.IS_NULLABLE = \'true\' THEN \' NULL;\'
    ELSE \' NOT NULL;\'
    END
) `alter`, CONCAT(
    \'SELECT * FROM tbm.`\',
    t,
    \'` WHERE CAST(`\',
    c,
    \'` AS BINARY) LIKE CONCAT(\'\'%\'\', 0x00, \'\'%\'\');\'
) `select`
FROM (\n',
    GROUP_CONCAT(
        '    SELECT COUNT(*) count, \'',
        TABLE_NAME,
        '\' t, \'',
        COLUMN_NAME,
        '\' c FROM `',
        TABLE_NAME,
        '` WHERE CAST(`',
        COLUMN_NAME,
        '` AS BINARY) LIKE CONCAT(\'%\', 0x00, \'%\')'
        SEPARATOR ' UNION ALL\n'
    ),
    '\n) t
JOIN information_schema.COLUMNS
  ON count != 0 AND TABLE_SCHEMA = \'tbm\'
 AND TABLE_NAME = t AND COLUMN_NAME = c;'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND COLUMN_TYPE LIKE '%text%'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;
```
REGEXP部分同`2.1.`
| count | t | c | alter | select |
|-|-|-|-|-|
| 4321  | tbmi_metadata_embedded_exif | userComment | ``ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `userComment` BLOB NOT NULL;`` | ``SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`userComment` AS BINARY) LIKE CONCAT('%', 0x00, '%');`` |
| 2612  | tbmi_metadata_embedded_exif | xpAuthor    | ``ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `xpAuthor` BLOB NOT NULL;`` | ``SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`xpAuthor` AS BINARY) LIKE CONCAT('%', 0x00, '%');`` |
```c
0.00s user 0.02s system 0% cpu 42:49.75 total
```

8.3. 值得注意的是`2.2.`导入 https://github.com/dimitri/pgloader/issues/1573 不会有截断`0x00`后字节的`WARNING`且`6.`校验 https://github.com/EnterpriseDB/mysql_fdw/issues/299 过程中也无法发现以`0x00`结尾`text`中的最后一个`0x00`字节消失

9. 设有表`tbmi_hash`结构：
```mysql
CREATE TABLE `tbmi_hash` (
  `imageId` int unsigned NOT NULL,
  `frameIndex` int unsigned NOT NULL,
  `pHash` binary(8) NOT NULL,
  `averageHash` bigint unsigned NOT NULL,
  `blockMeanHash` binary(32) NOT NULL,
  `marrHildrethHash` binary(72) NOT NULL,
  `thumbHash` tinyblob NOT NULL,
  PRIMARY KEY (`imageId`,`frameIndex`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```
由pgloader在pgsql端按照`2.3.`中`mysql://`源下默认的`CASTING`所创建：
```pgsql
create table tbm.tbmi_hash (
  "imageId" bigint not null,
  "frameIndex" bigint not null,
  "pHash" bytea not null,
  "averageHash" bigint not null,
  "blockMeanHash" bytea not null,
  "marrHildrethHash" bytea not null,
  "thumbHash" bytea not null,
  primary key ("imageId", "frameIndex")
);
```

9.1. 由于pgsql没有mysql特有（`isosql`不要求）带`unsigned`修饰的的int类型 https://stackoverflow.com/questions/20810134/why-unsigned-integer-is-not-available-in-postgresql/59802732#59802732 正如同`m$ft精神mvp`最爱的`中小实体企业c#工控上位机人上壬` https://z.n0099.net/#narrow/near/94726 除非使用扩展 https://github.com/petere/pguint 因而
```diff
-  `averageHash` bigint unsigned NOT NULL,
+  "averageHash" bigint not null,
```
导致该列上限值从 $2^{64} = 18446744073709551615$ 降到 $2^{63} = 9223372036854775807$

9.2. 实际上所谓的`opencv_imghash` https://github.com/Starry-OvO/aiotieba/pull/63#discussion_r1056941392 之`averageHash` https://github.com/n0099/open-tbm/blob/252710bf61ca63cc78eeb6373b4c564e8dbb0217/c%23/imagePipeline/src/Consumer/HashConsumer.cs#L19 本质上更类似于 https://en.wikipedia.org/wiki/Nominal_number 因而不应该使用具有`可运算的正整数`语义之类型存储而应直接视作`斑点二进制` https://en.wikipedia.org/wiki/Binary_large_object
```mysql
ALTER TABLE tbmi_hash ADD COLUMN averageHash_2 BINARY(8) NOT NULL AFTER averageHash;
UPDATE tbmi_hash SET averageHash_2 = UNHEX(LPAD(HEX(averageHash), 16, '00')); -- https://stackoverflow.com/questions/58611436/how-do-you-convert-an-integer-into-its-condensed-binary-equivalent-e-g-binary3/58657797#comment138021380_58657797
SELECT COUNT(*) FROM tbmi_hash WHERE averageHash_2 != UNHEX(LPAD(HEX(averageHash), 16, '00'));
-- expecting 0 rows returned, then:
-- ALTER TABLE tbmi_hash DROP COLUMN averageHash, RENAME COLUMN averageHash_2 TO averageHash;
```

9.3. `一键生成`找出所有存在值 $> 2^{63}$ 的`bigint unsigned`列的mysql
```mysql
SET SESSION group_concat_max_len = 18446744073709551615; -- 同`3.1.`
SELECT CONCAT(
    'SELECT * FROM (\n',
    GROUP_CONCAT(
        '    SELECT COUNT(*) c, \'',
        TABLE_NAME,
        '\' t FROM `',
        TABLE_NAME,
        '` WHERE `',
        COLUMN_NAME,
        '` > 9223372036854775807'
        SEPARATOR ' UNION ALL\n'
    ),
    '\n) t WHERE c != 0;'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND COLUMN_TYPE = 'bigint unsigned'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;
```
REGEXP部分同`2.1.`

9.4.
```pgsql
DROP TABLE tbm."tbmi_hash" CASCADE;
```
类比`7.1.5.`重新导入该表
```pgsql
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_hash"()
    INHERITS (tbm."tbmi_hash")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmi_hash');
```

# Comment is too long (maximum is 65536 characters)

count	c
31	`[" ", "Created by TechSmith\u0000"]`
2	`["tvc\u0000"]`
1	`["Author:\t\tGod\_job\_dave\u0000\u0000\u0000???\u0001\u0000\u0000\u0000\u0000P??\u0001\u0000"]`
1	`["Optimized by Ulead SmartSaver!\u0000"]`
1	`["CREATOR: gd-jpeg v1.0 \(using IJG JPEG v62\), quality = 75\n\u0000"]`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

使用pgloader将mysql迁移至pgsql并使用mysql_fdw校验表数据 #48

Comment is too long (maximum is 65536 characters)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

pid	postTime	postTime
124498114574	2019-03-10 12:38:17.000000 +00:00	2019-03-10 04:38:17.000000 +00:00
124498200265	2019-03-10 12:44:12.000000 +00:00	2019-03-10 04:44:12.000000 +00:00
...

pid	postTime	postTime
124498114574	2019-03-10 12:38:17.000000 +08:00	2019-03-10 12:38:17.000000 +08:00
124498200265	2019-03-10 12:44:12.000000 +08:00	2019-03-10 12:44:12.000000 +08:00

JSON primitive type	PostgreSQL type	Notes
string	text	\u0000 is disallowed, as are Unicode escapes representing characters not available in the database > encoding
number	numeric	NaN and infinity values are disallowed
boolean	boolean	Only lowercase true and false spellings are accepted
null	(none)	SQL NULL is a different concept

count	t	c	alter	select
4321	tbmi_metadata_embedded_exif	userComment	ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `userComment` BLOB NOT NULL;	SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`userComment` AS BINARY) LIKE CONCAT('%', 0x00, '%');
2612	tbmi_metadata_embedded_exif	xpAuthor	ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `xpAuthor` BLOB NOT NULL;	SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`xpAuthor` AS BINARY) LIKE CONCAT('%', 0x00, '%');