This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sqlite can end up with valid utf8 sequences which describe invalid codepoints, which break synapse_port_db #3538

Open
@ara4n

Description

Error ends up looking like:

2018-07-15 22:54:50,837 - synapse.metrics - 256 - INFO - Collecting gc 0
2018-07-15 22:54:50,936 - synapse_port_db - 562 - ERROR -
Traceback (most recent call last):
File "/usr/bin/synapse_port_db", line 552, in run
consumeErrors=True,
FirstError: FirstError[#16, [Failure instance: Traceback: <class 'psycopg2.DataError'>: invalid byte sequence for encoding "UTF8": 0xed 0xb3 0xb6

/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:434:errback
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:501:_startRunCallbacks
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:588:_runCallbacks
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1184:gotResult
--- <exception caught here> ---
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1126:_inlineCallbacks
/usr/lib/python2.7/dist-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/bin/synapse_port_db:269:handle_table
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1126:_inlineCallbacks
/usr/lib/python2.7/dist-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/bin/synapse_port_db:428:handle_search_table
/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py:246:inContext
/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py:262:<lambda>
/usr/lib/python2.7/dist-packages/twisted/python/context.py:118:callWithContext
/usr/lib/python2.7/dist-packages/twisted/python/context.py:81:callWithContext
/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py:298:_runWithConnection
/usr/bin/synapse_port_db:138:r
/usr/bin/synapse_port_db:415:insert
/usr/lib/python2.7/dist-packages/synapse/storage/_base.py:90:executemany
/usr/lib/python2.7/dist-packages/synapse/storage/_base.py:117:_do_execute
]]

in the event_search logic.

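It turns out that 0xed 0xb3 0xb6 has the shape of a valid three-byte utf8 sequence, but it describes \uDCF6, a lone low surrogate rather than a valid defined codepoint, and postgres barfs on it when you try to insert it.

Python2, however, doesn't recognise anything invalid about it: its utf8 codec decodes and encodes surrogates without complaint.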
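
For anyone wanting to check the byte arithmetic, here is a small sketch. It assumes Python 3 (not the Python 2.7 in the traceback), whose strict UTF-8 decoder rejects encoded surrogates much as postgres does. The `surrogatepass` error handler is used here only to imitate Python 2's lenient behaviour.

```python
# The three bytes from the psycopg2.DataError message.
raw = bytes([0xED, 0xB3, 0xB6])

# Unpack the three-byte UTF-8 layout by hand: 1110xxxx 10xxxxxx 10xxxxxx
code_point = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)

# Low surrogates occupy U+DC00..U+DFFF and are not valid on their own.
is_low_surrogate = 0xDC00 <= code_point <= 0xDFFF

# Python 3's strict decoder refuses the sequence.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" lets the surrogate through, imitating Python 2.
lenient = raw.decode("utf-8", "surrogatepass")

print(hex(code_point), is_low_surrogate, strict_ok, repr(lenient))
```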

The workaround in the end was to use iconv_codecs to run the string through iconv and strip the invalid codepoints before handing it to postgres, with something like:

row["value"].encode("iconv:utf8", "ignore").decode("utf8")

This seemed to work on linux, but fails on macOS.
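
As a possible platform-independent alternative, the same stripping can be sketched without iconv. This is not what was used here and is only an illustration: it assumes Python 3, where encoding to UTF-8 with the `ignore` handler drops lone surrogates (on Python 2 the encoder would pass them through unchanged). The helper name `strip_surrogates` and the sample `row` are hypothetical.

```python
def strip_surrogates(text):
    """Drop code points that strict UTF-8 cannot encode (lone surrogates)."""
    # Python 3 only: the "ignore" handler silently discards the surrogate.
    return text.encode("utf-8", "ignore").decode("utf-8")


# Hypothetical row standing in for an event_search entry.
row = {"value": "caf\u00e9 \udcf6 ok"}
cleaned = strip_surrogates(row["value"])

# The result now round-trips through strict UTF-8.
cleaned.encode("utf-8")
```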

Thanks to @flux:matrix.org for reporting and debugging this!

The original cause of the bad data is #3537

Metadata

Assignees

No one assigned

    Labels

    A-Database: DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db
    O-Uncommon: Most users are unlikely to come across this or unexpected workflow
    S-Minor: Blocks non-critical functionality, workarounds exist.
    T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
